Measures of lexical similarity

Introduction

As a scholar of Papuan languages, I am often overwhelmed by the sheer number of languages in that area. There is so much research to be done for these approximately 800 languages and the clock is constantly ticking, since a great deal of languages are endangered, giving way for the national languages Indonesian and Tok Pisin. If only we could perform our routine tasks more efficiently, we could invest the saved time in new languages and variaties. Luckily, we live in the age of AI and computer programming, so chances are good that some of the repetitive tasks can be delegated easily. In this article, I’d like to introduce some ways to compare wordlists that one has collected in the field. This comes in handy if you have one of the following goals:

Compare wordlists of several dialects to find out what proportion of the vocabulary is shared
Compare the wordlist you collect with lists from someone else, e.g. from legacy material from a missionary or government official
Compare wordlists from related languages to learn reconstruct proto-forms
Compare wordlists from neighbouring langauges to learn more about borrowing and language contact
and many more

In this article, I present you some tinkering I have done and give you some Python code so you can try it yourself. The source code of the project can be found here.

Getting started

So of course, we want to delegate uninteresting parts of the work to the computer. In my last three field trips, I have collected four wordlists from neighboring languages with the shortest having around 2.000 entries. I really don’t want to iterate them manually to find out what are possible cognates/borrowings. The later distinction (cognate vs. borrowing) depends on whether the languages are related but of course in many scenarios we cannot be sure about genealogical relationsships and/or possible contact scenarios. That’s why we start from scratch by comparing simple wordlists. But how can we do that?

What we need first is a good measure of similarity that can tell us whether two word forms are sufficiently similar to be possibly related. We’d like to compare each word of list 1 with each word of list 2. Possible candidates for cognates shall be scrutinized manually by the linguist. So what is a good measure of similarity?

I remember having read about the Dice-coefficient somewhere but other than the name nothing is left in my memory. So I asked the AI-search engine Perplexity.ai that came up with the following list:

Dice-coefficient
Levenshtein distance
Jaccard similarity
Cosine similarity
Hamming distance
Damerau-Levenshtein distance

A little extra research on Wikipedia taught me that we are dealing with string metrics here. As for a start, I decided to implement two algorithms that sound simplest to me. Or rather, to save some time, having ChatGPT implement them for me. Here’s my first prompt (I prompted the AI in german, sry):

Ich möchte brauche einen Algorithmus, der mir hilft zwei Wörter zu vergleichen. Zum Beispiel, wie weit unterscheiden sich die beiden Strings “Hallo” und “Holla”. Ein geeignetes Maß dafür ist der Dice-Koeffizient. Bitte schreibe mit eine Python-Funktion, die den Dice-Koeffizienten für zwei strings ausrechnet.

Dice-coefficient

And here is what came back from ChatGPT:

def dice_coefficient(string1, string2):
    # Erstellen von Mengen aus den einzelnen Zeichen der Strings
    set1 = set(string1)
    set2 = set(string2)
    
    # Berechnung des Schnitts der beiden Mengen
    intersection = set1 & set2
    
    # Berechnung der Längen der beiden Mengen
    len1 = len(set1)
    len2 = len(set2)
    
    # Berechnung des Dice-Koeffizienten
    dice_coeff = (2 * len(intersection)) / (len1 + len2)
    
    return dice_coeff

Looks good to me. The number this function returns is a number between 0.0 (no similarity) and 1.0 (perfect match). However, there are two problems with that implementation:

Since the strings are converted to sets at the beginning of the function, we loose the ordering of the characters. This way, “Holla” and “Hallo” are equal, as are both compared to “llaHo”.
The proposed function is case sensitive, which means that “Hallo” not the same as “hallo”.

The second problem can be easily solved by converting the strings to lower cases:

    set1 = set(string1.lower())
    set2 = set(string2.lower())

However, the second problem requires more work. So I go back to ChatGPT, explain the problem and ask for a revised function. ChatGPT sure is not a perfect programmer and it takes four rounds and a small human bugfix, but eventually we have a revised algorithm:

def compareWordForms(self, word_form_1, word_form_2):
        # Berechnung des Schnitts der beiden Mengen unter Berücksichtigung der Reihenfolge
        intersection = [char1 for char1, char2 in zip(word_form_1, word_form_2) if char1 == char2]

        common_chars = len(intersection)

        len1 = len(word_form_1)
        len2 = len(word_form_2)

        # Berechnung des Dice-Koeffizienten
        dice_coeff = (2 * common_chars) / (len1 + len2)

        return dice_coeff

As you see, the intersection variable is filled with all the characters that the two strings have in common. The strings are iterated from beginning to end so that character ordering is of the essence now.

A few testruns give us the following results:

Dice-coefficient ('Hallo', 'Hallo'): 1.0
Dice-coefficient ('Hollo', 'Hallo'): 0.8
Dice-coefficient ('Holla', 'Hallo'): 0.6
Dice-coefficient ('Hallo', 'llHao'): 0.2

Not bad for a first try. Basically, the Dice-coefficient tells us what percentage of characters between two strings are exactly matching. And this can be a usefull measure in practice. Let’s say we have lists from two related languages of which one has undergone a vowel shift from [o] to [a]. So in our collected data we find the words ambon (language 1) and amban (language 2). The Dice-coefficient we get from our algorithm above is 0.8 - quite similar.

Of course, there are many scenarios, in which this implementation of the Dice-coefficient is less usable for our purposes. The most obvious is when a character was deleted or inserted. To continue our example from above: Let’s say a third language has not shifted the vowels but deleted the consonant [b] instead, resulting in the form amon (language 3). Here are the results:

Dice-coefficient ('ambon', 'amon'): 0.444444444
Dice-coefficient ('amban', 'amon'): 0.444444444

These are quite low values if we consider that ambon and amon only differ in one single character. Why is this? Our function correctly compares the first two characters: [a] and [m]. But then it runs into troubles. Since the first word includes [b], it compares [b] and [o] and subsequently, the comparison is shifted by one character ([o] vs. [n] and then [n] vs. nothing). It does not recognize that the final two characters are equal. This is a major disadvantage of the Dice-coeffictient and we feel that it’s time to move on to our next measure. Maybe the Levenshtein distance can save the day.

Levenshtein distance

Just as with the Dice-coefficient, I had no clou about what the Levenshtein distance was but only a little research was enough to show me that this string metric is a huge topic in computer science. This is because it can deal with missing and superflous characters and therefore is a very good metric to iron out typos in a user’s input (e.g. in search engine prompts or in spell checkers).

Unlike the Dice-coefficient, its logic is a little more advanced. It depents on dynamic programming and demands a little more expertise to implement. In short, the Levenshtein distance tells us how many steps of insertion, deletion or alternation of characters are necessary to get from string 1 to string 2. So, for example, to get from ambon to amon we only need 1 step, i.e. deletion of [b]. However, the details of the algorithms are quite advanced since it uses a matrix to calculate results of sub-solutions. Luckily, there are many YouTube videos that explain the logic behind the Levenshtein distance step-by-step, so I leave it to the interested reader to dive deeper into this topic (I recommend you to use pen and paper and fully iterate through the whole algorithm - it can be fun).

For our purposes, we just ask ChatGPT to write a Python function that does the job. Here is our prompt (again in German):

Nun möchte ich für die selbe Aufgabe, d.h. den Vergleich von zwei Wörtern auf ihre Ähnlichkeit, ein anderes Maß verwenden. Und zwar die Levenshtein-Distanz. Schreibe mir bitte eine Python-Funktion, die das übernimmt.

Here is what we get:

def levenshtein_distance(word1, word2):
    # Erstellen einer Matrix zur Speicherung der Teilabstände
    matrix = [[0] * (len(word2) + 1) for _ in range(len(word1) + 1)]
    
    # Initialisierung der ersten Zeile und Spalte der Matrix
    for i in range(len(word1) + 1):
        matrix[i][0] = i
    for j in range(len(word2) + 1):
        matrix[0][j] = j
    
    # Berechnung der Teilabstände
    for i in range(1, len(word1) + 1):
        for j in range(1, len(word2) + 1):
            if word1[i - 1] == word2[j - 1]:
                substitution_cost = 0
            else:
                substitution_cost = 1
            
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,  # Löschen
                matrix[i][j - 1] + 1,  # Einfügen
                matrix[i - 1][j - 1] + substitution_cost  # Ersetzen/Substitution
            )
    
    # Die letzte Zelle der Matrix enthält die Levenshtein-Distanz
    return matrix[-1][-1]

The return value of this function is the number of steps (deletion, insertion or alternation) that are necessary to get from word1 to word2.

[Hier dann noch ein Paar Bsp. ausprobieren (aber dazu nochmal den unnormalisierte Wert nehmen)]

Comparing the measures

The attentive reader will have noticed that the two measures we used above, Dice-coefficient and Levenshtein distance, are quite different. Whereas the first measures the similary of two strings, the second measures their dissimilarity. Furthermore, one is normalised for a value between 0.0 and 1.0, while the other is open from 0 to infinite.

To make these two measures comparable we need to deal with normalisation first. The Levenshtein distance can be normalised by simply dividing the calculated number of steps by the maximum number of steps. The latter is identical with the length of the longer word, since the maximum number of steps would be to simply insert all characters of the longest word.

        # The max distance is simply the length of the longest word
        max_distance = max(len(word_form_1), len(word_form_2))

        # now normalize this lev_distance
        normalized_distance = lev_distance / max_distance

Now we have a value scaled for the interval between 0.0 and 1.0 as well. However, this interval is still in reversed order, since it gives us 0.0 for a perfect match and 1.0 for maximum distance (reversed from the Dice-coefficient). We deal with this by simply subtracting the resulting value from 1. Since the result is no longer a measure of distance, we call it Levenshtein similarity:

        # now transform this distance into a measure of similarity
        normalized_similarity = 1 - normalized_distance

Now we are ready for some trial runs:

Similarity for words 'ambon' and 'amon'.
Dice score = 0.4444444444444444
Lev score = 0.8

Similarity for words 'ponomomba' and 'prinomomba'.
Dice score = 0.10526315789473684
Lev score = 0.8

Similarity for words 'amicus' and 'ami'.
Dice score = 0.6666666666666666
Lev score = 0.5

As we expected, the character deletion in the middle of ambon is better captured by Levenshtein similary. The same is true for some mutations at the beginning of the word, see ponomomba vs. prinomomba. However, deletions at the end of the word are better captured by the Dice-coefficient. As we see in Latin amicus vs. French ami, the Dice-coefficient is higher than the Levenshtein similarity.

This later fact should be of interest for all historical linguists. Language change often leads to elisions at the end of the word, as e.g. dropping the latin suffix -us. Therefore, the Dice-coefficient is quite usefull in such cases.

Future outlook

This short test showed me that the measure one takes should depend on what one tries to find out. Indeed, Levenshtein distance is the better measure if words differ in such a way that some parts were elided or inserted. However, it also depends on where this happened. Ellisions at the end of the word (while the rest of the word stays the same) are better captured with the Dice-coefficient. One can easily think of another scenario: the case when initial characters were lost, e.g. if I have recorded a word form with a prefix in one list and the same word without the prefix in the other list. Let’s say two verbs: abkanggi and kanggi. Here are the results from our algorithms:

Similarity for words 'abkanggi' and 'kanggi'.
Dice score = 0.0
Lev score = 0.75

Since the words differ right from the start, the Dice-coefficent is pretty much useless here. But what if we alter it a little? Instead of comparing from left to right, we could just reverse the strings and compare the words backward. This way we get the following:

Similarity for words 'abkanggi' and 'kanggi'.
Dice score = 0.0
Lev score = 0.75
Reversed Dice score = 0.8571428571428571

Now the reversed Dice-coefficient clearly outranks the Levensthein similarity and correctly measures the high similarity between abkanggi and_kanggi_.

This is just another realistic case in which the Dice-coefficent becomes quite usefull. So I conclude that we need different measures for different scenarios. Instead of running through all of our data with one and the same algorithm, we can try various measures and compare their benefits and shortcomings.

In part 2, I will look out for other measures of lexical similarity. Finally, I will integrate all findings and apply them to my collected wordlists in part 3.