The [http://en.wikipedia.org/wiki/Levenshtein_distance|Levenshtein edit distance] calculation is useful for comparing text strings for similarity, as a spell checker would. [[hi_saito]] has written what looks like a [http://d.hatena.ne.jp/Rocco/20071118/p2|straightforward implementation] of the reference algorithm described in the above-linked Wikipedia article. hi_saito's code is linked to rather than included outright because no licensing terms appear on the page.

[[gnomon]] is planning to write a more compact (and hopefully speedier) implementation that will appear here soon. The plan is to compute and retain only those values that are necessary to calculate the edit distance, rather than calculating the entire NxM matrix. The [http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html|lazy-evaluation method], which can yield substantial speed improvements, probably requires more effort and code complexity than the performance gains would be worth; still, for short strings, the lazy code could perhaps be modeled via recursion by working from the end of the string rather than the beginning. If experiments are run, the results will also appear here.

Here is the above-mentioned streamlined implementation. There were eleven previous versions, all of which were benchmarked across gawk, mawk and busybox awk. The approaches started with a naive implementation and explored table-based, recursive (with no, single and shared memoization) and lazy models. As expected, the lazy version was incredibly fiddly and not pleasant to read or pursue. Findings will appear here later, but for now, here's the code.

----

=== The Algorithm

{{{ awk
function levdist(str1, str2,    l1, l2, tog, arr, i, j, a, b, c) {
    if (str1 == str2) {
        return 0
    } else if (str1 == "" || str2 == "") {
        # one string is empty, so the distance is the length of the other
        return length(str1 str2)
    } else if (substr(str1, 1, 1) == substr(str2, 1, 1)) {
        # strip any common prefix, then recurse on the remainders
        a = 2
        while (substr(str1, a, 1) == substr(str2, a, 1)) a++
        return levdist(substr(str1, a), substr(str2, a))
    } else if (substr(str1, l1=length(str1), 1) == substr(str2, l2=length(str2), 1)) {
        # strip any common suffix, then recurse on the remainders
        b = 1
        while (substr(str1, l1-b, 1) == substr(str2, l2-b, 1)) b++
        return levdist(substr(str1, 1, l1-b), substr(str2, 1, l2-b))
    }
    # dynamic-programming pass over what remains, keeping only the
    # current and previous rows of the usual NxM matrix; tog toggles
    # between the two rows
    for (i = 0; i <= l2; i++) arr[0, i] = i
    for (i = 1; i <= l1; i++) {
        arr[tog = ! tog, 0] = i
        for (j = 1; j <= l2; j++) {
            a = arr[! tog, j  ] + 1    # deletion
            b = arr[  tog, j-1] + 1    # insertion
            c = arr[! tog, j-1] + (substr(str1, i, 1) != substr(str2, j, 1))    # substitution
            arr[tog, j] = (((a <= b) && (a <= c)) ? a : ((b <= a) && (b <= c)) ? b : c)
        }
    }
    # the inner loop leaves j == l2+1, so this is the final cell
    return arr[tog, j-1]
}
}}}
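Before feeding it a word list, a quick smoke test may be handy. This snippet is illustrative only (it assumes the levdist function above is pasted into the same file); the expected values come from the unit tests further down:

{{{ awk
BEGIN {
    print levdist("kitten", "sitting")     # expect 3
    print levdist("Saturday", "Sunday")    # expect 3
}
}}}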
----

=== Ancillary Code

==== Generating all pairs of words from a list, and running an edit distance between them

{{{ awk
BEGIN { OFS = "\t" }

{ words[NR] = $0 }

END {
    max = 0
    for (i = 1; i in words; i++) {
        for (j = i + 1; j in words; j++) {
            #new = levdist(words[i], words[j], memo)    # for levdist-04 and -10, recursive with memoization
            new = levdist(words[i], words[j])
            print words[i], words[j], new
            if (new > max) {
                max = new
                bestpair = (words[i] " - " words[j] ": " new)
            }
        }
    }
    print bestpair
}
}}}

==== Unit tests

{{{ awk
#!/usr/bin/awk -f
function testlevdist(str1, str2, correctval,    testval) {
    testval = levdist(str1, str2)
    if (testval == correctval) {
        printf "%s:\tCorrect distance between '%s' and '%s'\n", testval, str1, str2
        return 1
    } else {
        printf "MISMATCH on words '%s' and '%s' (wanted %s, got %s)\n", str1, str2, correctval, testval
        return 0
    }
}

BEGIN {
    testlevdist("kitten", "sitting", 3)
    testlevdist("Saturday", "Sunday", 3)
    # grabbed from wikipedia: http://en.wikipedia.org/wiki/Levenshtein_distance
    testlevdist("acc", "ac", 1)
    # example given by hi_saito in #awk on Freenode on 2008-06-21 at 0800h-5Z
    testlevdist("foo", "four", 2)
    testlevdist("foo", "foo", 0)
    testlevdist("cow", "cat", 2)
    testlevdist("cat", "moocow", 5)
    testlevdist("cat", "cowmoo", 5)
    testlevdist("sebastian", "sebastien", 1)
    testlevdist("more", "cowbell", 5)
    # grabbed from the unit tests for the Perl module 'Text-Levenshtein'
    # http://search.cpan.org/dist/Text-Levenshtein/
    testlevdist("freshpack", "freshpak", 1)
    testlevdist("freshpak", "freshpack", 1)
    # grabbed from Slava Pestov
    # http://factor-language.blogspot.com/2006/06/levenshtein-edit-distance-algorithm-in.html
}
}}}

----

=== Further Reading

* [http://www.csse.monash.edu.au/~lloyd/tildeStrings/Alignment/92.IPL.html|Lazy Dynamic Programming Can Be Eager] (Lloyd Allison, 1992) - how to express the Levenshtein computation as a lazily-evaluated data structure that only evaluates what is strictly necessary to find the edit distance. The runtime of levdist(m, n) is claimed to be O(n * (1 + levdist(m, n))) and the programming language used is Lazy ML. I'm not sure if I like this approach: yes, there are potentially dramatic performance gains to be made if the strings are very large and the edit distances are relatively small (say, for example, in the case of comparing two large DNA sequences - in fact, perhaps that case ought to be added to the unit tests; a sketch of one appears below), but if the algorithm is primarily going to be used for spell-checking single words, the complexity cost could arguably be too high.
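As a sketch of that suggested long-string test (illustrative only, not part of the original suite; it assumes the testlevdist harness and the levdist function above are combined in one file): two strings that differ by a single seven-character insertion must be exactly seven edits apart, because the length difference forces at least seven edits and seven insertions suffice. Note that the prefix- and suffix-stripping steps in levdist dispose of this case almost instantly despite the string length:

{{{ awk
BEGIN {
    # build an 8192-character repeating sequence: 4 chars doubled 11 times
    base = "ACGT"
    for (i = 1; i <= 11; i++) base = base base
    # splice a 7-character block into the middle; the distance must be 7
    testlevdist(base, substr(base, 1, 4096) "GATTACA" substr(base, 4097), 7)
}
}}}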