All posts by m1chal1s

console

A fast implementation of the Levenshtein algorithm in C#

Given two strings, Levenshtein gives us the distance between these strings expressed as the minimum count of character operations (deletions, insertions, or substitutions) that are necessary to convert one string to the other.

To give a few examples, here are all the Levenshtein distances of the strings “Jack”, “Jake”, “Moe”, “Jo”, “Joe”, “Stew”, and “Stewart”:

First string Second string Levenshtein
Jack Jake 2
Jack Moe 4
Jack Jo 3
Jack Joe 3
Jack Stew 4
Jack Stewart 6
Jake Moe 3
Jake Jo 3
Jake Joe 2
Jake Stew 4
Jake Stewart 6
Moe Jo 2
Moe Joe 1
Moe Stew 3
Moe Stewart 6
Jo Joe 1
Jo Stew 4
Jo Stewart 7
Joe Stew 3
Joe Stewart 6
Stew Stewart 3

The Levenshtein value of two identical strings is always 0. That is why I have omitted all the rows where the first string is the same as the second string. Also, the Levenshtein function is commutative. That is, Levenshtein(s, t) equals Levenshtein(t, s).

Levenshtein is a rather simple algorithm – but it’s a useful one. It allows us to easily compare anything that can be expressed as a string.

There’s a rather good implementation of the Levenshtein algorithm on the Wikipedia page, but I’ve made a few changes of my own. Why? Well, even though the changes I’ve made are minor, my implementation is faster. In fact, depending on the hardware, my implementation is 5% to 20% faster.

To give an example, I’ve run both implementations (the one on Wikipedia, and mine) on a modern laptop with a 4th generation i7 processor 1073741824 (2^30) times – that is a bit more than a billion calls. This many calls take 00:09:10.1372578 (9 minutes and ≈10 seconds) on the Wikipedia algorithm, and 00:07:32.4470743 (7 minutes and ≈32 seconds) on mine. That’s 17,76% faster – an improvement of about 424 extra calls per millisecond.

Calls Original (Wikipedia) Improved %
67108864 00:00:33.8682674 00:00:27.3321221 19,30%
134217728 00:01:07.7492047 00:00:55.2858276 18,40%
268435456 00:02:17.7560969 00:01:52.9447724 18,01%
536870912 00:04:35.9671041 00:03:46.1908114 18,04%
1073741824 00:09:10.1372578 00:07:32.4470743 17,76%

So, without further ado, here’s the Levenshtein code: