PSC.Search: a new implementation of Levenshtein distance algorithm

This article describes a simple implementation of the string search. It can be used for approximate string matching (for more information, see http://en.wikipedia.org/wiki/Fuzzy_string_searching).

Other algorithms for approximate string searching exist (e.g., Soundex), but those aren't as easy to implement. The algorithm in this article is easy to implement, and can be used for tasks where approximate string searching is used in an easy way.

The algorithm used the Levenshtein-distance for determining how exact a string from a word list matches the word to be found. Information about the Levenshtein-distance can be found at http://en.wikipedia.org/wiki/Levenshtein_distance.

C# implementation

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace PSC.Search.Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            string word = "Pure Source Code";
            List<string> wordList = new List<string>
            {
                "Code Project",
                "Pure SourceCOde",
                "sourcecode",
                "puresourcecode",
                "Source Kode",
                "Kode Project",
                "Other Source"
            };

            List<wordrank> foundWords = 
                           FuzzySearch.Search(word, wordList, 0.30);

            foundWords.ForEach(i => Console.WriteLine(i.Word + 
                                          " (" + i.Rank.ToString() + 
                                          ")"));
            Console.ReadKey();
        }
    }
}

Output:

FuzzySearch

A basic approach is shown. Instead of the Levenshtein-distance, a more optimized algorithm could be used - but here, a quite simple implementation is given for clarity reasons.

public static int LevenshteinDistance(string src, string dest)
{
    int[,] d = new int[src.Length + 1, dest.Length + 1];
    int i, j, cost;
    char[] str1 = src.ToCharArray();
    char[] str2 = dest.ToCharArray();

    for (i = 0; i <= str1.Length; i++)
    {
        d[i, 0] = i;
    }

    for (j = 0; j <= str2.Length; j++)
    {
        d[0, j] = j;
    }

    for (i = 1; i <= str1.Length; i++)
    {
        for (j = 1; j <= str2.Length; j++)
        {
            if (str1[i - 1] == str2[j - 1])
                cost = 0;
            else
                cost = 1;

            d[i, j] =
                Math.Min(
                    d[i - 1, j] + 1,                    // Deletion
                    Math.Min(
                        d[i, j - 1] + 1,                // Insertion
                        d[i - 1, j - 1] + cost));       // Substitution

            if ((i > 1) && (j > 1) && 
                (str1[i - 1] == str2[j - 2]) && 
                (str1[i - 2] == str2[j - 1]))
            {
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost);
            }
        }
    }

    return d[str1.Length, str2.Length];
}

In the search process, for each word in the wordlist, the Levenshtein-distance is computed, and with this distance, a score. This score represents how good the strings match. The input argument fuzzyness determines how much the strings can differ.

public static List<wordrank> Search(string word, List wordList, 
                                    double fuzzyness)
{
    List<wordrank> foundWords = new List<wordrank>();

    foreach (string s in wordList)
    {
        // Calculate the Levenshtein-distance:
        int levenshteinDistance = LevenshteinDistance(word, s);

        // Length of the longer string:
        int length = Math.Max(word.Length, s.Length);

        // Calculate the score:
        double score = 1.0 - (double)levenshteinDistance / length;

        // Match?
        if (score > fuzzyness)
            foundWords.Add(new WordRank() { Rank = score, Word = s });
    }

    return foundWords;
}

SQL implementation

Also I've implemented a version for SQL.

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END

Usage:

select
 dbo.edit_distance('Fuzzy String Match','fuzzy string match'),
 dbo.edit_distance('fuzzy','fuzy'),
 dbo.edit_distance('Fuzzy String Match','fuzy string match'),
 dbo.edit_distance('levenshtein distance sql','levenshtein sql server'),
 dbo.edit_distance('distance','server')



Happy coding!

Advertsing

125X125_06

TagCloud

MonthList

CommentList