This article describes a simple implementation of the string search. It can be used for approximate string matching (for more information, see https://en.wikipedia.org/wiki/Fuzzy_string_searching).
Other algorithms for approximate string searching exist (e.g., Soundex), but those aren’t as easy to implement. The algorithm in this article is easy to implement, and can be used for tasks where approximate string searching is used in an easy way.
The algorithm used the Levenshtein-distance for determining how exact a string from a word list matches the word to be found. Information about the Levenshtein-distance can be found at https://en.wikipedia.org/wiki/Levenshtein_distance.
C# implementation
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks; namespace PSC.Search.Demo { class Program { static void Main(string[] args) { string word = "Pure Source Code"; List<string> wordList = new List<string> { "Code Project", "Pure SourceCOde", "sourcecode", "puresourcecode", "Source Kode", "Kode Project", "Other Source" }; List<wordrank> foundWords = FuzzySearch.Search(word, wordList, 0.30); foundWords.ForEach(i => Console.WriteLine(i.Word + " (" + i.Rank.ToString() + ")")); Console.ReadKey(); } } }
Output:
A basic approach is shown. Instead of the Levenshtein-distance, a more optimized algorithm could be used – but here, a quite simple implementation is given for clarity reasons.
public static int LevenshteinDistance(string src, string dest) { int[,] d = new int[src.Length + 1, dest.Length + 1]; int i, j, cost; char[] str1 = src.ToCharArray(); char[] str2 = dest.ToCharArray(); for (i = 0; i <= str1.Length; i++) { d[i, 0] = i; } for (j = 0; j <= str2.Length; j++) { d[0, j] = j; } for (i = 1; i <= str1.Length; i++) { for (j = 1; j <= str2.Length; j++) { if (str1[i - 1] == str2[j - 1]) cost = 0; else cost = 1; d[i, j] = Math.Min( d[i - 1, j] + 1, // Deletion Math.Min( d[i, j - 1] + 1, // Insertion d[i - 1, j - 1] + cost)); // Substitution if ((i > 1) && (j > 1) && (str1[i - 1] == str2[j - 2]) && (str1[i - 2] == str2[j - 1])) { d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost); } } } return d[str1.Length, str2.Length]; }
In the search process, for each word in the wordlist, the Levenshtein-distance is computed, and with this distance, a score. This score represents how good the strings match. The input argument fuzzyness determines how much the strings can differ.
public static List<wordrank> Search(string word, ListwordList, double fuzzyness) { List<wordrank> foundWords = new List<wordrank>(); foreach (string s in wordList) { // Calculate the Levenshtein-distance: int levenshteinDistance = LevenshteinDistance(word, s); // Length of the longer string: int length = Math.Max(word.Length, s.Length); // Calculate the score: double score = 1.0 - (double)levenshteinDistance / length; // Match? if (score > fuzzyness) foundWords.Add(new WordRank() { Rank = score, Word = s }); } return foundWords; }
SQL implementation
Also I’ve implemented a version for SQL.
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999)) RETURNS int AS BEGIN DECLARE @s1_len int, @s2_len int DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000) SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0 WHILE @j <= @s2_len SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1 WHILE @i <= @s1_len BEGIN SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1 WHILE @j <= @s2_len BEGIN SET @c = @c + 1 SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) + CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END IF @c > @c_temp SET @c = @c_temp SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1 IF @c > @c_temp SET @c = @c_temp SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1 END SELECT @cv1 = @cv0, @i = @i + 1 END RETURN @c END
Usage:
select dbo.edit_distance('Fuzzy String Match','fuzzy string match'), dbo.edit_distance('fuzzy','fuzy'), dbo.edit_distance('Fuzzy String Match','fuzy string match'), dbo.edit_distance('levenshtein distance sql','levenshtein sql server'), dbo.edit_distance('distance','server')