Algorithms for the approximate string matching

This topic was presented at Search Engines Amsterdam Meetup (28 October 2016, Academisch-cultureel centrum SPUI25).

Summary: The approximate string matching or the fuzzy search is a technique of finding strings in the text or in the dictionary that match a pattern approximately. The most common applications of a fuzzy search are the spell checking, the syntax error correction, the spam filtering and the correction of OCR errors.
In this talk we present an overview of the approximate string matching algorithms that use Levenshtein metric.
At ZyLAB we are developing an enterprise level Full-Text Search Engine that supports a boolean and proximity operators, wildcards, regular expressions and a fuzzy search. Over the years the company has gained a lot of experience in this area.

About presenter Petro Protsyk obtained Master's Degree in Computer Science from Taras Shevchenko National University of Kyiv. Since 2006 Petro works as a professional software developer. He joined ZyLAB in 2010 to work on the company's proprietary Search Engine.

For more information about ZyLAB, see: http://www.zylab.com

Links

  • Slides of presentation (pptx, pdf).

Implementation of the presented approximate string matching algorithms in C#

Levenshtein distance between two strings using brute-force recursive algorithm (source).

Levenshtein distance between two strings using dynamic programming algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using shift-and algorithm also known as Bitap or Baeza-Yates-Gonnet algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using Levenshtein automaton (source).

Last updated: 18.12.2020,
created: 15.11.2016.