Algorithms for approximate string matching
 

This topic was presented at Search Engines Amsterdam Meetup (28 October 2016, Academisch-cultureel centrum SPUI25).

Summary: Approximate string matching or fuzzy search is the technique of finding strings in text or dictionary that match pattern approximately. The most common applications of fuzzy search are spell checking, syntax error correction, spam filtering, correction of OCR errors. At ZyLAB we are developing an enterprise level full-text search engine that supports boolean and proximity operators, fast wildcard, regular expression and fuzzy search. Over the years we have gained a lot of experience in this area and implemented various algorithms and data structures in ZyLAB Search Engine. In this talk we are going to present an overview of approximate string matching algorithms that use Levenshtein metric and their practical application in our software.

About presenter Petro Protsyk has a MSc degree in Computer Science at Taras Shevchenko National University of Kyiv, Cybernetics department in 2006. Since then Petro works as a professional software developer. He joined ZyLAB in 2010 to work on ZyLAB's proprietary Search Engine and eDiscovery solution. During the last 6 years Petro focused on building distributed solutions for processing large volumes of data and implementing a new version of the full-text search engine for ZyLAB. For more information, see https://www.zylab.com

Links

  • Slides of presentation (pptx, pdf).

Implementation of presented approximate string matching algorithms in C#

Levenshtein distance between two strings using brute-force recursive algorithm (source).

Levenshtein distance between two strings using dynamic programming algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using shift-and algorithm also known as Bitap or Baeza-Yates–Gonnet algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using Levenshtein automaton (source).