Algorithms for the approximate string matching

This topic was presented at Search Engines Amsterdam Meetup (28 October 2016, Academisch-cultureel centrum SPUI25).

Summary: The approximate string matching or the fuzzy search is a technique of finding strings in the text or in the dictionary that match a pattern approximately. The most common applications of a fuzzy search are the spell checking, the syntax error correction, the spam filtering and the correction of OCR errors.
At ZyLAB we are developing the enterprise level Full-Text Search Engine that supports a boolean and proximity operators, wildcards, regular expressions and a fuzzy search. Over the years the company has gained a lot of experience in this area. ZyLAB Search Engine is built upon state of the art algorithms and data structures.
In this talk we are going to present an overview of the approximate string matching algorithms that use Levenshtein metric. A practical application of these algorithms in ZyLAB software will be discussed.

About presenter Petro Protsyk obtained Master's Degree in Computer Science from Taras Shevchenko National University of Kyiv. Since 2006 Petro works as a professional software developer. He joined ZyLAB in 2010 to work on the company's proprietary Search Engine. During the last 6 years Petro focused on building distributed solutions for processing large volumes of data and implementing a new full-text and audio search engines for ZyLAB.

For more information about ZyLAB, see: http://www.zylab.com

Links

  • Slides of presentation (pptx, pdf).

Implementation of the presented approximate string matching algorithms in C#

Levenshtein distance between two strings using brute-force recursive algorithm (source).

Levenshtein distance between two strings using dynamic programming algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using shift-and algorithm also known as Bitap or Baeza-Yates-Gonnet algorithm (source).

Check if two strings are within specified Levenshtein distance from each other using Levenshtein automaton (source).

Last updated: 15.07.2017,
created: 15.11.2016.