C++11 regex-based tokenizer

Tokenization, the process of splitting text into tokens, is a common task in many computer programs. For example, the text “My birthday is 20/01/1984” might be split into the following tokens: “My”, “birthday”, “is”, “20/01/1984”.

The tokenization approach that I am going to use here is based on regular expressions from the C++ Standard Library. Tokenization is performed in two phases. In the first phase, std::regex_token_iterator splits the input text into tokens using a regular expression that matches whitespace. The real power of regular expressions comes in the second phase, where the tokenizer matches each token against a sequence of regular expressions to recognize dates, numbers, e-mail addresses or URLs. This list of token types can easily be extended by adding new regular expressions. Even though this approach is simple and powerful, it is only practical for parsing small texts.
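To make the two phases concrete, here is a minimal sketch. It is not the code from the linked Regex Tokenizer; the names TokenType, Token, split_on_whitespace, classify and tokenize are hypothetical, and the patterns are deliberately simplified assumptions rather than the expressions taken from regexlib.com.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token categories, following the types named in the text.
enum class TokenType { Date, Number, Email, Url, Word };

struct Token {
    std::string text;
    TokenType type;
};

// Phase 1: split the input on runs of whitespace. Passing -1 as the
// submatch index makes the iterator yield the text between the matches.
std::vector<std::string> split_on_whitespace(const std::string& input) {
    static const std::regex ws(R"(\s+)");
    std::vector<std::string> parts;
    std::sregex_token_iterator it(input.begin(), input.end(), ws, -1), end;
    for (; it != end; ++it) {
        // Leading whitespace produces one empty piece; skip it.
        if (it->length() > 0) parts.push_back(it->str());
    }
    return parts;
}

// Phase 2: try each pattern in order; the first full match decides the type.
// The patterns are simplified placeholders, not production-grade validators.
TokenType classify(const std::string& token) {
    static const std::vector<std::pair<std::regex, TokenType>> rules = {
        {std::regex(R"(\d{1,2}/\d{1,2}/\d{4})"), TokenType::Date},
        {std::regex(R"([+-]?\d+(\.\d+)?)"), TokenType::Number},
        {std::regex(R"([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"),
         TokenType::Email},
        {std::regex(R"(https?://\S+)"), TokenType::Url},
    };
    for (const auto& rule : rules) {
        if (std::regex_match(token, rule.first)) return rule.second;
    }
    return TokenType::Word;  // fallback when no pattern matches
}

// Both phases combined.
std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    for (const auto& part : split_on_whitespace(input)) {
        tokens.push_back({part, classify(part)});
    }
    return tokens;
}
```

In this sketch, adding a new token type means adding one enumerator and one entry to the rules table. Because the first matching rule wins, more specific patterns should come before more general ones.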

Matching text against regular expressions can be expensive, especially in the second phase. On my Dell Studio 1555 laptop (Intel Core 2 Duo P8700 2.53GHz with 4GB of RAM and x64 Windows 7) the tokenization speed is about 400-800 kilobytes per second.
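A throughput figure like this can be obtained with a simple timing harness. The sketch below is one way to do it and is not the measurement code behind the numbers above. The helper kilobytes_per_second and the stand-in workload count_pieces are hypothetical names, and the workload only performs a whitespace split, so it will not reproduce the cost of the second phase.

```cpp
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: runs `work` once over `text` and returns the
// throughput in kilobytes per second (1 KB = 1024 bytes).
template <typename Work>
double kilobytes_per_second(const std::string& text, Work work) {
    const auto start = std::chrono::steady_clock::now();
    work(text);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return (text.size() / 1024.0) / elapsed.count();
}

// Stand-in workload: counts the non-empty pieces produced by splitting
// on whitespace. A real measurement would call the full tokenizer here.
std::size_t count_pieces(const std::string& text) {
    static const std::regex ws(R"(\s+)");
    std::sregex_token_iterator it(text.begin(), text.end(), ws, -1), end;
    std::size_t count = 0;
    for (; it != end; ++it) {
        if (it->length() > 0) ++count;
    }
    return count;
}
```

A call such as kilobytes_per_second(text, count_pieces) gives a single sample. For a stable figure the input should be large enough to take well over a few milliseconds, and the measurement should be repeated several times.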

The C++ code that implements this idea is available here: Regex Tokenizer.

The attached code uses some regular expressions from the following website: regexlib.com.

A good introduction to TR1 C++ regular expressions can be found here and in this video.