Sunday, February 9, 2014

In regular expression search, matches can be partial or full. In order that we process all the groups in a full match, we iterate multiple times.
In either case, there are a few approaches to pattern matching.
The first is backtracking
The second is finite automata
The third approach is RE2.
Often production implementations mix and match more than one of the approaches above.
We discuss these briefly here.

Kerninghan and Pike mention a very fast and simple backtracking approach. Matches are determined one position at a time. Positions need not be contiguous. If the match occurs, the remainder of the pattern is checked against the rest of the string. A recursive function helps here and the recursion can be nested upto the length of the pattern. This pattern is maintained based on the literals and metacharacters provided. The empty string can match with an aseterisk.  A separate recursive method may exist for the asterisk wild card character since it is matched in one of three different ways, depending on the parameters to the function.

RE2 approach is used by PCRE, perl and python. Unlike the backtracking approach above that can take an exponential order of time even though such an approach can support a variety of features, RE2 is efficient in that it uses automata theory and supports leftmost-first match. This approach has the following steps:
Step 1: Parse (Walking a Regexp)
Step 2: Simplify
Step 3: Compilstea.de
Step 4: Matchns in

The above steps are explained as below:

The parser could be simple but there's a variety of regular expressions that can be specified. Some parsers maintain an explicit stack instead of using recursion. Some parsers allow a single linear scan by keeping reverse polish notation. Others use trees or explicit stack. Its possible to parse a regular  expression as a decision tree on the specified input text to see if any of the substrings at a position match the pattern. I take that back, we could do with a graph of predefined operations instead. These operations are concatenation, repetition and alternation.
The next step is to simplify.The purpose of simplifying is to reduce the memory footprint.
The result of the next step i.e. compilation is an instruction graph that makes it easy for pattern matching. This way parser doesn't need to be invoked again and again.
Lastly, the step for matching can be both partial or full.

No comments:

Post a Comment