Friday, February 7, 2014

At the heart of every text search is a pattern match. The regex operator is very widely used, and it is as important in Splunk as in any other search software.
Let us look at this operator more closely and see if we can find an implementation that works well. There is a need to optimize some code paths for fast pattern matching, especially for simple patterns. Here, however, we want to focus on the semantics, the organization and the implementation.
Patterns are best described in terms of Groups and Captures.
A Group can be a literal or a pattern. Groups can be nested, and a group can indicate one or more occurrences of its elements.
A Capture is a match between a group and the text.
A Capture records such things as the index and length of the match within the original string.
A Group can have many captures, often referred to collectively as a CaptureCollection.
A Match may have many groups, each identified by a group number within that match.
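To make the relationship between these terms concrete, here is a minimal sketch of the data model in Python. The class names mirror the terms used above, but the field names and the make_capture helper are assumptions made for illustration, not the API of any particular regex library. The sample result is built by hand for a hypothetical pattern; no matching engine is involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Capture:
    """One matched substring, located by index and length in the original text."""
    value: str
    index: int
    length: int


@dataclass
class Group:
    """A numbered group and every capture it produced during one match."""
    number: int
    captures: List[Capture] = field(default_factory=list)


@dataclass
class Match:
    """One match of the whole pattern, holding its groups in numeric order."""
    groups: List[Group] = field(default_factory=list)


def make_capture(text: str, index: int, length: int) -> Capture:
    """Hypothetical helper: build a Capture by slicing the original text."""
    return Capture(text[index:index + length], index, length)


text = "ab12"
# Hand-built result for the hypothetical pattern ([a-z])+(\d+) against "ab12":
# group 1 repeats, so it holds two captures; group 2 holds one.
match = Match(groups=[
    Group(1, [make_capture(text, 0, 1), make_capture(text, 1, 1)]),
    Group(2, [make_capture(text, 2, 2)]),
])

for group in match.groups:
    for number, capture in enumerate(group.captures):
        print(group.number, number, capture.value, capture.index, capture.length)
```

Keeping the captures as a list on the group is what plays the role of the CaptureCollection here.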
Matches can follow one after the other in a string, and all of them need to be found. The caller can call Match.NextMatch() to iterate over them.
The output should look something like this:
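As a rough analogue of stepping from one match to the next, Python's standard re module offers re.finditer, which yields successive non-overlapping matches from left to right. The find_all function name and the sample text below are made up for illustration.

```python
import re
from typing import List, Tuple


def find_all(pattern: str, text: str) -> List[Tuple[str, int, int]]:
    """Return (value, index, length) for every non-overlapping match, left to right."""
    return [
        (m.group(0), m.start(), m.end() - m.start())
        for m in re.finditer(pattern, text)
    ]


results = find_all(r"\d+", "a1 bb22 ccc333")
for value, index, length in results:
    print(f"value={value} Index={index} Length={length}")
```

Each iteration resumes scanning where the previous match ended, so no match is reported twice.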
Original text
Match found :
    Group 1=
                Capture 0 =   value      Index=      Length=
                Capture 1 =   value      Index=      Length=
    Group 2=
                Capture 0 =   value      Index=      Length=
                :
and so on.
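A report along these lines can be sketched with Python's standard re module. One caveat: unlike an engine that keeps a full capture collection per group, Python's re retains only the last capture of a repeated group, so every group in this sketch lists at most one capture. The format_report name, the exact spacing and the sample pattern are assumptions for illustration.

```python
import re


def format_report(pattern: str, text: str) -> str:
    """Render each match, its numbered groups and their captures as text."""
    lines = [text]
    for m in re.finditer(pattern, text):
        lines.append("Match found : " + m.group(0))
        for number in range(1, m.re.groups + 1):
            lines.append(f"    Group {number}=")
            if m.start(number) == -1:
                # The group did not take part in this match: no capture to list.
                continue
            value = m.group(number)
            lines.append(
                f"        Capture 0 = {value}  "
                f"Index={m.start(number)}  Length={len(value)}"
            )
    return "\n".join(lines)


report = format_report(r"(\w+)@(\w+)", "alice@example bob@test")
print(report)
```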
Since wildcards and other metacharacters are supported, it is important to match the group against each possible candidate capture.
All captures are unique in the sense that each has a distinct index and length pair. Indexes and lengths won't be sequential, but the larger captures precede the smaller ones because the smaller are typically subsets of the bigger.
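One way to picture this is a brute-force sketch that tries a group's pattern against every substring of the text and orders the hits from largest to smallest. This is only an illustration of the uniqueness and ordering described above, not how a production regex engine works; the candidate_captures name and the tie-break by index are assumptions.

```python
import re
from typing import List, Tuple


def candidate_captures(group_pattern: str, text: str) -> List[Tuple[str, int, int]]:
    """List every substring the group matches in full, largest first."""
    compiled = re.compile(group_pattern)
    found = []
    # Each (index, length) pair is visited exactly once, so entries are unique.
    for index in range(len(text)):
        for end in range(index + 1, len(text) + 1):
            if compiled.fullmatch(text, index, end):
                found.append((index, end - index))
    # Larger captures first; equal lengths are ordered by index (an assumption).
    found.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(text[i:i + n], i, n) for i, n in found]


captures = candidate_captures(r"a+", "baab")
for value, index, length in captures:
    print(f"value={value} Index={index} Length={length}")
```

The quadratic scan over substrings is acceptable for a short example but would need a smarter strategy for long inputs.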
           

