Monday, May 19, 2014

Regular Expressions provide a compact and expressive notation for describing patterns of text. Their algorithms are interesting and are helpful in a variety of applications. They are accredited with making scripting an alternative to programming. They come in several flavors but almost all use wild cards which are special notations.
Some common wildcards include the following
* to denote everything
. to denote a single character
$ for the end of string
^ for the beginning of a string
^$ matches an empty string
[] to include specific sets of characters
+ for one or more occurrences
? for zero or more occurrences
\ as a prefix to quote a meta character
These building blocks are combined with parenthesis for grouping.

Even if we take the four elementary regular expression wild cards ^, $, . and * we can use those for a a modest regular expression search function. Kerninghan and Pike showed that with functions for match and matchhere this can be simple and easy. match determines whether a string matches a regular expression. matchhere checks if the text matches at each position in turn. As soon as we find a match, the search is over. If the expression begins with a ^, the text must begin with a match of the remainder of the pattern. Otherwise the characters are traversed one by one invoking match here at each position. Even empty text would need to be iterated because * can match empty strings.
Example 1:
Regular Expressions

by Brian W. Kernighan and Rob Pike
/* match: search for re anywhere in text */
int match(char *re, char *text)
{
   if (re[0] == '^')
      return matchhere(re+1, text);
   do {  /* must look at empty string */
      if (matchhere(re, text))
         return 1;
   } while (*text++ != '\0');
   return 0;
}

The matchhere  method searches for regular expression at the beginning of the text.  We first check for a null or empty string, then we check for the asterisk to see if all the text can be matched.  When checking for all the text the delimiter is either a null terminator or a character that doesn't match or the expression doesn't have a '.'. We call this the match star method. Next we check for the $ or the null terminator in the text. Lastly we check again for more text and call matchhere.

The matchstar method matches zero or  more instances by calling matchhere for the text one character at a time until the prefix doesn't match or the the prefix is a dot which means any character or until the end of the text is reached.

The matchhere method calls the match star method in one of the four cases that it checks. For all the other cases, it either returns a boolean corresponding to end of pattern reached or end of text reached or it calls itself recursively if the previous character in the regular expression is a period or a literal match.

We will look at a few more examples of this regular expression search.

No comments:

Post a Comment