**amazingly useful**concepts that I use on a daily basis, so I thought I'd try to set down everything you need to know about these subjects in one page (or thereabouts).

**Regular Expressions for Searching**

The most common use of regular expressions is for searching through text files. For example, if I wanted to search my source code for all class definitions that include the word 'Collection' in their class names, I would search on the following regular expression: '

*class .*Collection*'. This is read as "the word

*class*, followed by a space, followed by zero or more of (

***) any character (

*.*), followed by the word

*Collection*". That is, most characters are interpreted literally in a regular expression, but some are treated specially (e.g., '

*.*' and '

***').

Here's the typical core grammar of regular expressions (I use

*a, b, c*to denote non-special characters and

*x, y, z*to denote arbitrary regular expressions):

*a*-- the regular expression matching the (non-special) character 'a';*xy*-- the regular expression matching*x*followed by*y*;*(x)*-- the regular expression matching*x*(it is sometimes useful to treat a complex regular expression as a single item);*x|y*-- the regular expression matching*x*or*y*;*x**-- the regular expression matching zero or more instances of*x;*- nil -- the regular expression matching the empty string (which in practice is just written as the empty string, rather than 'nil').

*(aa|bb)*c*' matches any sequence of

*aa*or

*bb*pairs, followed by a

*c*. That is, the

*language*recognised by '

*(aa|bb)*c*' consists of the strings

*{c, aac, bbc, aaaac, aabbc, bbaac, bbbbc, ...}*.

In practice, the language of regular expressions is extended with plenty of syntactic sugar. Here are some standard extensions:

*x+*-- the regular expression matching one or more instances of*x*(i.e.,*x+ = xx**);*x?*-- the regular expression matching zero or one instance of*x*(i.e.,*x? = (x|*nil*);**[abc]*-- the regular expression matching any of the characters*a, b,*or*c*;*[a-f]*-- the regualar expression matching any of the characters in the range*a..f*;*[^abc]*-- the regular expression matching any character that is*not a, b,*or*c*;*[^a-f]*-- the regular expression matching any character that is*not*in the range*a..f*;*^*-- the regular expression matching the start of a line of text;*\$*-- the regular expression matching the end of a line of text;*\a*-- the regular expression matching the escape sequence*a*(e.g.,*\**matches a literal *, \t matches a tab character, \( matches an open parenthesis, etc.)

Learning these is easy and

*really, really useful!*

**State Machines**

Imagine a machine with a set of lights. Each light is called a

*state*and the machine is called a*state machine*. When the machine is started, it always starts with a particular light on (all other lights are off); this is the*start state*.The machine takes as input a sequence of

*symbols*of some type (e.g., characters, GUI events, messages, whatever you like.). As each symbol is read by the machine, which light is shining changes depending on which light was shining just before the new symbol was read*and*on what particular symbol was just read in. For example, if the machine is in state 3, say (i.e., light number three is on) and the next symbol read in is a 'q', then light 3 might turn off and light 7 might come one (i.e., reading a 'q' when in state 3 causes the machine to change to state 7). The machine has a*transition rule*like this for every possible input for every possible state.Each of the lights on the machine is either red or green. The green lights denote

*accepting states*and the red lights denote*non-accepting states*. If the input sequence finishes with the machine showing a green light, then the input sequence was*accepted*by the state machine. Otherwise the input sequence was not accepted.The

For example, here's a state machine to match our earlier pattern

*language*accepted by the state machine is the set of input sequences that leave it in an accepting state. In fact, every*regular expression*has a corresponding state machine and*vice versa*-- they are two ways of describing the same thing!For example, here's a state machine to match our earlier pattern

*(aa|bb)*c*:State | Input | Next state |
---|---|---|

s1 [start] | a | s2 |

b | s3 | |

c | s4 | |

(...) | s5 | |

s2 | a | s1 |

(...) | s5 | |

s3 | b | s1 |

(...) | s5 | |

s4 [accepting] | (...) | s5 |

s5 [error] | (...) | s5 |

Observe the inclusion of an

[Aside: the fancy name for what I've just described is a

*error state*(*s5*) which, like the Hotel California, you can never leave. This state just ensures that the state machine never enters an accepting state once the input has differed from the pattern we are trying to match.[Aside: the fancy name for what I've just described is a

*deterministic finite automaton*or*DFA*.]**Deterministic and Non-deterministic State Machines**

The state machine described above is

*deterministic*, meaning it is only ever in one state and each state transition is defined entirely by the current state and the next symbol to be read. This is a very useful notion for designing programs that have to process sequences.One can also imagine a

*nondeterministic*state machine. Such a machine would works like this: at any point, any collection of lights can be on (i.e., the machine*could be in any one*of these states given the input so far). The transition rules are relaxed so that each state can have multiple possible successors for each input, plus have the option of having successor states that can be reached*without consuming any input at all*. For example, the transition rules for state 3 might say "*move to state 7 after reading a 'q'**". So, for our non-deterministic state machine, if light 3 is one of the lights that is currently on, then light 9 must also be on (because it is a "no input" successor to 3); also, if a 'q' is read next, then both lights 7 and 8 will come on.***or**move to state 8 after reading a 'q'**or**move to state 9 without reading anythingWhy is this idea useful? Well, here's the magic:

*you can always turn a non-deterministic state machine into a deterministic state machine*(how this magic happens is a little beyond the scope of this article, but the starting point is to imagine every possible subset of lights on the*non-deterministic*state machine as corresponding to a single state on the equivalent*deterministic*state machine).It's easy to see how to construct a deterministic state machine that handles simple sequences of symbols or repetition of simple sequences, the tricky bit is handling

*alternation*(i.e., regular expressions of the form*x|y*). The easiest way of constructing a state machine to match*x|y*is to construct a*non-deterministic*state machine from the two state machines for recognising*x*and*y*(hopefully you can imagine how this might be done without too much difficulty),*then*we can turn that into a*deterministic*state machine.In practice, textual regular expressions are typically just compiled into representations of their

*non-deterministic*state machines and the matching algorithm simulates all possible state transitions at each point. This is actually quite effective, since the cost of searching a string of*m*characters with an*n*state machine is just*O(mn)*(and typically much less).

**What Can't Regular Expressions Match?**

Roughly speaking, regular expressions can't match any pattern that requires counting to arbitrarily large numbers. For example, you can't write a regular expression that checks whether the parentheses ( ) in a program are properly balanced, because a program can contain any number of parentheses.

[What follows is for extra credit.]

There is something called the

*pumping lemma*, which says that for any state machine with*n*states that accepts some input of length*n*or more symbols, then that input can be written*somehow*as*xyz*(where*y*is a non-empty sequence and the length of*xy*is at most*n*) and the machine must also accept all patterns of the form*xy*z*(i.e., the machine had to go through at least one loop while recognising the*y*part in*xyz*).How does this relate to counting patterns? Well, let's say we

*did*claim to have a state machine with 4 states that matched*all and only*programs with properly balanced parentheses. Now consider the balanced input sequence '((((a))))', which our state machine duly accepts. The pumping lemma tells us that this must be decomposable as some*xyz*where*y*is non-empty and the length of*xy*is at most 4. Well, we only have four possibilities for*x*and*y*here -- it doesn't matter which one we end up with. Let's say the answer happens to be*x = '*(((' and*y = '*(' -- leaving*z = 'a*))))'*.*The pumping lemma says that our machine must accept all strings of the form*xy*z*, which clearly includes just*xz*, namely '(((a))))', which is clearly not balanced! Therefore our claim must be wrong: our 7 state machine*does not*match all and only well balanced programs. Since there is nothing in our argument that really depends on the 4 states starting point, the same thing must be true for*any number*of states that a parenthesis-matching state machine might have. Therefore,*there is no regular expression (or state machine) that matches all and only well balanced programs*. If you don't believe me, I encourage you to have a go at designing such a state machine!I find all this stuff really rather beautiful.

**What About Counting Support in Modern Regular Expressions?**

Modern textual regular expressions usually include some support for counting, such as

*x{40,70}*meaning "

*between 40 and 70 instances of the pattern x*". This functionality comes at a cost: since the size of the equivalent ordinary regular expressions for these "counting" patterns quickly grows out of hand, they are implemented using backtracking pattern matchers. These are exponentially slower than non-deterministic state machines and it is easy to construct pathological search patterns that will cause these matching algorithms to take forever.

**Further Reading**

Here is a brilliant paper on implementing regular expression matchers.

## 2 comments:

I'm all for Regex, but you've totally lost me in the pumping lemma bit. Wikipedia does not help here - here's the "formal" way of expressing the pumping lemma. The image of the equation could be upside down, I wouldn't know :) http://upload.wikimedia.org/math/b/e/b/beb60549a35aeabf66a3464161ba17d0.png

Hi Ilya,

that formula does look rather dense (I often find Wikipedia quite unhelpfully lacking in perspicacity), but it does make sense.

Here's how you translate it:

* forall subsets L of Sigma* ...

Sigma is the "alphabet" of symbols, Sigma* is the set of strings you can make from that alphabet. So this first line says "for all languages (i.e., sets of 'legal' words) L you can make up using words from the alphabet Sigma..."

* regular(L) => ...

This says "if L is regular (i.e., can be matched using a finite state machine), then ...",

* exists p >= 1. forall w in L. |w| >= p =>

This says "there exists a positive p such that for all words w in L, if the length of w is at least p, then ..."

* exists x, y, z in Sigma* ...

"There are words x, y, z such that ..."

* w = xyz /\ ...

"w can be written as the concatenation of x, y, z and ..."

* |y| >= 1 /\ |xy| <= p /\ ...

"the length of y is at least 1 and the length of xy is no more than p and ..."

* forall i >= 0. xy^iz in L

"every word formed by x, then any number of ys, then z must also be in the language L".

Post a Comment