Regular expressions and regular expressions

How do the regular expressions of perl square with the regular expressions of theory?

Hera Brown

Department of Computer Science, University of Oxford, UK

When talking about regular languages, we talk about regular expressions, which are convenient languages for defining and specifying particular regular languages. These regular expressions are built up from atoms of the empty string $\emptystring$ and single letters $a, b, c, \dots$ of an alphabet $\alphabet$, disjunctions of regular expressions $E + F$, concatenations of regular expressions $EF$, and the Kleene star $E^*$. These atoms and operations define all the regular languages.

When talking about perl, we talk about regular expressions, which are patterns which we match strings against. These regular expressions are built up from lots of slashes and brackets and dollars. These slashes and brackets and dollars are very useful in file manipulation.

When two things in computer science have the same name, there’s usually a good reason for this, even if the two things look very different. The regular expressions of theory define languages and the regular expressions of perl match against individual strings, but they’re still similar, and the reasons for that similarity are interesting to think about. Here I’ll look at how we can attempt to translate the regular expressions of perl into the regular expressions of theoretical computer science, by going through the specification in Wall’s Programming Perl (or at least the version I have in front of me now). We’ll see if they’re really the same thing (spoiler alert: they’re not).

High-level similarities

First off, let’s address the difference between the languages matched by theoretical regular expressions and the individual strings matched by perl’s regular expressions. Even though perl regular expressions match individual strings we can look at them through the lens of defining whole languages, so let’s get that done.

A perl regular expression $E$ matches a string $S$ just when there’s some substring of $S$ which matches $E$. We say that a substring matches $E$ just when that substring is in the language generated by $E$, when we treat $E$ as a regular expression of theoretical computer science.

With that in mind, we can say that a string $S$ is matched by a perl regular expression $E$ just when $\kleene{\alphabet} E \kleene{\alphabet}$ contains $S$ as a word. Here we’ll view $\alphabet$ as the set of ASCII characters. For instance, when we write something like $\code{\$string\ \texteq\tilde\ \slash{}blah\slash}$, we’re really asking whether $\code{\$code}$ is an element of the language $\alphabet^* \code{blah} \alphabet^*$.

We now have a high-level account of how the regular expressions of perl can be read as the regular expressions of theoretical computer science. Now let’s look closer at the individual parts of the regular expressions of perl.

A formal description

Let’s look first at how the structure of perl regular expressions and theoretical regular expressions agree. Helpfully, in Programming Perl [1] we get a formal overview of what a perl regular expression is, and so let’s look at that formal description now.

Structural similarities

We say that a perl regular expression $E$ matches a string $S$ (i.e. the string is in the language $\alphabet^* E \alphabet^*$ just when any alternatives of the regular expression match the string $S$. Alternatives are delimited by a pipe character $|$, and said alternatives are separated in just the way that the disjunction operator $+$ separates theoretical regular expressions. So this part of the formal descriptions specifies that all perl regular expressions $E$ may be translated into theoretical regular expressions of the form $E_1 + E_2 + \dots + E_n$.

Each of these alternative $E_i$ matches a string, Wall says, just when every item in the alternative matches in the order the items occur. What that means is that a perl regular expression $E_i$ of the form $I_1 I_2 \dots I_m$ matches a string just when the items $I_1, I_2, \dots, I_m$ occur in order in the string.

Note that the items $I_1, I_2, \dots$ don’t necessarily have to be atomic single letters, but they can be. What we’ve seen so far from the formal description is that all perl regular expressions are disjunctions over sequences of items, which we can describe theoretical regular expressions as as well.

Finally, Wall specifies that items $I_1, I_2, \dots$ consist of either assertions or quantified atoms. An assertion asserts that the word being matched appears in a particular location in the string $S$; the four assertions we can use in perl are:

We’ll look at atoms and quantified atoms in the sequel.

In short, structurally, a regular expression of perl is a disjunction over alternatives, where each disjunct is a sequence of items, and where each item is either an assertion or a quantified atom.

Atomic similarities

Let’s now look at what a quantified atom in a perl regular expression is, and how those relate to theoretical regular expressions of the form $\kleene{a}$ where $a$ is a letter of the alphabet $\alphabet$.

Quantified atoms do what they say on the tin; a string matches a quantified atom just when that string is that atom repeated some number of times. The quantifiers we can use in perl, where $\code{a}$ is an atom, are as follows:

Those are the quantifiers; let’s now look at some atoms.

We have a long list of legal atoms that may appear in a perl regular expression; these atoms are:

It would be very nice to provide a neat general formulation of this atom as a theoretical regular expression, but unfortunately this cannot be done! Using backreferencing to substrings, we can match all strings of the form $ww$ for words $w$ in $\alphabet^*$, by writing $\code{(.*) \backslash 1}$. But note that the language $\alphabet^* w w \alphabet^*$ isn’t regular; no finite automaton has the memory to be able to remember arbitrary words $w$. So, in theory, the regular expressions of perl are expressively stronger than the regular expressions of theoretical computer science!

If we’d really like to salvage regularity, though, then we can use the fact that perl will only ever run on a computer with finite memory. Thus, suppose that any string that will ever be represented on your computer has at most $n$ characters from $\alphabet^*$ in it. We can then define the backreference to a substring $\code{(.*)\backslash 1}$, for instance, as the disjunction $w_1w_1 + w_2 w_2 + \dots + w_m w_m$, where $w_1, w_2, \dots, w_m$ is an enumeration of all the strings in $\alphabet^*$ with length less than or equal to $n$. This gets us a regular expression with finite size, and so the language matched by backreferencing substrings comes out regular on this interpretation.

Unfortunately this is cheating, since every finite language is regular; given such a finite language, you just take a big disjunction over every word of the language and put that in a regular expression. It also doesn’t help us in representing the language efficiently, since we don’t get a compact representation of the language we’re denoting (we may well have a disjunction over millions of alternatives, which we can’t efficiently iterate over or otherwise work with when checking perl regular expressions against words).

In any case, we have to augment perl regular expressions here so they’re strictly more expressive than the regular expressions of theoretical computer science. That’s a shame.

Apart from all the above atoms, the only other atoms are backslashed characters not mentioned above, which match the character that’s been backslashed, and characters generally not mentioned above, which match just the characters. It’s this latter type of atom that lets us match against the usual alphanumeric strings.

Conclusion

It’s fun to see why things have the names that they do, even if on the face of it the names don’t make full sense. The regular expressions of perl are like this, in that they don’t look much like the regular expressions of theoretical computer science. Here we’ve looked at some of their similarities, though, and seen why it’s a decent idea to call the perl constructs ‘‘regular expressions’’ (even if they are expressively stronger than this!).

Bibliography