Expressing Oneself, Fluently


I use regular expressions[1] in Python often enough that I know many of the character classes and syntax tricks.

I use regular expressions in Python seldom enough that I forget many of the character classes and syntax tricks.

This is annoying, but I have lived with it. Then I came across a little article in Dr Dobb's Journal (which used to be great, many years ago, but is now a mere ghost of its former glory), in which Al Williams writes about (ab)using the C/C++ compiler to allow regular expressions to be written as:

start + space + zero_or_more + any_of("ABC") + literal(":") + group(digit + one_or_more)

rather than

^\s*[ABC]:(\d+)
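
For comparison, here is how the terse form behaves when handed straight to Python's re module. This is only a quick sketch; the sample strings are made up for illustration and the variable names are mine.

```python
import re

# The terse pattern from above: optional leading whitespace, one of A/B/C,
# a colon, then a captured run of digits
pattern = re.compile(r"^\s*[ABC]:(\d+)")

# Made-up sample input: leading spaces, a valid letter, then digits
match = pattern.match("   B:42 units")
captured = match.group(1)

# "D" is not in the character class, so there is no match at all
no_match = pattern.match("D:7")
```

Here captured holds the digits as a string and no_match is None.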

I quite like the idea of fluent syntax (though I appreciate that it's not necessarily so appealing if your native language isn't English), but I spend marginally more time writing Python than C++ these days. Also, I like the idea of trying to build a fluent interface in a functional way. So, I started out by writing some examples of what I would want to be able to do:

start().end() would give the minimal empty string regex "^$". Easy enough.

Or how about

any_number_of().digits().followed_by().dot().then().at_least_one().digit()

You get the general idea: the fluent syntax describes the expression and results in a string that matches what it describes. One really good thing is that it avoids the "backslash plague" that can confuse those new to writing regular expressions in Python.
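
To make the "backslash plague" concrete, here is a small sketch using only the standard re module. The sample text is an assumption chosen for illustration; the behaviour of "\b" in an ordinary string literal is standard Python.

```python
import re

text = "a cat sat"

# Raw string: \b reaches the regex engine as a word-boundary assertion
raw_match = re.search(r"\bcat\b", text)

# Plain string: Python turns "\b" into a backspace character (0x08) first,
# so the pattern looks for backspace + "cat" + backspace and finds nothing
plain_match = re.search("\bcat\b", text)

# Without raw strings, every backslash meant for the regex must be doubled
doubled_match = re.search("\\bcat\\b", text)
```

The raw and doubled forms both find "cat"; the plain form silently matches nothing, which is exactly the kind of surprise a fluent interface sidesteps.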

This now exists as grimace, on GitHub at https://github.com/benlast/grimace, and it will do the above and more. Time for some more intense code examples:

The grimace.RE object is our starting point; any method we call on it returns a new RE object. Let's get the regex to match an empty string.

>>> from grimace import RE
>>> print RE().start().end().as_string()
^$

The as_string() call turns the generated expression into a string that can then be used as the argument to the standard Python re module. There's also as_re() which will compile the regular expression for you and return the resulting pattern-matching object.
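
The "every method returns a new object" idea can be sketched in a few lines. To be clear, this is a toy written for illustration and not grimace's actual implementation: the class name FluentRE, its internals, and the small set of methods are all assumptions; only the method names start, end, dot, literal, as_string and as_re are borrowed from the examples in this post.

```python
import re


class FluentRE(object):
    """Toy immutable regex builder (hypothetical; not grimace's real code)."""

    def __init__(self, parts=()):
        # The fragments collected so far, kept as an immutable tuple
        self._parts = tuple(parts)

    def _add(self, fragment):
        # Never mutate: return a fresh builder with the fragment appended
        return FluentRE(self._parts + (fragment,))

    def start(self):
        return self._add("^")

    def end(self):
        return self._add("$")

    def dot(self):
        # A literal dot, escaped so the caller never types a backslash
        return self._add(re.escape("."))

    def literal(self, text):
        return self._add(re.escape(text))

    def as_string(self):
        return "".join(self._parts)

    def as_re(self):
        return re.compile(self.as_string())


# Because builders are never modified, a partial expression can be reused
anchored = FluentRE().start()
empty_line = anchored.end()
```

After this runs, empty_line.as_string() gives "^$" while anchored.as_string() is still just "^", so a half-built expression can safely be shared between several finished ones.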


>>> #Extract the extension of a short DOS/FAT32 filename, using a group to capture that part of the string
>>> regex = RE().start().up_to(8).alphanumerics().dot().group(name="ext").up_to(3).alphanumerics().end_group().end().as_string()
>>> print regex
^\w{0,8}\.(?P<ext>\w{0,3})$
>>> #Use the re module to compile the expression and try a match
>>> import re
>>> pattern = re.compile(regex)
>>> pattern.match("abcd.exe").group("ext")
'exe'
>>>
>>> #We can do that even more fluently...
>>> RE().start().group(name="filename").up_to(8).alphanumerics().end_group() \
... .dot().group(name="ext").up_to(3).alphanumerics().end_group() \
... .end().as_re().match("xyz123.doc").groups()
('xyz123', 'doc')
>>>
>>> #The cool example that I wrote out as a use case
>>> print RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit().as_string()
\d*\.\d+
>>>
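
A quick check of what that last generated pattern accepts, using plain re. The test strings are made up for illustration.

```python
import re

# The string produced by the fluent expression above
decimal = re.compile(r"\d*\.\d+")

# Digits before the dot are optional, digits after it are required
with_integer_part = decimal.match("3.14")
without_integer_part = decimal.match(".5")

# No dot, or nothing after the dot: no match
no_dot = decimal.match("42")
trailing_dot = decimal.match("7.")
```

The first two calls match the whole input; the last two return None. Note that the pattern has no end anchor, so match() will also accept a string that merely begins with a decimal number.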


You can also split a complex regex over several lines:

# My python module
import re

from grimace import RE

def is_legal_number(number):
    #Match a US/Canadian phone number - we put the RE() stuff in parentheses so that we don't
    #have to escape the ends of lines
    north_american_number_re = (RE().start()
      .literal('(').followed_by().exactly(3).digits().then().literal(')')
      .then().one().literal("-").then().exactly(3).digits()
      .then().one().dash().followed_by().exactly(4).digits().then().end()
      .as_string())
    number_re = re.compile(north_american_number_re)
    return number_re.match(number) is not None
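
For readers who want to see the other side of the trade, here is a hand-written counterpart. It assumes the fluent chain above describes a parenthesised three-digit area code, a dash, three digits, a dash, and four digits; the exact string grimace generates is not shown in this post, so the pattern below is my reading of the chain and not its verified output. The function name is hypothetical.

```python
import re

# Hand-written pattern for the format (NNN)-NNN-NNNN; assumed, not taken
# from grimace's output
NORTH_AMERICAN_NUMBER = re.compile(r"^\(\d{3}\)-\d{3}-\d{4}$")


def looks_like_north_american_number(number):
    """Return True if number has the shape (NNN)-NNN-NNNN."""
    return NORTH_AMERICAN_NUMBER.match(number) is not None
```

Calling looks_like_north_american_number("(123)-456-7890") returns True; dropping the parentheses or a digit makes it return False.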


There is more to do: control over greedy matching, and ways to express some of the more complex tricks like backreferences and multiple matching subexpressions. And I'll also package it properly for installation via pip. But for now, it's available and it works.
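
For anyone unsure what those missing features look like, here they are in plain re syntax. This is standard regular expression behaviour and not anything grimace provides yet; the sample strings are made up.

```python
import re

tags = "<a><b>"

# Greedy: .* grabs as much as it can, so the match runs to the last ">"
greedy = re.match(r"<.*>", tags).group(0)

# Non-greedy: .*? stops at the first ">" that lets the match succeed
lazy = re.match(r"<.*?>", tags).group(0)

# Backreference: \1 must repeat whatever group 1 captured,
# which makes this a doubled-word detector
doubled_word = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)
```

Here greedy is the whole string, lazy is only the first tag, and doubled_word is the repeated word.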

(The documentation for grimace is now on the GitHub wiki.)

[1] I can't write any article on regular expressions without quoting Jamie Zawinski: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
