Named Groups, Regular Expressions, and the Python ‘re’ module.


Some people, when confronted with a problem, think “I know, I’ll use regular expressions.”
Now they have two problems.
-- Jamie Zawinski, in comp.emacs.xemacs

And it is true. Regular expressions, called REs for short in the Python documentation, are a beast: powerful, yet difficult to write, read, and maintain. It is tempting to reach for them whenever a problem involves string parsing, and just as easy to fall into the traps they set.

I was reading the Python Regular Expression HOWTO this morning and came across what are called “named groups”. I didn’t know Python’s re module (the regular expression engine) supported them. In regular expressions, if you want to extract part of the string matched by a pattern, you enclose the part of the pattern that matches that substring in the metacharacters “(” and “)”. A pattern enclosed this way is called a group. The name comes from the way parentheses group smaller expressions together in mathematics.

In Perl, if one wants to extract the user and domain parts from a standardised e-mail address (I am not considering exceptions here), one could use the following RE:

my $email = 'ayaz@ayaz.pk';
my ($user, $domain) = $email =~ m/([\w+.]+)@([\w+.]+)/;

That was a simple RE with only two groups. When there are more, they can be accessed through Perl’s special variables $1, $2, and so on, which automatically hold the substrings matched by each group. One has to keep track of the numbers, and since REs have a naughty habit of getting messy fast, it can go wrong in all sorts of ways. I have been burned by this before: REs with more than ten groups, and me getting confused by the group numbering. I am not sure if Perl has a cleaner workaround for this. I haven’t looked.
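
The bookkeeping problem is not specific to Perl. Here is a minimal sketch in Python of how inserting one group renumbers everything after it. The log line and both patterns are made up for illustration and are not from any real program.

```python
import re

# Hypothetical input, invented for this example.
line = "2008-01-15 ayaz@ayaz.pk login"

# Version 1: group 1 is the user, group 2 is the domain.
v1 = re.compile(r'([\w+.]+)@([\w+.]+)')
m1 = v1.search(line)
print(m1.group(1), m1.group(2))

# Version 2: a date group is added in front, so every later group
# shifts by one. Code that still reads group(1) and group(2)
# now gets the date and the user instead of the user and the domain.
v2 = re.compile(r'(\d{4}-\d{2}-\d{2}) ([\w+.]+)@([\w+.]+)')
m2 = v2.search(line)
print(m2.group(1), m2.group(2))
```

Nothing fails loudly in the second version: the code runs, it just hands back the wrong substrings.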

Enter the re module in Python. It is fascinating and fun to work with. Why? You’ll know once you use it yourself: there are too many good points to list and too little space to list them in. I will, however, shed light on one feature of the re module that offers a convenient solution to the problem described above (it also happens to be the theme of this post, so it would make no sense not to talk about it).

You already know what groups are, and you know that groups are accessed via numbers (remember $1, $2?). What has Python’s re module got that makes it so fascinating with respect to this particular problem? Named Groups. Yes. Named Groups: groups that can be accessed via names as well. Confusing, eh? A quick example will clear it all up.

import re

# (?P<name>...) gives a group a name in addition to its number.
pattern = re.compile(r'(?P<user>[\w+.]+)@(?P<domain>[\w+.]+)')
address = pattern.search('ayaz@ayaz.pl')
user, domain = (address.group('user'), address.group('domain'))

Isn’t that simply convenient, intuitive, and beautiful?
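
As a small follow-up sketch, reusing the same pattern: a match object can also return all named groups at once through groupdict(), and named groups keep their numbers, so the two access styles can be mixed. The variable names match and parts are just illustrative.

```python
import re

pattern = re.compile(r'(?P<user>[\w+.]+)@(?P<domain>[\w+.]+)')
match = pattern.search('ayaz@ayaz.pl')

# search() returns None when nothing matches, so check before using it.
if match is not None:
    parts = match.groupdict()  # every named group, keyed by name
    print(parts['user'], parts['domain'])
    # Named groups are still numbered, so both lookups agree.
    print(match.group(1) == match.group('user'))
```

With names in place, adding another group to the pattern later does not disturb code that looks groups up by name.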

3 thoughts on “Named Groups, Regular Expressions, and the Python ‘re’ module.”

  1. Support for named groups is scheduled for a newer release of Perl, namely 5.10.0. However, as I found out, Perl 5.8.x has a hack that comes close to mimicking named-group functionality: the (?{ code }) construct. Take a look at perldoc perlreref.

  2. Pyparsing does something very similar, called results names. Here is your e-mail parser implemented using pyparsing:

    >>> from pyparsing import Word,alphanums
    >>> pattern = Word(alphanums+".")("user") + "@" + Word(alphanums+".")("domain")
    >>> address = pattern.parseString("ayaz1@ayaz.pl")
    >>> address.user
    'ayaz1'
    >>> address.domain
    'ayaz.pl'
    >>> address.keys()
    ['domain', 'user']
    >>> address.asList()
    ['ayaz1', '@', 'ayaz.pl']

    For reasons very similar to those in your post, I encourage pyparsing users to use results names rather than trying to pick out address[0] as the user and address[2] as the domain (although this is perfectly legitimate). But I would suggest that the pyparsing version is a little easier to follow, and probably more likely to be maintainable (for example, how would you add the '-' and '_' characters to the allowed user and domain fields?).

    — Paul
