Some people, when confronted with a problem, think “I know, I’ll use regular expressions.”
Now they have two problems.
-- Jamie Zawinski, in comp.emacs.xemacs
And it is true. Regular expressions, or REs for short in Python parlance, are a beast: powerful, yet difficult to handle, read, and maintain. It is tempting to use regular expressions to solve any problem involving string parsing. However, it is just as easy to fall into the traps they set.
I was reading the Python Regular Expression HOWTO this morning and came across what are called “named groups”. I didn’t know Python’s re module (the regular expression engine) supported them. In regular expressions, if you want to extract the part of a string that matches part of a pattern, you enclose that part of the pattern in the special metacharacters “(” and “)”. A pattern enclosed in these metacharacters is called a group. The name comes from the way parentheses group smaller expressions together in mathematics.
In Perl, if one wants to extract the user and domain parts from a standardised e-mail address (I am not considering exceptions here), one could use the following RE:
my $email = 'firstname.lastname@example.org';
my ($user, $domain) = $email =~ m/([\w+.]+)@([\w+.]+)/;
That was a simple RE with only two groups to match. If there are more groups, they can be accessed using Perl’s special variables $1, $2, and so on, which automatically contain the substrings matched by the grouped parts of the RE. One has to keep track of the numbers, and since REs have a naughty habit of getting messy very fast, this can go wrong in all sorts of ways. I can safely say I have been burned by that before: REs with more than ten groups, and me ending up confused by the group numbering. I am not sure if Perl has a cleaner workaround for this. I haven’t looked.
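To make the bookkeeping problem concrete, here is a small sketch of how group numbers shift when a pattern grows. It is written in Python rather than Perl so that it lines up with the examples later in this post, and the log line and both patterns are made up purely for illustration.

```python
import re

# Hypothetical log line and patterns, invented for this illustration.
line = "2009-03-14 alice login"

# Version 1: three groups -> 1 = date, 2 = user, 3 = action.
v1 = re.compile(r"(\d{4}-\d{2}-\d{2}) (\w+) (\w+)")

# Version 2: the date is split into year, month and day, so every
# group after it moves: the user is now group 4, not group 2.
v2 = re.compile(r"(\d{4})-(\d{2})-(\d{2}) (\w+) (\w+)")

user_v1 = v1.search(line).group(2)    # the user, as intended
stale_v2 = v2.search(line).group(2)   # old index, now the month
user_v2 = v2.search(line).group(4)    # the index had to be updated

print(user_v1, stale_v2, user_v2)
```

Nothing fails loudly in the second version: the stale index still returns a string, just the wrong one, which is exactly the kind of mistake that is easy to miss.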
Enter the world of the re module in Python. The re module is fascinating and fun to work with. Why? You’ll know when you use it yourself: there are just too many good points to list and too little space to list them in. I will, however, shed light on one feature of the re module that provides a convenient solution to the problem identified in the paragraph above (it also happens to be the theme of this post, so I’d make no sense if I didn’t talk about it).
You already know what groups are, and you know that groups are accessed via numbers (remember $1, $2?). What has Python’s re module got that makes it so fascinating with respect to this particular problem? Named Groups. Yes. Named Groups: groups that can be accessed via names as well. Confusing, eh? A quick example will clear it all up.
import re
pattern = re.compile(r'(?P<user>[\w+.]+)@(?P<domain>[\w+.]+)')
address = pattern.search('email@example.com')
user, domain = (address.group('user'), address.group('domain'))
Isn’t that simply convenient, intuitive, and beautiful?
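A few related conveniences are worth a quick sketch. The pattern below is the same one as above; the variable names are my own, and the None check is an addition to guard against input that does not match.

```python
import re

pattern = re.compile(r"(?P<user>[\w+.]+)@(?P<domain>[\w+.]+)")
match = pattern.search("firstname.lastname@example.org")

if match is not None:                 # search() returns None on no match
    parts = match.groupdict()         # all named groups as a dict
    by_number = match.group(1)        # named groups keep their numbers too
    index = dict(pattern.groupindex)  # mapping of group name -> group number
    print(parts, by_number, index)
```

Since named groups are still numbered, names can be introduced gradually into an existing pattern without breaking code that accesses groups by index.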