grep and regular expression revisited


Yesterday, while helping Laiq figure out the pattern he wanted to match against his data set, I bumped into grep’s odd behaviour with respect to regular expressions. Later that day, I went over the man page for grep at length to find out why grep behaved differently than I expected it to.

In the man page for grep there is a separate section on how regular expressions are interpreted within grep. The information provided in that section is brief, and it is hard to say whether it documents the complete semantics of regular expressions in grep or only a subset. That notwithstanding, I decided to write an (ironically) brief post on grep and its interpretation(s) of regular expressions.

According to the man page, grep understands two different versions of regular expression syntax: basic and extended. You would think that the two would differ in more than one way. However, the man page states that, in GNU grep, there is no difference in available functionality between the two.

The man page lists only one difference between the two syntaxes. With basic regular expression, the meta-characters ?, +, {, |, (, and ) lose their special meaning; to use them in their special capacities, they have to be escaped with a backslash. In contrast, with extended regular expression, these meta-characters have to be escaped only when they are meant literally.

If you are trying to, for example, match the string “[000]”, you could work with grep in one of the following ways:

echo '[000]' | grep '\[0\+\]'
echo '[000]' | egrep '\[0+\]'

In the first example, note that the meta-character + is escaped because we want to use it in its special sense, that is, as a repetition operator. In the second example, the unescaped + already carries its special sense. Also note the use of egrep in the second example as opposed to grep. egrep is essentially nothing more than grep -E (not grep -e, as I first wrote; thanks to Zohair for pointing out the error), which triggers the extended regular expression syntax.
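The same escaping rule applies to the other meta-characters in that list. What follows is a minimal sketch, assuming GNU grep (as far as I know, \| and \+ in basic syntax are GNU extensions, so other implementations may behave differently). The sample strings and variable names are made up for illustration.

```shell
# Interval expression: \{3\} in basic syntax, {3} in extended syntax.
bre_interval=$(echo '[000]' | grep '\[0\{3\}\]')
ere_interval=$(echo '[000]' | grep -E '\[0{3}\]')

# Left unescaped in basic syntax, the braces are ordinary characters.
bre_literal=$(echo '0{3}' | grep '0{3}')

# Grouping and alternation: \( \| \) in basic syntax, ( | ) in extended syntax.
bre_alt=$(printf 'cat\ndog\nbird\n' | grep '\(cat\|dog\)')
ere_alt=$(printf 'cat\ndog\nbird\n' | grep -E '(cat|dog)')

printf '%s\n' "$bre_interval" "$ere_interval" "$bre_literal" "$bre_alt" "$ere_alt"
```

Each basic/extended pair should select the same lines; only the amount of backslashing differs.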

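At first it looked to me as though the named classes of characters do not work with basic regular expression. They do work, in both syntaxes, but only inside a bracket expression. These named classes include (but are not limited to) [:alnum:], [:alpha:], and [:space:], and the brackets in those names are part of the name itself, so the class has to be wrapped in another pair of brackets, as in [[:alnum:]]. Written bare, [:alnum:] is read as an ordinary bracket expression containing the characters :, a, l, n, u, and m, which is why it appeared not to work.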
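For what it is worth, a named class is recognised once it sits inside an outer pair of brackets, and that holds for basic syntax too. A minimal sketch, assuming GNU grep (the \+ in the basic pattern is, as far as I know, a GNU extension); the sample input and variable names are made up for illustration.

```shell
# Basic syntax: the class sits inside a bracket expression, and + needs a backslash.
bre_digits=$(echo '[000]' | grep '\[[[:digit:]]\+\]')

# Extended syntax: same bracket expression, unescaped +.
ere_digits=$(echo '[000]' | grep -E '\[[[:digit:]]+\]')

# Several named classes can share one bracket expression.
mixed=$(printf 'abc123\n---\n' | grep '^[[:alpha:][:digit:]]*$')

printf '%s\n' "$bre_digits" "$ere_digits" "$mixed"
```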

Additionally, I noticed that a limited subset of Perl-style character classes is understood by grep, in both basic and extended interpretations. For example, the \w and \W character classes are understood, but \d and \D are not. If there is a need to match against a numeric pattern, either the named character class in its bracketed form, [[:digit:]], or the expanded [0-9] form can be used. As I noted earlier, what is and what is not supported in terms of character classes, among other things, is not extensively documented in the man page. It is hard to take the absence of something from the man page to mean that it is not supported within grep.

echo '[000]' | egrep '\[[0-9]+\]'
echo '[000]' | egrep '\[[[:digit:]]+\]'
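To see the \w shorthand next to the digit forms, here is a small sketch, assuming GNU grep. The sample line and variable names are made up. How \d is treated outside of -P may vary between versions (newer ones may also print a warning), so the last command only counts matching lines and makes no promise about the result.

```shell
line='user_42 logged in'

# \w matches letters, digits and the underscore, so the first token is one word.
word=$(echo "$line" | grep -o '^\w\+')

# Digits have to be spelled out with a bracket expression.
number=$(echo "$line" | grep -o '[0-9]\+')

# \d is not a digit class in basic or extended syntax; count the lines it matches.
d_count=$(echo '42' | grep -c '\d' 2>/dev/null || true)

printf 'word=%s number=%s lines matched by \\d: %s\n' "$word" "$number" "$d_count"
```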

That mostly sums up grep’s limited interpretation of regular expressions, as I see it. If you are a Perl guy and, not least, a regular expression one, you would feel at home with the -P switch to grep, which has patterns interpreted the way Perl interprets regular expressions (provided your build of grep includes support for it). So, while the character class \d is not available with grep’s basic and extended regular expression syntax, it can be used readily with the -P option.

echo '[000]' | grep -P '\[\d+\]'

Regular expressions aside, there are a couple of very useful switches to grep that are commonly overlooked despite being clearly documented in the man page. One such gem is the -o switch. The default behaviour of grep is to display the entire line which matches the given pattern. Often it is desirable to display only the part which is matched, and not the entire line. The -o switch does just that, printing each matched part on a line of its own.
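A quick sketch of the difference, using a made-up log line (the variable names are only for illustration):

```shell
line='job 17 finished with status [000] at 12:30'

# Default behaviour: the whole matching line is printed.
whole_line=$(echo "$line" | grep '\[0*\]')

# With -o only the matched part is printed.
only_match=$(echo "$line" | grep -o '\[0*\]')

# When a line contains several matches, -o prints each on its own line.
all_numbers=$(echo "$line" | grep -o '[0-9][0-9]*')

printf '%s\n' "$whole_line" "$only_match" "$all_numbers"
```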

There is also the -n switch. It prefixes each matched line (or, when combined with -o, each matched part) with the number of the line in the input where it is found. This is handy when you need to jump to the match in an editor afterwards.
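A small sketch with made-up input, showing -n on its own and together with -o:

```shell
sample=$(printf 'alpha\n[000]\nbeta\n[00]\n')

# -n prefixes every matching line with its line number and a colon.
numbered=$(printf '%s\n' "$sample" | grep -n '\[0*\]')

# Combined with -o, the prefix is attached to each matched part instead.
numbered_parts=$(printf '%s\n' "$sample" | grep -on '00*')

printf '%s\n' "$numbered" "$numbered_parts"
```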

The -x switch forces the pattern to match the whole line exactly, and not just a part of it, much like anchoring the pattern with ^ and $. It is useful, yet not widely known.
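A sketch of the effect on made-up input; the last command uses explicit anchors for comparison:

```shell
sample=$(printf '[000]\nid=[000]\n[000] done\n')

# Without -x, all three lines contain a match; -c counts matching lines.
partial=$(printf '%s\n' "$sample" | grep -c '\[0*\]')

# With -x, only the line that consists of the match and nothing else is selected.
exact=$(printf '%s\n' "$sample" | grep -x '\[0*\]')

# Anchoring the pattern by hand selects the same line.
anchored=$(printf '%s\n' "$sample" | grep '^\[0*\]$')

printf 'partial=%s exact=%s anchored=%s\n' "$partial" "$exact" "$anchored"
```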

Like these, there are a handful more switches that can come in real handy. The man page lists all those.

3 thoughts on “grep and regular expression revisited”

  1. Pingback: grep and regular expression revisited | Tea Break

  2. I believe egrep is equivalent to grep -E rather than grep -e.

    Or maybe that's just a difference in versions.

    Mine is GNU grep 2.5.3 on a Debian 4.0 Lenny. :\

  3. Thanks for pointing the subtle error out, Zohair. -E triggers the extended regular expression syntax and does not take any arguments. Anything after -e is taken to be the regular expression.
