Using Regular Expressions and Retaining your Sanity

October 2, 2011    

At a recent Austin, Texas CocoaCoder meeting, I made an offhand comment giving someone a regular expression that would help with a problem they were having. That led to two things. First, I was asked to put together a presentation (which I’ve been working on) on using regular expressions to give at an upcoming CocoaCoder meeting, and second, I was asked why on Earth anyone would use something as opaque and unmaintainable as a regular expression in this day and age.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Or so goes the old saying. Personally, I rarely find this to be true. On the occasions when I can identify with that quote, it’s because I’m trying to deal with a regex written by someone who, in my opinion, tried to do far too much at once.

I have a certain philosophy that I try to follow when I use regular expressions, and it seems to keep me out of trouble. In this post, I’m going to try to make a set of rules out of that philosophy. They are:

*Limit yourself to only the basic meta-characters.

*Favor clarity over brevity.

*Take more smaller bites.

*Beware of greedy matching.

Let’s break these down.

##1. Limit yourself to only the basic meta-characters.

Pretty much every regex tutorial or man page has a giant laundry list of what characters in a regular expression match what characters in the string you’re trying to match. I ignore most of these and look them up when I have to (which is only when I’m looking at other people’s code). I use a few “phrases” over and over, so let me go through some of those to try to give you some examples:

^.* means “the junk to the left of what I want”

This breaks down as ^ (the beginning of the string) followed by .* (any number of any character). Likewise:

.*$ means “the junk to the right of what I want”

This breaks down as .* (any number of any character) followed by $ (the end of the string).

[0-9][0-9]* means “a number with at least one digit”

The brackets ( [ and ] ) mean “any of the characters contained within the brackets”. So this means 1 character of 0-9 (so 0 1 2 3 4 5 6 7 8 or 9) followed by zero or more characters from that same set.

[^A-Za-z] means “any character that’s not a letter”

The ^ as the first character inside the brackets reverses the meaning: it means “not”, so instead of meaning “any letter” it means “anything that isn’t a letter”. Likewise, [^0-9] means “anything that isn’t a digit”.

\. means “a literal dot”

The backslash takes away the dot’s special meaning, so this is what you’d use to match the . in .com

( …stuff… ) means “stuff I want to refer to later”

This lets the string matched inside the parentheses be retrieved later.
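To make those phrases concrete, here is a minimal sketch in Python. Python's re module and the sample string are my assumptions for illustration, not something this post depends on; the same patterns should behave the same way in most regex engines.

```python
import re

sample = "Order 66 shipped to example.com"  # hypothetical sample string

# ^.* in front of a landmark: "the junk to the left of what I want"
without_left = re.sub(r'^.*to ', '', sample)

# .*$ after a landmark: "the junk to the right of what I want"
without_right = re.sub(r' shipped.*$', '', sample)

# [0-9][0-9]*: a number with at least one digit
number = re.search(r'[0-9][0-9]*', sample).group(0)

# [^A-Za-z]: any character that's not a letter (here, the first space)
first_non_letter = re.search(r'[^A-Za-z]', sample).group(0)

# \. is a literal dot; ( ... ) remembers what matched inside it
tld = re.search(r'\.([a-z][a-z]*)', sample).group(1)

print(without_left, without_right, number, repr(first_non_letter), tld, sep=" | ")
```

Each line uses only one or two of the phrases, which is the point: none of them needs a lookup table to read.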

##2. Favor Clarity Over Brevity

There are a lot of shortcuts you can take, like \w, which means “any character that appears in a word” (I know because I just looked it up). Don’t use it or things like it, because \W (that’s a capital ‘W’) means “not a word character” and they look too much alike. Also, the mnemonic is rubbish: does “W” stand for “Word” or “Whitespace”? Use [A-Za-z] instead (or [A-Za-z0-9_] if you also want the digits and underscore that \w counts as word characters). It’s clearer when you’re writing it, and clearer when you look at it later. It’s more keystrokes, but it’s worth it. Likewise, there’s a ‘+’ which means “one or more”, so [0-9]+ is equivalent to [0-9][0-9]*, but if you always do it the second way, there’s one less special character to remember and you won’t ever accidentally type [0-9]* when you meant [0-9]+ and accidentally match nothing.
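A quick way to convince yourself of that equivalence, and of the empty-match trap, is a sketch like the following. It assumes Python's re module; the helper name find and the sample inputs are made up for the demonstration.

```python
import re

def find(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

samples = ["42", "abc", "", "7 of 9"]  # hypothetical inputs

# The long spelling and the '+' shortcut agree on every sample.
long_form = [find(r'[0-9][0-9]*', s) for s in samples]
short_form = [find(r'[0-9]+', s) for s in samples]
print(long_form == short_form)

# The slip the long spelling protects against: [0-9]* on its own is
# satisfied by zero digits, so it "matches" an empty string.
print(repr(find(r'[0-9]*', "abc")))
```

The last line is the one to remember: a pattern that can match nothing will report success on input that contains no digits at all.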

##3. Take more smaller bites.

This is really the core of it. I often use several different regular expressions to get one string I want. It’s more typing and it may potentially create some intermediate strings that have to be thrown away, but it’s much easier to read. So if I were trying to extract an HREF link from an HTML document, for example, I’d use two regular expressions:

^.*href=["']

and

["'].*$

The first one of those matches everything up to the single or double quote before the URL, and the second one matches from the quote after the URL through the rest of the string. I use a substitute mechanism to throw both of those parts of the string away, and I’m left with the URL I want. (Note: those regexes are simplified for illustration purposes. For example, the = sign might have whitespace around it, etc.)
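The two-bite approach might look like this in Python. This is only a sketch: re.sub stands in for whatever substitute mechanism you prefer, and the one-line HTML snippet and the variable names are hypothetical.

```python
import re

# Hypothetical single-line input containing exactly one link.
html = '<p>See <a href="http://example.com/page">this page</a> for details.</p>'

# Bite 1: throw away everything up to and including the opening quote.
LEFT_JUNK = r'''^.*href=["']'''
after_first_bite = re.sub(LEFT_JUNK, '', html)

# Bite 2: throw away the closing quote and everything after it.
RIGHT_JUNK = r'''["'].*$'''
url = re.sub(RIGHT_JUNK, '', after_first_bite)

print(after_first_bite)
print(url)
```

Printing the intermediate string is part of the appeal: if the result is wrong, you can see which bite went astray.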

##4. Beware greedy matching

People get themselves in trouble by forgetting about greedy matching, which is the rule that a * in a pattern matches as much as it can. So let’s say we were doing URL extraction again, and you used the pattern:

href="(.*)"

In theory, that should say: grab the thing in between the quotes right after the string href=" and remember it. However, if I gave you the string <a href="http://1.example.com/">This is a link</a> but <a href="http://2.example.com/">This is a link, too.</a> and you ran your regex against it, you’d get http://1.example.com/">This is a link</a> but <a href="http://2.example.com/ as what your regex remembered, because the .* grabbed everything between the very first quote and the very last quote. (Anchoring the pattern as ^.*href="(.*)".*$ doesn’t rescue you: the leading .* is greedy too, so it runs ahead to the last href=" and you quietly get only the second URL.) In this case, either use href="([^"]*)" (replacing the .* with [^"]* so that it can’t match quotes) or use more regexes, and take a smaller bite with each.
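Here is a small sketch of the difference in Python (my choice of language, not this post's). It assumes the patterns are applied as an unanchored search, in the forms href="(.*)" and href="([^"]*)", and the two-link sample string is made up for the demonstration.

```python
import re

# Hypothetical input with two links on one line.
html = ('<a href="http://1.example.com/">This is a link</a> but '
        '<a href="http://2.example.com/">This is a link, too.</a>')

# Greedy: .* runs from the first quote all the way to the last quote.
greedy = re.search(r'href="(.*)"', html).group(1)

# Smaller bite: [^"]* cannot cross a quote, so it stops at the first one.
careful = re.search(r'href="([^"]*)"', html).group(1)

print(greedy)
print(careful)
```

The first print shows both links mashed together; the second shows only the first URL.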

That’s a long enough post for now. Next time, I’ll introduce an Objective-C category I’m working on (and will hopefully have done by then) which will simplify using NSRegularExpression, and we’ll work through more examples.