toma[yh]to.

An introduction to regular expressions.

04/03/2016


Tomayto/tomahto. Potayto/potahto. Color/colour. Grey/gray. These are all commonly used words that people around the world spell differently. Because we're all inclusive and tolerant people, we can read these words and not be bothered too much if they differ from our own spelling preferences. That being said, it's easy to be thrown for a loop by how interchangeable spellings can be.

Let's say that my good friend Becky and I are planning on going to the Annual Tomayto Fight in Spain. We divvy up the planning responsibilities, type up separate documents detailing our trip, and merge them together on GitHub. Hooray! We're going to Spain to throw tomaytos at each other.

I run to the store in excitement to buy all the tomaytos I need. Becky was in charge of doing the tomayto estimates, so I open up our document and do a quick search for "tomayto estimate" to find my answer. But wait! Nothing comes up! Becky, did you not do your part?!

After scrolling through the document in a panic, I realize that Becky actually wrote "tomahto estimate." Phew. Just a difference in spelling. I find my answer and buy all the tomahtos I need.

Next, I'm off to pick up our matching t-shirts. I do a quick search in our document for "tomahto t-shirts." Nothing comes up! Shoot, that's right. I had actually spelled it "tomayto t-shirts."

You can probably see how frustrating this would get over a really long document, and over something more complex than just tomayto/tomahto. That's why we use something called regular expressions. Regular expressions, or regexes, help us search for (and sometimes replace) things really quickly. We could have used a regex to find every instance of tomayto and tomahto in our document by just typing toma[yh]to. The bracketed part, [yh], is what we call a character class. Other examples of regexes that use character classes include gr[ea]y, which searches for "grey" and "gray," and pota[yh]to, which searches for "potayto" and "potahto."
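Here's a small sketch of that search in Python, using the standard re module. The document text is made up for illustration; it just stands in for the trip plan Becky and I merged.

```python
import re

# Hypothetical snippet of the merged trip document, mixing both spellings.
document = "Becky wrote the tomahto estimate. I ordered the tomayto t-shirts."

# [yh] is the character class: it matches exactly one character, y or h.
pattern = re.compile(r"toma[yh]to")

matches = pattern.findall(document)
print(matches)  # ['tomahto', 'tomayto']
```

One search, both spellings. No panicked scrolling required.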

A character class matches exactly one character out of the several listed. For example, col[ou]r would search for "color" and "colur," not "colour." If we did want to search for all examples of "color" and "colour," we would have to use the regex colou?r. The question mark makes the preceding token in the regex optional, which here is the u.
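To see the difference between a character class and an optional token, here is a quick sketch that runs a few patterns against a made-up word list. I've used fullmatch so that each pattern has to match the whole word; colo?r is included only for contrast, since it makes the second o optional rather than the u.

```python
import re

# Hypothetical word list: two real spellings and two misspellings.
words = ["color", "colour", "colur", "colr"]

def whole_word_hits(regex):
    """Return the words that the regex matches from start to end."""
    return [w for w in words if re.fullmatch(regex, w)]

class_hits = whole_word_hits(r"col[ou]r")   # exactly one of o or u
optional_u = whole_word_hits(r"colou?r")    # the u may be present or absent
optional_o = whole_word_hits(r"colo?r")     # the second o may be present or absent

print(class_hits)   # ['color', 'colur']
print(optional_u)   # ['color', 'colour']
print(optional_o)   # ['color', 'colr']
```

Only colou?r picks up both "color" and "colour."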

Regexes can feel pretty complicated if you're not used to seeing them. For example, we can use a regex to roughly check the validity of an email address by using \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b with case-insensitive matching turned on (as written, the pattern only lists uppercase letters). This uses a combination of special characters like brackets, dots, and backslashes to search for something a little more complex than just a string. These special characters are referred to as metacharacters.
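Here is a sketch of how that email pattern might be used in Python. The addresses are invented for illustration. Since the pattern only lists A-Z, I'm assuming it is meant to be run with a case-insensitive flag, which in Python is re.IGNORECASE. Keep in mind that this is a rough shape check, not a full validator.

```python
import re

EMAIL_PATTERN = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b"

# Hypothetical text containing made-up addresses.
text = "Write to becky@example.com or BECKY@EXAMPLE.ORG, but not to becky@localhost."

# With the case-insensitive flag, lowercase addresses match too.
relaxed = re.findall(EMAIL_PATTERN, text, re.IGNORECASE)

# Without the flag, only the all-uppercase address matches.
strict = re.findall(EMAIL_PATTERN, text)

print(relaxed)  # ['becky@example.com', 'BECKY@EXAMPLE.ORG']
print(strict)   # ['BECKY@EXAMPLE.ORG']
```

The address becky@localhost is skipped in both cases because the pattern insists on a dot followed by at least two letters after the @ part.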

There are plenty of resources out there for proper regex syntax, but this tutorial is my personal favorite. Some of the most useful regex features include grouping characters together, anchors, and word boundaries.
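As a rough sketch of those three features, here they are applied to a made-up string in Python. The sample text and variable names are my own; the syntax shown, (?:...) for a group, ^ and $ for anchors, and \b for a word boundary, is standard in Python's re module.

```python
import re

# Hypothetical sample text.
text = "tomayto fight, tomahtoes, potayto"

# Grouping: (?:toma|pota) treats the two alternatives as a single unit.
grouped = re.findall(r"(?:toma|pota)[yh]to", text)

# Word boundaries: \b keeps the match from landing inside "tomahtoes".
whole_words = re.findall(r"\btoma[yh]to\b", text)

# Anchors: ^ and $ require the pattern to span the entire string.
anchored_word = re.search(r"^toma[yh]to$", "tomayto")
anchored_text = re.search(r"^toma[yh]to$", text)

print(grouped)                    # ['tomayto', 'tomahto', 'potayto']
print(whole_words)                # ['tomayto']
print(anchored_word is not None)  # True
print(anchored_text is None)      # True
```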

Regexes also differ depending on the programming language you're using. Each language has a regex engine that processes your regexes, and these engines are not completely compatible across languages. This means that you'd be using a different regex "flavor" if you were looking at JavaScript than if you were looking at Ruby.

Keep in mind that regexes are flexible. They can save both your own time and computing time. That being said, they can be complex and might not search for exactly what you wanted. If you're not getting what you're looking for, remember that a regex engine always returns the leftmost match. It's overly excited to spit out an answer, so it gives you the first one it finds, even if a better one sits further along in the text. Take a look at this example to understand the implications of this behavior.
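A minimal sketch of the leftmost-match behavior in Python, with a made-up sentence: the engine stops at the first place the pattern fits, even when that place is in the middle of a longer word.

```python
import re

# Hypothetical sentence where the first fit is buried inside "tomahtoes".
sentence = "Pack the tomahtoes before the tomayto fight."

# search() returns the leftmost match, which sits inside "tomahtoes".
first = re.search(r"toma[yh]to", sentence)

# Adding word boundaries skips the partial word and finds the standalone one.
whole = re.search(r"\btoma[yh]to\b", sentence)

print(first.group(), first.start())  # tomahto 9
print(whole.group())                 # tomayto
```

If the first result isn't the one you wanted, tightening the pattern (here with \b) is usually the way to steer the engine past it.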

So the next time Becky decides she wants to sound a little more British, we know how to deal with her. Just throw regexes in her direction, maybe along with some of those toma[yh]tos.