Using Regular Expressions to Find Cthulhu 🐙

Regular expressions allow you to find patterns in strings. You can always use JavaScript’s indexOf to find the first occurrence of a fixed substring, but anything beyond that will require regular expressions.

Regular Expressions in JavaScript

This is how you define a regular expression in JavaScript:

const zipCodeRegex = /^\d{5}(?:[-\s]\d{4})?$/;

This particular regular expression can be used to validate a zip code. It may look confusing, but it’s not so bad once you understand the basic syntax. The only part you need to understand for now is that the regular expression is surrounded by forward slashes (/).
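To see the literal syntax in action, here is a minimal sketch that runs a few strings through that expression with the built-in test method. The sample strings and the zipResults name are made up for illustration.

```javascript
// The pattern from above, written as a regular expression literal.
const zipCodeRegex = /^\d{5}(?:[-\s]\d{4})?$/;

// test() returns true when the string matches the pattern.
// These candidate strings are hypothetical examples.
const zipResults = ["12345", "12345-6789", "1234", "12345-67"].map(
  (candidate) => zipCodeRegex.test(candidate)
);

console.log(zipResults); // [ true, true, false, false ]
```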

Experimenting with Regular Expressions

When I need to write a regular expression, I like to start at Regular Expressions 101. It’s a great tool for testing regular expressions before unleashing them on your code.

Here’s a quick tour. The first thing you’ll want to do as a JavaScript developer is to change the regular expressions flavor on the left.

Languages implement regular expressions in slightly different ways. Selecting the JavaScript flavor ensures you don’t end up writing and testing an expression that doesn’t work as you intended in JavaScript.

Next, we’ll focus on the center section of the app. At the top, you have a text input for your regular expression.

You see the field already has the forward slashes we talked about earlier. It also has a g after the final slash. This is an option flag. The g flag makes this regular expression global. That means it will find every occurrence of our pattern, not just the first one.
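If you want to see the same difference outside the app, this small sketch compares a search with and without the g flag using the string match method. The sample text is invented, not taken from the story.

```javascript
// Hypothetical sample text with two occurrences of the name.
const sampleText = "Cthulhu waits. Cthulhu dreams.";

// Without the g flag, match() reports only the first occurrence.
const firstOnly = sampleText.match(/Cthulhu/);

// With the g flag, match() returns an array of every occurrence.
const everyMatch = sampleText.match(/Cthulhu/g);

console.log(firstOnly.index); // 0
console.log(everyMatch.length); // 2
```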

The large text area under the regular expression field is for the string you want to test your search against.

Here you’ll paste in the string you want to search through. The app will highlight any matches for your pattern. This lets you make sure your regular expression is finding what you want.

Finding One String Inside Another

I’m going to use my regular expression to search H.P. Lovecraft’s The Call of Cthulhu for mentions of Cthulhu himself. I love reading about Cthulhu, but I have a short attention span. The story is pretty long, so I just want the sentences about Cthulhu. I’ll start really simply by using /Cthulhu/g as my expression. Type that into your regular expression input field and check out the test string to see what that matches.

That gives us the name, but we want the entire sentence. To get there, we need to dig into regular expression syntax and we need to think about what makes a sentence.

Defining a Sentence

We look at sentences every day and can easily identify them, but how can we tell a computer what a sentence is? A sentence begins with a capital letter and ends with some kind of punctuation — a period, a question mark, or an exclamation point — but that’s not going to be specific enough. You might have a capital letter in the middle of a sentence which would break that definition.

You might define a sentence as having a subject and a predicate. That’s true, but regular expressions can’t pick those out. We need a simpler way to define this that regular expressions can understand.

If you think about the boundaries of a sentence, that’s helpful. We’ve already covered the punctuation that ends a sentence. What marks the beginning of a sentence? The end of the previous one!

Special Characters in Regular Expressions

To make things easy on regular expressions, we’ve defined a sentence as the characters between two sentence-ending punctuation marks. Most sentences end with a period, so we’ll start there. Before we do though, it’s important to know that regular expressions treat certain characters as literal (meaning the actual characters are matched) while other characters are what I like to call magical (meaning they match something besides the character). It’s easiest to learn by way of example.

A dot in regex (.) is a special character that matches any character that isn’t a newline. If our regular expression is simply /./, and our string is The quick brown fox jumped over the lazy dog., the match would be T. If we make our regular expression global (/./g), it will match every single character in the sentence including the spaces and punctuation, each as a separate match.
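The same experiment can be sketched in code. The variable names here are my own.

```javascript
const pangram = "The quick brown fox jumped over the lazy dog.";

// Without g, the dot matches only the first character.
const firstChar = pangram.match(/./)[0];

// With g, every character (spaces and punctuation included) is its own match.
const allChars = pangram.match(/./g);

console.log(firstChar); // "T"
console.log(allChars.length === pangram.length); // true
```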

All that to say we can’t match a period in our string just by including a dot in our regular expression. We must escape the dot to match it literally. You can escape any special character by preceding it with a backslash (\). (Note: Since the backslash itself is a special character used for escaping other characters, you would match a backslash by adding two backslashes to your regular expression like so: \\.)

Adding Periods to the Expression

Back to our expression. We want to match every sentence with Cthulhu in it, defining a sentence as the characters between two periods. Let’s try this:

/\.Cthulhu\./g;

To break that down, the first forward slash begins your regular expression. The \. tells the regular expression engine we want to match an actual dot. Then, we have the word we want to match, another literal dot (using \.), and the forward slash ending our regular expression. The g flag makes it match every occurrence. Let’s test this on Regular Expressions 101.

You can see in the top-right of the screenshot we now have 0 matches. Why? Because now, we’re only matching this string: .Cthulhu.. We are matching Cthulhu between two periods, but it can’t include any other characters. We need to match all the characters between two periods if Cthulhu is among them.
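A quick sketch makes the point concrete. Both test strings are invented for this example: the first contains the exact text .Cthulhu. and the second is an ordinary sentence.

```javascript
const literalDots = /\.Cthulhu\./g;

// Only the exact text ".Cthulhu." satisfies the pattern.
const exactHit = "Nothing. Then .Cthulhu. appeared".match(literalDots);

// A normal sentence has other characters before the name, so nothing matches.
const normalSentence = "He saw Cthulhu.".match(literalDots);

console.log(exactHit); // [ '.Cthulhu.' ]
console.log(normalSentence); // null
```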

Regular Expression Quantifiers

What characters might be in the sentence and how many might we have? We don’t know, so we’ll need to use the dot we learned about earlier to match any non-newline character. We’ll also use a new concept called a quantifier. A quantifier tells how many of a character you want to allow in your matched string. Here are the basic quantifiers:

* - matches 0 or more of the character
? - matches 0 or 1 of the character
+ - matches 1 or more of the character
{x,y} - matches between x and y of the character. Substitute your own numbers for x and y. For example, {3,9} matches between 3 and 9 of the preceding character.

To use a quantifier, just put the quantifier immediately after the character you want to repeat in your match.
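Here is a rough sketch of the four quantifiers applied to the letter “a”. The sample strings and the matching helper are hypothetical, written only for this illustration.

```javascript
// Hypothetical samples: a "c", some number of "a"s, then a "t".
const samples = ["ct", "cat", "caat", "caaaat"];

// Helper (not from the article): keep the samples that a pattern matches.
const matching = (regex) => samples.filter((sample) => regex.test(sample));

const zeroOrMore = matching(/ca*t/);    // ["ct", "cat", "caat", "caaaat"]
const zeroOrOne = matching(/ca?t/);     // ["ct", "cat"]
const oneOrMore = matching(/ca+t/);     // ["cat", "caat", "caaaat"]
const twoToFour = matching(/ca{2,4}t/); // ["caat", "caaaat"]

console.log(zeroOrMore, zeroOrOne, oneOrMore, twoToFour);
```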

First, let’s figure out which is most appropriate for our use case. We have two spots where we can have any number of characters in our match: between the first period and Cthulhu and between Cthulhu and the second period. We’re going to use the dot for the match since a sentence could have any character in those positions. We just need to decide how many of the characters could be in each position. We’ll start with the first position.

We can go ahead and disqualify the {x,y} since we don’t know the minimum or maximum number of characters we might have. Could we have zero characters before Cthulhu in a sentence? In other words, could our sentence begin with Cthulhu? Sure, it could! That disqualifies + since it needs at least one character for a match. Now, we have only * and ? remaining. Could we have more than 1 character before Cthulhu in the sentence? Yes, we could. That disqualifies ? since it matches no more than a single character leaving us with * as our quantifier. Here’s the regular expression now:

/\..*Cthulhu\./g;

Now, let’s look at the other side of the expression. We can ask the same questions, this time about the part of the sentence after Cthulhu. I’ll skip ahead and tell you the answers are the same. We could have zero or more characters after Cthulhu in the sentence, so, again, * is our quantifier of choice.

/\..*Cthulhu.*\./g;

Let’s break down what we have so far.

/ begins the expression
\. matches a period
.* matches any number of any character
Cthulhu matches just that
.* matches any number of any character
\. matches another period
/ ends the expression
g matches globally

If we try that expression, we have a few matches again!

Inspect the matches, though, and you’ll see they go too far.

This is happening because . matches any character that isn’t a newline character — even another period! If “Cthulhu” is found in a paragraph, the regular expression match starts at the first period and continues on until it hits the final period before the newline which starts the next paragraph. We need a way to exclude periods from the match except for those that bookend the match. The dot character is too inclusive for our purposes.
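You can reproduce the overreach with a tiny made-up paragraph. In this sketch (the text is mine, not Lovecraft’s), we only want the middle sentence, but the match runs all the way to the last period.

```javascript
// Hypothetical three-sentence paragraph on a single line.
const paragraph = "It was dark. Cthulhu slept. The stars were wrong.";

const greedyMatches = paragraph.match(/\..*Cthulhu.*\./g);

// One match, starting at the first period and ending at the final one.
console.log(greedyMatches); // [ '. Cthulhu slept. The stars were wrong.' ]
```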

Character Classes

Regular expressions have some built-in character classes you can use. We’ve used . which matches any non-newline character. You can match a numerical digit with \d, a whitespace character with \s, or a word character (a letter, digit, or underscore) with \w. If those don’t meet your needs, you can create your own. Just enclose the characters you want to match inside square brackets to match any one of those characters. Here’s an example that matches “1,” “2,” or “3”: [123]
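As a small sketch of those classes at work, here they are run against an invented string:

```javascript
// Hypothetical sample string.
const classSample = "Room 42, door 3";

const digits = classSample.match(/\d/g);       // every digit
const oneTwoThree = classSample.match(/[123]/g); // only "1", "2", or "3"
const whitespace = classSample.match(/\s/g);   // every whitespace character

console.log(digits); // [ '4', '2', '3' ]
console.log(oneTwoThree); // [ '2', '3' ]
console.log(whitespace.length); // 3
```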

Since the dot is too inclusive to match only a single sentence in our Cthulhu hunter expression, we can leverage a character class to match exactly what we want. We’ll pair that with the caret (^) to negate the class so that it matches every character that is not in the class. (Note: The caret only does negation when it is the first character inside a character class. It serves a different function elsewhere that we’ll get to later.)

Negating the character class allows us to match every character that isn’t one of our sentence ending punctuation marks.

/\.[^.?!]*Cthulhu[^.?!]*\./g;

We’ve replaced the dot character with our new negated character class which matches any character that isn’t a period, a question mark, or an exclamation mark. Take note that, inside the square brackets, you don’t need to escape most characters. We were able to drop the backslash in front of the period. The only characters you may have to escape inside the character class are the backslash (\), the caret (^), the right bracket (]), and the hyphen (-).

While I’m here, I might as well add support for matching other sentence-ending punctuation since the character class gives us an easy way to do it.

/[.?!][^.?!]*Cthulhu[^.?!]*[.?!]/g;

Let’s break it all down again.

/ begins the expression
[.?!] matches a period, a question mark, or an exclamation mark
[^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
Cthulhu matches just that
[^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
[.?!] matches the ending period, question mark, or exclamation mark
/ ends the expression
g matches globally

If you check the matches, we’re getting really close.
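Here is a minimal sketch of that expression on a made-up paragraph. The match now stops at the end of the Cthulhu sentence, though it still drags along the previous sentence’s period and the space after it.

```javascript
// Hypothetical three-sentence paragraph.
const paragraph = "It was dark. Cthulhu slept. The stars were wrong.";

const boundedMatches = paragraph.match(/[.?!][^.?!]*Cthulhu[^.?!]*[.?!]/g);

console.log(boundedMatches); // [ '. Cthulhu slept.' ]
```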

Capture Groups

The most glaring issue we have right now is that the preceding sentence’s punctuation gets matched. We have to match it because we don’t know where one sentence ends without it. However, we can also tell regular expressions which parts of the match we care about by using capture groups.

Capture groups allow us to grab parts of the match by surrounding characters in our pattern with parentheses. Each surrounded pattern’s match will be pulled out as a capture group. The “Match Information” box in the right sidebar of Regular Expressions 101 will show us the full matches and the capture groups. In JavaScript, we can access each capture group independently.

For our expression, we’ll want to make sure the first period is not enclosed in the parentheses, but the final period should be since it is part of the sentence being captured.

/[.?!]([^.?!]*Cthulhu[^.?!]*[.?!])/g;

To see if this did the job, we’ll check the capture groups under “Match Information” on the right. Here’s the first match’s capture group:

Hieroglyphics had covered the walls and pillars, and from some undetermined point below had come a voice that was not a voice; a chaotic sensation which only fancy could transmute into sound, but which he attempted to render by the almost unpronounceable jumble of letters, “Cthulhu fhtagn”.

That’s closer. We’ve eliminated the preceding sentence’s punctuation by using the capture group, but we still have a leading space. In order to eliminate it, we’ll need to change one of our character classes.
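In JavaScript, one way to read capture groups is the string matchAll method, where index 0 of each result is the full match and index 1 is the first capture group. This sketch uses an invented paragraph and shows the leftover leading space.

```javascript
// Hypothetical three-sentence paragraph.
const passage = "It was dark. Cthulhu slept. The stars were wrong.";

const withGroup = /[.?!]([^.?!]*Cthulhu[^.?!]*[.?!])/g;

// matchAll() needs the g flag and yields one result per match.
const captures = [...passage.matchAll(withGroup)].map((match) => ({
  full: match[0],  // includes the previous sentence's period
  group: match[1], // the capture group only
}));

console.log(captures); // [ { full: '. Cthulhu slept.', group: ' Cthulhu slept.' } ]
```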

Since the character class following the initial sentence-ender character class will match spaces, the spaces also get included in our capture group if we surround it with parentheses. We can break up that character class to be able to capture all the other characters but not the leading space characters. To do that, we’ll match the space characters before our first [^.?!] so that those characters get swallowed up before we get into the capture group. We can match space characters including spaces, tabs, and newlines with the \s character class. We’ll want to match any number of them in case the author includes two spaces between sentences. (In the particular passage we’re experimenting with, only a single space is used, but this will make our expression more generally useful.)

/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!])/g;
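As a quick check of the two-space case, this sketch runs the expression against an invented paragraph with a double space between sentences. The whitespace is consumed before the capture group begins.

```javascript
// Hypothetical paragraph; note the two spaces after "dark."
const spacedPassage = "It was dark.  Cthulhu slept. The stars were wrong.";

const trimmedGroup = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!])/g;

// Collect only the first capture group from each match.
const sentences = Array.from(spacedPassage.matchAll(trimmedGroup), (match) => match[1]);

console.log(sentences); // [ 'Cthulhu slept.' ]
```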

Now, looking at our “Match Information,” we can see our capture group contains this:

Cleaning Up

We’re almost there, but we have three edge cases we can clean up to make this perfect.

Capturing Closing Quotes After Sentence-Enders

You may have noticed one of them earlier. You can see the problem if you examine this captured text closely:

What, in substance, both the Esquimau wizards and the Louisiana swamp-priests had chanted to their kindred idols was something very like this—the word-divisions being guessed at from traditional breaks in the phrase as chanted aloud: “Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.

The problem is that we lose the closing quote of the sentence since it occurs after the period. I believe a closing quote is the only punctuation we should see occurring after the period in the sentence, so we’ll just add that as an optional character to match. To optionally match a character, we can use the ? quantifier. To make it even more robust, I’d like it to match either a right double quote (”) or a straight quote (") since people use them interchangeably. We can do this with a character class.

/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;
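Here is a small sketch of the optional closing quote, using a made-up passage rather than the story text:

```javascript
// Hypothetical passage with a quotation that ends after the period.
const quoted = "He whispered. They chanted, “Cthulhu fhtagn.” Then silence.";

const withQuote = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;

const quotedSentences = Array.from(quoted.matchAll(withQuote), (match) => match[1]);

// The closing quote is now part of the captured sentence.
console.log(quotedSentences); // [ 'They chanted, “Cthulhu fhtagn.”' ]
```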

Here’s that previous capture with this new expression:

It’s still a little odd with the tab before the quote. I’m not sure there’s a great way to deal with that without losing the context before it, and I think that should be there.

Case

Next up, we have one instance of the word “Cthulhu” that isn’t matched. That’s because it’s in all-caps. We don’t care about case for our “Cthulhu” matches, so let’s make it case insensitive. Let’s start with two ways we can do this that are the wrong ways in this case.

We could change each character in Cthulhu to a character class containing both the upper and lower case letters. That would look like this: [Cc][Tt][Hh][Uu][Ll][Hh][Uu]. It’s fine and it works, but it’s hard to type, hard to read, and hard to maintain. If you had a single letter or two, this would be a fine solution. For anything more than that, let’s look elsewhere.

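Some regular expression flavors, such as PCRE, let you activate option flags in the middle of an expression. To do so, you do this: (?<flag>) replacing <flag> with the actual character for the flag you want to activate. The flag we want here is i for “insensitive.” We can then remove the flag with (?-<flag>). Think of that as subtracting the flag. With that change, the “Cthulhu”-matching part of our expression will look like this: (?i)Cthulhu(?-i). Be aware that JavaScript engines have traditionally rejected this inline syntax as an invalid regular expression, so check that your target flavor accepts it before relying on it.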

The reason this isn’t the right method here is that we know none of our expression needs to be case sensitive. It’s easier in our case to just make the entire expression insensitive by tacking the i onto the flags for the entire expression after the closing slash.

/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;
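To see what the i flag changes, this sketch runs both versions of the expression against an invented all-caps sentence:

```javascript
// Hypothetical passage with the name in all-caps.
const shout = "They screamed. CTHULHU rises! All fled.";

const caseSensitive = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;
const caseInsensitive = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;

const strictHits = Array.from(shout.matchAll(caseSensitive), (match) => match[1]);
const relaxedHits = Array.from(shout.matchAll(caseInsensitive), (match) => match[1]);

console.log(strictHits); // []
console.log(relaxedHits); // [ 'CTHULHU rises!' ]
```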

Generalizing for Other Texts

We can still make one more improvement, though, to make our expression more general. I’m pretty sure we’ve matched every instance of “Cthulhu” in The Call of Cthulhu, but, if we want to be able to match against any text that might mention Cthulhu, there’s one case we’ll miss with our current expression.

Since we start our match at sentence-ending punctuation, we have no way of matching the first sentence. In this case, the first sentence doesn’t mention Cthulhu, but what if we want to use this same expression against a text that does mention him in the first sentence? Our expression will miss that sentence entirely.

We need three tools to fix this problem. The first is anchors. An anchor describes where you want the match to appear rather than what you want it to contain. If you want to match only the beginning of your search string, use the caret (^). To match at the end of your string, use the dollar sign ($). What we want to do here is to start one of our matches at either a sentence-ender or at the beginning of the string we’re searching through.

This creates a problem because we can’t just place the caret inside our existing sentence-ender character class. Placing it first would negate the class (meaning we would match anything that isn’t a sentence-ender), and placing it anywhere else in the class, or escaping it, would match a literal caret rather than the beginning of the string. We need a different way to give the expression two options for what to match.

The second tool is alternation. By placing a pipe character (|) between two patterns, you can offer them as alternatives for a match. a|b would match either an “a” or a “b” for a single character position just like [ab]. This is just the tool we need to solve our problem since a character class won’t work with an anchor.
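A short sketch shows both points: alternation behaving like a character class for single characters, and alternation accepting an anchor where a character class can’t. The sample string is made up.

```javascript
// Hypothetical sample string.
const letters = "cab cab";

// For single characters, a|b and [ab] find the same matches.
const viaPipe = letters.match(/a|b/g);
const viaClass = letters.match(/[ab]/g);

// An alternative can include an anchor: "c" only at the start, or any "b".
const anchoredOrB = letters.match(/^c|b/g);

console.log(viaPipe); // [ 'a', 'b', 'a', 'b' ]
console.log(viaClass); // [ 'a', 'b', 'a', 'b' ]
console.log(anchoredOrB); // [ 'c', 'b', 'b' ]
```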

Now, we have one small remaining problem. If we just drop the pipe in and add a caret, the entire expression is split around the pipe meaning the expression would match either a sentence ender or the entire remainder of the expression. Here’s that expression with some extra spacing so you can see the two alternatives:

/[.?!] | ^\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/;

The alternatives should be the sentence-ender or the start-of-string anchor, but instead we’ve split between the sentence-enders and everything else. To fix this, we need our third tool: a group.

The confusing part about grouping in regular expressions is that it’s done with the same mechanism used for capture groups. Surround part of your expression with parentheses to create a group. Splitting inside the group only creates two alternatives within the group. So, we’ll enclose our sentence-enders in a group, drop a pipe in after them, and add our anchor to create our alternatives for matching. Here it is:

/([.?!]|^)\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;

Here’s why this is especially confusing: the parentheses not only create a group for the purposes of our alternation, they also create an additional capture group. Now, instead of a single capture group with the sentence, we have a new first capture group containing the previous sentence’s sentence-ender (or an empty string when the match starts at the beginning of the string), followed by the capture group we care about. This is easy to work around (just ignore the first capture group), but it could really cause some confusion if you didn’t know to look for it. In fact, if you were pulling the sentences out of the first capture group in a JavaScript app, you’d now have a bunch of periods instead. You’d fix this by grabbing the second capture group in your app instead.
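Here is a rough sketch of how that might look in code. The text is an invented four-sentence string that mentions Cthulhu in its very first sentence, and the variable names are my own.

```javascript
// Hypothetical text; the first sentence mentions Cthulhu.
const story = "Cthulhu sleeps in R'lyeh. Nobody knows why. Who woke Cthulhu? Not me.";

const finalRegex = /([.?!]|^)\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;

const allMatches = [...story.matchAll(finalRegex)];

// Group 1 holds the preceding sentence-ender (empty at the start of the string).
const enders = allMatches.map((match) => match[1]);
// Group 2 holds the sentence we actually want.
const cthulhuSentences = allMatches.map((match) => match[2]);

console.log(enders); // [ '', '.' ]
console.log(cthulhuSentences); // [ "Cthulhu sleeps in R'lyeh.", 'Who woke Cthulhu?' ]
```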

To test this, I’ll add a Cthulhu sentence at the beginning and make sure it gets matched. Here are the results of that test:

As you see in the screenshot, our new changes can match the first sentence as well as any of the others.

Final Breakdown

Let’s break down the final regular expression just to be sure we understand all of it. Here’s the whole thing again: