Using Regular Expressions to Find Cthulhu
Regular expressions allow you to find patterns in strings. You can always use Javascript’s indexOf to find the first occurrence of a simple pattern, but anything beyond that will require regular expressions.
Regular Expressions in Javascript
This is how you define a regular expression in Javascript:
const zipCodeRegex = /^\d{5}(?:[-\s]\d{4})?$/;
This particular regular expression can be used to validate a zip code. It may look confusing, but it’s not so bad once you understand the basic syntax. The only part you need to understand for now is that the regular expression is surrounded by forward slashes (/).
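As a quick sketch of how that literal gets used, the snippet below calls test(), the standard RegExp method that reports whether a string matches. The sample inputs are made up for illustration.

```javascript
// The zip code expression from above, written as a regex literal.
const zipCodeRegex = /^\d{5}(?:[-\s]\d{4})?$/;

// test() returns true when the string matches the pattern.
console.log(zipCodeRegex.test("12345"));      // true
console.log(zipCodeRegex.test("12345-6789")); // true
console.log(zipCodeRegex.test("1234"));       // false
```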
Experimenting with Regular Expressions
When I need to write a regular expression, I like to start at Regular Expressions 101. It’s a great tool for testing regular expressions before unleashing them on your code.
Here’s a quick tour. The first thing you’ll want to do as a Javascript developer is to change the regular expressions flavor on the left.
Languages implement regular expressions in slightly different ways. Selecting the Javascript flavor ensures you don’t end up writing and testing an expression that doesn’t work as you intended in Javascript.
Next, we’ll focus on the center section of the app. At the top, you have a text input for your regular expression.
You see the field already has the forward slashes we talked about earlier. It also has a g after the final slash. This is an option flag. The g flag makes this regular expression global. That means it will find every occurrence of our pattern, not just the first one.
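Outside the testing tool, the effect of the g flag can be seen with the standard String match() method. This is a small sketch; the sample string and the /at/ pattern are invented for the example.

```javascript
const flagSample = "the cat sat on the mat";

// Without g, match() stops at the first occurrence.
const firstAt = flagSample.match(/at/);
console.log(firstAt[0], firstAt.index); // at 5

// With g, match() returns every occurrence as a plain array of strings.
const everyAt = flagSample.match(/at/g);
console.log(everyAt.length); // 3
```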
The large text area under the regular expression field is for the string you want to test your search against.
Here you’ll paste in the string you want to search through. The app will highlight any matches for your pattern. This lets you make sure your regular expression is finding what you want.
Finding One String Inside Another
I’m going to use my regular expression to search H.P. Lovecraft’s The Call of Cthulhu for mentions of Cthulhu himself. I love reading about Cthulhu, but I have a short attention span. The story is pretty long, so I just want the sentences about Cthulhu. I’ll start really simply by using /Cthulhu/g as my expression. Type that into your regular expression input field and check out the test string to see what that matches.
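The same search can be sketched in Javascript. The passage below is a made-up stand-in for the story text, and matchAll() is the standard String method that yields one match object per occurrence of a global expression. For comparison, indexOf only ever reports the first position.

```javascript
// Hypothetical stand-in text, not a quotation from the story.
const passage = "The idol was Cthulhu. Nobody spoke. Great Cthulhu waits dreaming.";

// indexOf finds only the first occurrence.
const firstPosition = passage.indexOf("Cthulhu");
console.log(firstPosition); // 13

// The global expression finds every occurrence; m.index is where each one starts.
const allPositions = Array.from(passage.matchAll(/Cthulhu/g), (m) => m.index);
console.log(allPositions); // [ 13, 42 ]
```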
That gives us the name, but we want the entire sentence. To get there, we need to dig into regular expression syntax and we need to think about what makes a sentence.
Defining a Sentence
We look at sentences every day and can easily identify them, but how can we tell a computer what a sentence is? A sentence begins with a capital letter and ends with some kind of punctuation (a period, a question mark, or an exclamation point), but that’s not going to be specific enough. You might have a capital letter in the middle of a sentence which would break that definition.
You might define a sentence as having a subject and a predicate. That’s true, but regular expressions can’t pick those out. We need a simpler way to define this that regular expressions can understand.
If you think about the boundaries of a sentence, that’s helpful. We’ve already covered the punctuation that ends a sentence. What marks the beginning of a sentence? The end of the previous one!
Special Characters in Regular Expressions
To make things easy on regular expressions, we’ve defined a sentence as the characters between two sentence-ending punctuation marks. Most sentences end with a period, so we’ll start there. Before we do though, it’s important to know that regular expressions treat certain characters as literal (meaning the actual characters are matched) while other characters are what I like to call magical (meaning they match something besides the character). It’s easiest to learn by way of example.
A dot in regex (.) is a special character that matches any character that isn’t a newline. If our regular expression is simply /./, and our string is The quick brown fox jumped over the lazy dog., the match would be T. If we make our regular expression global (/./g), it will match every single character in the sentence including the spaces and punctuation, each as a separate match.
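Here is a minimal sketch of that behavior using the same sentence. Only the variable names are invented.

```javascript
const foxSentence = "The quick brown fox jumped over the lazy dog.";

// A lone dot matches the first character it can find.
const firstCharacter = foxSentence.match(/./)[0];
console.log(firstCharacter); // T

// With g, every character (spaces and the final period included) is its own match.
const everyCharacter = foxSentence.match(/./g);
console.log(everyCharacter.length === foxSentence.length); // true
```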
All that to say we can’t match a period in our string just by including a dot in our regular expression. We must escape the dot to match it literally. You can escape any special character by preceding it with a backslash (\). (Note: Since the backslash itself is a special character used for escaping other characters, you would match a backslash by adding two backslashes to your regular expression like so: \\.)
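A short sketch of the difference between the magical dot and the escaped dot, plus the doubled backslash. The sample strings are invented; note that a Javascript string literal also needs its own backslash doubled.

```javascript
const dottedText = "a.b.c";

// The unescaped dot matches all five characters.
console.log(dottedText.match(/./g).length);  // 5

// The escaped dot matches only the two literal periods.
console.log(dottedText.match(/\./g).length); // 2

// "C:\\temp" is the string C:\temp; /\\/ matches its single backslash.
const backslashText = "C:\\temp";
console.log(/\\/.test(backslashText)); // true
```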
Adding Periods to the Expression
Back to our expression. We want to match every sentence with Cthulhu in it, defining a sentence as the characters between two periods. Let’s try this:
/\.Cthulhu\./g;
To break that down, the first forward slash begins your regular expression. The \. tells the regular expression engine we want to match an actual dot. Then, we have the word we want to match, another literal dot (using \.), and the forward slash ending our regular expression. The g matches every occurrence. Let’s test this on Regular Expressions 101.
You can see in the top-right of the screenshot we now have 0 matches. Why? Because now, we’re only matching the exact string .Cthulhu. and nothing else. We are matching Cthulhu between two periods, but it can’t include any other characters. We need to match all the characters between two periods if Cthulhu is among them.
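The zero-match result is easy to reproduce in code. In this sketch the two sample strings are invented; match() returns null when a global expression finds nothing.

```javascript
const strictPattern = /\.Cthulhu\./g;

// Ordinary prose has spaces and other words between the periods, so nothing matches.
const proseResult = "He whispered. Cthulhu waits.".match(strictPattern);
console.log(proseResult); // null

// Only a period, the name, and another period back to back will match.
const contrivedResult = "x.Cthulhu.y".match(strictPattern);
console.log(contrivedResult[0]); // .Cthulhu.
```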
Regular Expression Quantifiers
What characters might be in the sentence and how many might we have? We don’t know, so we’ll need to use the dot we learned about earlier to match any non-newline character. We’ll also use a new concept called a quantifier. A quantifier tells how many of a character you want to allow in your matched string. Here are the basic quantifiers:
* - matches 0 or more of the character
? - matches 0 or 1 of the character
+ - matches 1 or more of the character
{x,y} - matches between x and y of the character. Substitute your own numbers for x and y. For example, {3,9} matches between 3 and 9 of the preceding character.
To use a quantifier, just put the quantifier immediately after the character you want to repeat in your match.
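The four quantifiers can be compared side by side. This is a small sketch with invented patterns; the ^ and $ around each one pin the pattern to the whole string (as in the zip code expression) so that a partial match cannot hide the difference.

```javascript
// * allows zero or more b characters.
console.log(/^ab*c$/.test("ac"));         // true
// ? allows at most one b.
console.log(/^ab?c$/.test("abbc"));       // false
// + needs at least one b.
console.log(/^ab+c$/.test("ac"));         // false
// {2,3} needs two or three b characters.
console.log(/^ab{2,3}c$/.test("abbbc"));  // true
console.log(/^ab{2,3}c$/.test("abbbbc")); // false
```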
First, let’s figure out which is most appropriate for our use case. We have two spots where we can have any number of characters in our match: between the first period and Cthulhu and between Cthulhu and the second period. We’re going to use the dot for the match since a sentence could have any character in those positions. We just need to decide how many of the characters could be in each position. We’ll start with the first position.
We can go ahead and disqualify the {x,y} since we don’t know the minimum or maximum number of characters we might have. Could we have zero characters before Cthulhu in a sentence? In other words, could our sentence begin with Cthulhu? Sure, it could! That disqualifies + since it needs at least one character for a match. Now, we have only * and ? remaining. Could we have more than 1 character before Cthulhu in the sentence? Yes, we could. That disqualifies ? since it matches no more than a single character, leaving us with * as our quantifier. Here’s the regular expression now:
/\..*Cthulhu\./g;
Now, let’s look at the other side of the expression. We can ask the same questions, this time about the part of the sentence after Cthulhu. I’ll skip ahead and tell you the answers are the same. We could have zero or more characters after Cthulhu in the sentence, so, again, * is our quantifier of choice.
/\..*Cthulhu.*\./g;
Let’s break down what we have so far.
/ begins the expression
\. matches a period
.* matches any number of any character
Cthulhu matches just that
.* matches any number of any character
\. matches another period
/ ends the expression
g matches globally
If we try that expression, we have a few matches again!
Inspect the matches, though, and you’ll see they go too far.
This is happening because . matches any character that isn’t a newline character, even another period! If “Cthulhu” is found in a paragraph, the regular expression match starts at the first period and continues on until it hits the final period before the newline which starts the next paragraph. We need a way to exclude periods from the match except for those that bookend the match. The dot character is too inclusive for our purposes.
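The overreach can be reproduced with a one-line paragraph. The paragraph below is invented for the sketch; the expression is the one built so far.

```javascript
// Hypothetical single-line paragraph with four sentences.
const paragraph = "It was dark. Cthulhu slept. The stars turned. Nobody woke.";

const greedyMatches = paragraph.match(/\..*Cthulhu.*\./g);

// One match, running from the first period all the way to the last one.
console.log(greedyMatches.length); // 1
console.log(greedyMatches[0]);     // . Cthulhu slept. The stars turned. Nobody woke.
```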
Character Classes
Regular expressions have some built-in character classes you can use. We’ve used . which matches any non-newline character. You can match a numerical digit with \d, a space character with \s, or a word character with \w. If those don’t meet your needs, you can create your own. Just enclose the characters you want to match inside square brackets to match any one of those characters. Here’s an example that matches “1,” “2,” or “3”: [123]
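A brief sketch of the built-in classes and a custom one in action. The sample strings are invented.

```javascript
// \d matches digits.
const digits = "Found in 1925".match(/\d/g);
console.log(digits.join("")); // 1925

// \s matches whitespace such as spaces, tabs, and newlines.
const spaces = "a b\tc".match(/\s/g);
console.log(spaces.length); // 2

// \w matches letters, digits, and underscore, so the apostrophe is skipped.
const wordCharacters = "R'lyeh".match(/\w/g);
console.log(wordCharacters.join("")); // Rlyeh

// A custom class matches any one of the listed characters.
const oneTwoThree = "4123".match(/[123]/g);
console.log(oneTwoThree.join("")); // 123
```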
Since the dot is too inclusive to match only a single sentence in our Cthulhu hunter expression, we can leverage a character class to match exactly what we want. We’ll pair that with the caret (^) to negate the class so that it matches every character that is not in the class. (Note: The caret only does negation inside a character class. It serves a different function elsewhere that we’ll get to later.)
Negating the character class allows us to match every character that isn’t one of our sentence ending punctuation marks.
/\.[^.?!]*Cthulhu[^.?!]*\./g;
We’ve replaced the dot character with our new negated character class which matches any character that isn’t a period, a question mark, or an exclamation mark. Take note that, inside the square brackets, you don’t need to escape most characters. We were able to drop the backslash in front of the period. The only characters you may need to escape inside a character class are the backslash (\), the right bracket (]), the caret (^) when it comes first, and the hyphen (-) when it sits between two other characters.
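To see the negated class rein the match in, here is a sketch that runs the new expression against an invented paragraph. The second part shows the characters that do call for a backslash inside square brackets.

```javascript
const nightText = "It was dark. Cthulhu slept. The stars turned.";

// The negated class cannot cross a period, so the match stays within one sentence.
const boundedMatches = nightText.match(/\.[^.?!]*Cthulhu[^.?!]*\./g);
console.log(boundedMatches); // [ '. Cthulhu slept.' ]

// Caret, right bracket, hyphen, and backslash escaped inside a class.
// "a^b]c-d\\e" is the string a^b]c-d\e.
const escapedInClass = "a^b]c-d\\e".match(/[\^\]\-\\]/g);
console.log(escapedInClass.length); // 4
```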
While I’m here, I might as well add support for matching other sentence-ending punctuation since the character class gives us an easy way to do it.
/[.?!][^.?!]*Cthulhu[^.?!]*[.?!]/g;
Let’s break it all down again.
/ begins the expression
[.?!] matches a period, a question mark, or an exclamation mark
[^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
Cthulhu matches just that
[^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
[.?!] matches the ending period, question mark, or exclamation mark
/ ends the expression
g matches globally
If you check the matches, we’re getting really close.
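The payoff of allowing any sentence-ender on both sides shows up when the neighboring sentences end in something other than a period. A sketch with an invented sample:

```javascript
const questionText = "Who sleeps there? Cthulhu sleeps! The stars turned.";

// Periods only: the lone period comes after the name, so nothing matches.
const periodOnly = questionText.match(/\.[^.?!]*Cthulhu[^.?!]*\./g);
console.log(periodOnly); // null

// Any sentence-ender on either side: the sentence is found.
const anyEnder = questionText.match(/[.?!][^.?!]*Cthulhu[^.?!]*[.?!]/g);
console.log(anyEnder); // [ '? Cthulhu sleeps!' ]
```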
Capture Groups
The most glaring issue we have right now is that the preceding sentence’s punctuation gets matched. We have to match it because we don’t know where one sentence ends without it. However, we can also tell regular expressions which parts of the match we care about by using capture groups.
Capture groups allow us to grab parts of the match by surrounding characters in our pattern with parentheses. Each surrounded pattern’s match will be pulled out as a capture group. The “Match Information” box in the right sidebar of Regular Expressions 101 will show us the full matches and the capture groups. In Javascript, we can access each capture group independently.
For our expression, we’ll want to make sure the first period is not enclosed in the parentheses, but the final period should be since it is part of the sentence being captured.
/[.?!]([^.?!]*Cthulhu[^.?!]*[.?!])/g;
To see if this did the job, we’ll check the capture groups under “Match Information” on the right. Here’s the first match’s capture group:
Hieroglyphics had covered the walls and pillars, and from some undetermined point below had come a voice that was not a voice; a chaotic sensation which only fancy could transmute into sound, but which he attempted to render by the almost unpronounceable jumble of letters, “Cthulhu fhtagn”.
That’s closer. We’ve eliminated the preceding sentence’s punctuation by using the capture group, but we still have a leading space. In order to eliminate it, we’ll need to change one of our character classes.
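In Javascript, each match object returned by matchAll() holds the full match at index 0 and the capture groups at index 1 and up. This sketch uses an invented sample; it shows both the dropped punctuation and the leftover leading space.

```javascript
const groupPattern = /[.?!]([^.?!]*Cthulhu[^.?!]*[.?!])/g;
const groupText = "It was dark. Cthulhu slept. The stars turned.";

const groupMatches = Array.from(groupText.matchAll(groupPattern));

// Index 0 is the whole match, including the previous sentence's period.
console.log(JSON.stringify(groupMatches[0][0])); // ". Cthulhu slept."
// Index 1 is the capture group: no leading period, but the space is still there.
console.log(JSON.stringify(groupMatches[0][1])); // " Cthulhu slept."
```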
Since the character class following the initial sentence-ender character class will match spaces, the spaces also get included in our capture group if we surround it with parentheses. We can break up that character class to be able to capture all the other characters but not the leading space characters. To do that, we’ll match the space characters before our first [^.?!] so that those characters get swallowed up before we get into the capture group. We can match space characters including spaces, tabs, and newlines with the \s character class. We’ll want to match any number of them in case the author includes two spaces between sentences. (In the particular passage we’re experimenting with, only a single space is used, but this will make our expression more generally useful.)
/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!])/g;
Now, looking at our “Match Information,” we can see our capture group contains this:
Hieroglyphics had covered the walls and pillars, and from some undetermined point below had come a voice that was not a voice; a chaotic sensation which only fancy could transmute into sound, but which he attempted to render by the almost unpronounceable jumble of letters, “Cthulhu fhtagn”.
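A sketch of the same check in code, using an invented sample with two spaces between sentences to exercise the "any number" part of \s*.

```javascript
const trimmedPattern = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!])/g;

// Two spaces after the first sentence.
const spacedText = "It was dark.  Cthulhu slept. The stars turned.";

// Keep only the capture group from each match.
const trimmedSentences = Array.from(spacedText.matchAll(trimmedPattern), (m) => m[1]);
console.log(trimmedSentences); // [ 'Cthulhu slept.' ]
```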
Cleaning Up
We’re almost there, but we have three edge cases we can clean up to make this perfect.
Capturing Closing Quotes After Sentence-Enders
You may have noticed one of them earlier. You can see the problem if you examine this captured text closely:
What, in substance, both the Esquimau wizards and the Louisiana swamp-priests had chanted to their kindred idols was something very like this--the word-divisions being guessed at from traditional breaks in the phrase as chanted aloud: “Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.
The problem is that we lose the closing quote of the sentence since it occurs after the period. I believe a closing quote is the only punctuation we should see occurring after the period in the sentence, so we’ll just add that as an optional character to match. To optionally match a character, we can use the ? quantifier. To make it even more robust, I’d like it to match either a right double quote (”) or a straight quote (") since people use them interchangeably. We can do this with a character class.
/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;
Here’s that previous capture with this new expression:
What, in substance, both the Esquimau wizards and the Louisiana swamp-priests had chanted to their kindred idols was something very like this--the word-divisions being guessed at from traditional breaks in the phrase as chanted aloud: “Ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.”
It’s still a little odd with the tab before the quote. I’m not sure there’s a great way to deal with that without losing the context before it, and I think that should be there.
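Here is a sketch of the optional closing quote at work. The helper name captureSentences and both sample strings are invented; one sample uses curly quotes and the other straight quotes.

```javascript
const quotePattern = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;

// Hypothetical helper: collect the capture group from every match.
function captureSentences(text) {
  return Array.from(text.matchAll(quotePattern), (m) => m[1]);
}

const curlyQuoted = captureSentences("He spoke. They chanted, “Cthulhu fhtagn.” Then silence.");
console.log(curlyQuoted[0]);    // They chanted, “Cthulhu fhtagn.”

const straightQuoted = captureSentences('He spoke. They chanted, "Cthulhu fhtagn." Then silence.');
console.log(straightQuoted[0]); // They chanted, "Cthulhu fhtagn."
```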
Case
Next up, we have one instance of the word “Cthulhu” that isn’t matched. That’s because it’s in all-caps. We don’t care about case for our “Cthulhu” matches, so let’s make it case insensitive. Let’s start with two ways we can do this that are the wrong ways in this case.
We could change each character in Cthulhu to a character class containing both the upper and lower case letters. That would look like this: [Cc][Tt][Hh][Uu][Ll][Hh][Uu]. It’s fine and it works, but it’s hard to type, hard to read, and hard to maintain. If you had a single letter or two, this would be a fine solution. For anything more than that, let’s look elsewhere.
Some regular expression flavors let you activate option flags in the middle of an expression. To do so, you write (?<flag>), replacing <flag> with the actual character for the flag you want to activate. The flag we want here is i for “insensitive.” You can then remove the flag with (?-<flag>). Think of that as subtracting the flag. With that change, the “Cthulhu”-matching part of our expression would look like this: (?i)Cthulhu(?-i). Be aware that this works in flavors such as PCRE but not in Javascript, which rejects the bare (?i) form as a syntax error.
The reason this isn’t the right method here anyway is that we know none of our expression needs to be case sensitive. It’s easier in our case to just make the entire expression insensitive by tacking the i onto the flags for the entire expression after the closing slash.
/[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;
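The effect of the i flag can be sketched by running the expression with and without it against an invented sample. The last part checks the inline (?i) syntax, which Javascript engines reject with a SyntaxError when the pattern is compiled.

```javascript
const caseText = "It was dark. CTHULHU FHTAGN! The stars turned. Cthulhu slept.";

const caseSensitive = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;
const caseInsensitive = /[.?!]\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;

const withoutFlag = Array.from(caseText.matchAll(caseSensitive), (m) => m[1]);
console.log(withoutFlag); // [ 'Cthulhu slept.' ]

const withFlag = Array.from(caseText.matchAll(caseInsensitive), (m) => m[1]);
console.log(withFlag);    // [ 'CTHULHU FHTAGN!', 'Cthulhu slept.' ]

// The inline flag form fails to compile in Javascript.
let inlineFlagError = null;
try {
  new RegExp("(?i)Cthulhu(?-i)");
} catch (error) {
  inlineFlagError = error;
}
console.log(inlineFlagError instanceof SyntaxError); // true
```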
Generalizing for Other Texts
We can still make one more improvement, though, to make our expression more general. I’m pretty sure we’ve matched every instance of “Cthulhu” in The Call of Cthulhu, but, if we want to be able to match against any text that might mention Cthulhu, there’s one case we’ll miss with our current expression.
Since we start our match at sentence-ending punctuation, we have no way of matching the first sentence. In this case, the first sentence doesn’t mention Cthulhu, but what if we want to use this same expression against a text that does mention him in the first sentence? Our expression will miss that sentence entirely.
We need three tools to fix this problem. The first is anchors. An anchor describes where you want the match to appear rather than what you want it to contain. If you want to match only the beginning of your search string, use the caret (^). To match at the end of your string, use the dollar sign ($). What we want to do here is to start one of our matches at either a sentence-ender or at the beginning of the string we’re searching through.
This creates a problem because we can’t just place the caret inside our existing sentence-ender character class. Doing so would create a negative match for the sentence-enders (meaning we would match anything that isn’t a sentence-ender). If we decided to try escaping the caret, that would match a literal caret rather than the beginning of the string. We need a different way to give the expression two options for what to match.
The second tool is alternation. By placing a pipe character (|), you can separate two alternatives for a match. a|b would match either an “a” or a “b” for a single character position just like [ab]. This is just the tool we need to solve our problem since a character class won’t work with an anchor.
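Anchors and alternation can each be tried in isolation before combining them. A minimal sketch with invented strings:

```javascript
// ^ only matches at the very start of the string.
console.log(/^Cthulhu/.test("Cthulhu waits")); // true
console.log(/^Cthulhu/.test("Great Cthulhu")); // false

// $ only matches at the very end.
console.log(/dreaming$/.test("waits dreaming")); // true

// a|b and [ab] find the same single characters.
const viaPipe = "abc".match(/a|b/g);
const viaClass = "abc".match(/[ab]/g);
console.log(viaPipe, viaClass); // [ 'a', 'b' ] [ 'a', 'b' ]
```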
Now, we have one small remaining problem. If we just drop the pipe in and add a caret, the entire expression is split around the pipe meaning the expression would match either a sentence ender or the entire remainder of the expression. Here’s that expression with some extra spacing so you can see the two alternatives:
/[.?!] | ^\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/;
The alternatives should be the sentence ender or the start-of-string anchor, but instead we’ve split between the sentence-enders and everything else. To fix this, we need a group.
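Running the ungrouped version (without the illustrative spacing, and with g added so every match is listed) makes the wrong split visible. The sample text is invented.

```javascript
const ungrouped = /[.?!]|^\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/g;

const ungroupedMatches = "It was dark. Cthulhu slept.".match(ungrouped);

// The left alternative wins on its own: every match is a lone sentence-ender,
// and the Cthulhu sentence is never matched as a whole.
console.log(ungroupedMatches); // [ '.', '.' ]
```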
The confusing part about grouping in regular expressions is that it’s done with the same mechanism used for capture groups. Surround part of your expression with parentheses to create a group. Splitting inside the group only creates two alternatives within the group. So, we’ll enclose our sentence-enders in a group, drop a pipe in after them, and add our anchor to create our alternatives for matching. Here it is:
/([.?!]|^)\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;
Here’s why this is especially confusing: the parentheses not only create a group for the purposes of our alternation, they also create an additional capture group. Now, instead of a single capture group with the sentence, we have a new capture group containing the previous sentence’s sentence-ender which occurs before the one we care about. This is easy to work around (just ignore the first capture group), but it could really cause some confusion if you didn’t know to look for it. In fact, if you were pulling the sentences out of the first capture group in a Javascript app, you’d now have a bunch of periods instead. You’d fix this by grabbing the second capture group in your app instead.
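Pulling the second capture group out in Javascript might look like the sketch below. The function name findCthulhuSentences and the sample text are invented for illustration.

```javascript
const cthulhuSentencePattern = /([.?!]|^)\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;

// Hypothetical helper: return the sentence (capture group 2) from every match.
function findCthulhuSentences(text) {
  return Array.from(text.matchAll(cthulhuSentencePattern), (m) => m[2]);
}

const sampleStory = "Cthulhu sleeps in R'lyeh. The sea is calm. Who woke CTHULHU? Nobody knows.";
const foundSentences = findCthulhuSentences(sampleStory);
console.log(foundSentences);
// [ "Cthulhu sleeps in R'lyeh.", 'Who woke CTHULHU?' ]
```

If the extra capture group is a nuisance, a non-capturing group such as (?:[.?!]|^), the same (?: ) form used in the zip code expression, would group the alternatives and leave the sentence in group 1.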
To test this, I’ll add a Cthulhu sentence at the beginning and make sure it gets matched. Here are the results of that test:
As you see in the screenshot, our new changes can match the first sentence as well as any of the others.
Final Breakdown
Let’s break down the final regular expression just to be sure we understand all of it. Here’s the whole thing again:
/([.?!]|^)\s*([^.?!]*Cthulhu[^.?!]*[.?!]["”]?)/gi;
and the breakdown:
/ begins the expression
( starts a capture group. We don’t actually care about what is captured here. We’re just using it to group two alternatives for matching. Here are the two things, either of which could be matched for this group:
    [.?!] matches a period, a question mark, or an exclamation mark
    | separates the alternatives
    ^ matches the beginning of the string
) closes the capture group
\s* matches any number of space characters. We eat these up before our second capture group so they don’t pollute the sentence we actually want to capture.
( starts the capture group we actually want to capture. Inside, we’ll capture the sentence that mentions Cthulhu.
    [^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
    Cthulhu matches just that. Remember that, since we’ve added the i flag, the case here no longer matters.
    [^.?!]* matches any number of characters which are not a period, a question mark, or an exclamation mark
    [.?!] matches the ending period, question mark, or exclamation mark
    ["”]? matches zero or one closing quotes (either straight quotes or a right double quote) after the sentence-ending punctuation
) ends the capture group
/ ends the expression
g matches globally
i makes the matching case-insensitive
Further Reading
If you need help figuring out how to use Regular Expressions in Javascript, check out my post on how to validate user input in Javascript with regular expressions. By pairing that knowledge with this, you’ll have a powerful new tool to add to your web development toolbelt!