Regexes are Cool and Good
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Then they solve their problem, hooray!
Regexes are great! I use them all the time! For some reason, though, people seem to hate them. And those "some reasons" are:
1. They're very hard to read, so you have no idea what a given regex you see does.
2. They're fragile, so slight changes to the input can break your regex. Making them more robust also makes problem (1) worse.
This is usually in the context of putting a regex in your code, where it's used to pull semantic information out of arbitrary texts. In that context, readability and robustness are really important. Simple regexes can work but for anything complicated you probably want to use a parser.
No, where regex really shines is in interactive use. When you're trying to substitute in a single file you have open, or grep a folder, things like that. Readability doesn't matter because you're writing a one-off throwaway, and fragility is fine because you're a human-in-the-loop. If anything goes wrong you will see that and tweak the regex.
Examples
(Note, I'm using vim regex notation because that's where I do all my editing.)
I'm updating learntla this year. I wrote learntla after having only used TLA+ for a few months, and it was just to help other beginners get up to speed. Now that I've been using it (and professionally teaching!) for five years, I know I can write a much, much better resource.
I wrote learntla in Hugo+markdown. As I've said before, markdown is okay for small documents but terrible for complex stuff. There's no semantic markup, so to implement things like notes and tips Hugo uses a "shortcode" preprocessor:
```
% notice tip %
The TLA+ Toolbox maps the F11 key to "Run Model".
% /notice %
```
By contrast, Sphinx+rST has a built-in extension syntax, making it easy to add semantic markup. If I want to have tips, I can add in a "tip" directive, and then the site generator knows there's a "tip" content type that should be handled differently from everything else. I've since used Sphinx for all my new documentation, so it makes sense to first convert the existing content to rST.
That's harder than it sounds! Normally, I can convert markdown to rST via pandoc. But here's where the shortcode syntax becomes a problem. Since it's used by the preprocessor and not by markdown, pandoc doesn't convert the `% notice tip %` into a directive. I have to sweep over afterwards and make those changes directly.
Or I could write a substitution regex:
```
%s/% notice \(\w\+\) %\n\(.\+\)\n% \/notice %/.. \1:: \2
```

which outputs:

```
.. tip:: The TLA+ Toolbox maps the F11 key to "Run Model".
```
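If you want to see the same transformation run outside of vim, here's a rough Python `re` sketch of it. This is my translation, not the command from the post: vim's `\(...\)` groups and `\w\+` become `(...)` and `\w+`, and `\/` needs no escaping.

```python
import re

text = '''% notice tip %
The TLA+ Toolbox maps the F11 key to "Run Model".
% /notice %'''

# Translate a Hugo-style notice shortcode into an rST directive.
# `.` doesn't match newlines by default, so (.+) grabs exactly one
# body line, mirroring the vim pattern.
converted = re.sub(r'% notice (\w+) %\n(.+)\n% /notice %',
                   r'.. \1:: \2', text)
print(converted)
# → .. tip:: The TLA+ Toolbox maps the F11 key to "Run Model".
```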
That's the kind of stuff that regex is good for. Not only is it solving a problem, but it's also obvious that the main issues of regexes aren't problems here:
- It's unreadable: I wouldn't expect another person to easily understand how it works. But this is just for me, and nobody else needs to understand it. I don't need to read it either; if I forget how it works later, I can just start over from scratch.
- It's fragile: It doesn't handle shortcodes with multiple lines in the body. But that's okay! Most of my tips are a single line, and I'll know if it doesn't convert something. I can either fix those manually or fix the regex after I run into one of those cases.
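To make that fragility concrete, here's a small Python `re` sketch (again my translation, not the vim command) showing exactly where the single-line pattern gives up, and one way to loosen it:

```python
import re

# Single-line-body pattern, as in the original substitution.
pattern = re.compile(r'% notice (\w+) %\n(.+)\n% /notice %')

multi = '''% notice tip %
First line of the tip.
Second line of the tip.
% /notice %'''

# (.+) can't cross a newline, so a two-line body simply never matches:
print(pattern.search(multi))  # → None

# A dot-matches-newline variant with a lazy body handles it, at the
# cost of a hairier pattern: the readability/robustness trade-off again.
robust = re.compile(r'% notice (\w+) %\n(.+?)\n% /notice %', re.S)
print(robust.search(multi).group(2))
```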
Now some of you might be thinking that's an awfully narrow use case. Most of us aren't converting books!
You're right, it's solving a really specific problem that I'll never encounter again. But I run into "really specific problems I'll never encounter again" all the time. Here's some other really specific problems I solved with a regex:
- Sweep a document for doubled words, like "the the": `/\<\(\w\+\)\> \<\1\>`
- Replace all instances of `+foo` with `Foo` (yes, I've had to do this!): `%s/+\(\w\)/\U\1`
- Highlight instances of `{{` that aren't the start of a Hugo shortcode:¹ `/{{\ze[^%]`
- Quickly swap a bunch of equations of form `f(g(x), y)` to `f(y, g(x))`: `:s/(\(.\+\), \(.\+\))\ze[^)]*$/(\2, \1)`
These would all be boring drudgework without regexes.
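For anyone who reads Python's regex dialect more fluently than vim's, here are sketched translations of two of those one-liners. These are my own approximations: vim's `\<`/`\>` word boundaries become `\b`, and `\ze` becomes a lookahead.

```python
import re

# Doubled words, like "the the".
doubled = re.compile(r'\b(\w+) \1\b')
print(doubled.search("in the the draft").group(1))  # → the

# Swap the final argument pair: f(g(x), y) -> f(y, g(x)).
# The lookahead pins the match to the last close-paren on the line,
# playing the role of vim's \ze[^)]*$.
swap = re.compile(r'\((.+), (.+)\)(?=[^)]*$)')
print(swap.sub(r'(\2, \1)', "f(g(x), y)"))  # → f(y, g(x))
```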
About that quote...
You know the one I'm talking about.
As always, it's important to consider the original context of the quote. Jeffrey Friedl tracked it down to this discussion, where somebody proposed embedding Perl into XEmacs. Their justification was that Perl was much better at regex handling than Lisp was. Jamie Zawinski, the maintainer of XEmacs at the time, responded:
You are trying to shoehorn your existing preconceptions of how one should program onto a vastly different (and older, and more internally consistent) model. I suggest your time would be better spent learning and understanding that other model, and learn to use it properly, and learn what it can and cannot do, rather than infecting it with this new cancer out of ignorance.
(Yeah the conversation goes places. And that's not even the beginning of the places it goes!)
Anyway, Jamie hated Perl and hated how Perl was shoehorned into everything. He didn't think regexes were bad overall:
Regexps are clear as mud. They have many things to recommend them, but clarity is not one.
But they were "the biggest hammer" in the Perl toolbox, so people tried to solve problems with regexes instead of learning the proper tool (in his case, probably Lisp). Then comes the famous quote. The people who encounter a problem and think "I'll use regular expressions" aren't thinking about the problem. And that's why, by using regex, they now have two problems: their bad habit and their broken solution.
Of course, we can make mistakes the other way. If you never use regexes then you're throwing away a tool before thinking about the problem. We say "use the right tool for the job" for a reason!
¹ When writing drafts I surround text I'm expecting to change with double brackets.