Why do regexes use `$` and `^` as line anchors?
A history that will satisfy nobody.
Next week is April Cools! A bunch of tech bloggers will be writing about a bunch of non-tech topics. If you've got a blog, come join us! You don't need to drive yourself crazy with a 3000-word hell essay, just write something fun and genuine and out of character for you.
But I am writing a 3000-word hell essay, so I'll keep this one short. Last week I fell into a bit of a rabbit hole: why do regular expressions use `$` and `^` as line anchors?
This talk brings up that they first appeared in Ken Thompson's port of the QED text editor. In his manual he writes:
b) "^" is a regular expression which matches character at the beginning of a line.
c) "$" is a regular expression which matches character before the character <nl> (usually at the end of a line)
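To make those two definitions concrete, here's a small sketch using Python's `re` module. This is a modern descendant of those regexes, not Thompson's QED, and the sample text and variable names are made up for illustration. Note that Python only treats `^` and `$` as per-line anchors when you pass `re.MULTILINE`.

```python
import re

text = "first line\nsecond line\n"

# Default mode: ^ matches only at the start of the string, and $ matches
# at the end of the string or just before a trailing newline.
first_word = re.findall(r"^\w+", text)
last_word = re.findall(r"\w+$", text)

# re.MULTILINE: ^ and $ also match at the start and end of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# $ matches *before* the newline, echoing the manual's wording.
dollar_pos = re.search(r"$", "abc\n").start()

print(first_word, last_word, line_starts, line_ends, dollar_pos)
```

The last lookup lands on index 3, the position just before the `\n`, which lines up with the "before the character <nl>" phrasing.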
QED was the precursor to ed, which was instrumental in popularizing regexes, so a lot of its design choices stuck.
Okay, but then why did Ken Thompson choose those characters?
I'll sideline `^` for now and focus on `$`. The original QED editor didn't have regular expressions. Its authors (Butler Lampson and Peter Deutsch) wrote an introduction for the ACM. In it they write:
Two minor devices offer additional convenience. The character "." refers to the current line and the character "$" to the last line in the buffer.
So `$` already meant "the end of the buffer", and Ken adapted it to mean "the end of the line" in regexes.
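As a rough illustration of the addressing idea (not the actual QED implementation), here's a toy resolver in Python. The function name `resolve_address`, its parameters, and the sample buffer are all hypothetical. It only handles `.`, `$`, and plain line numbers.

```python
def resolve_address(addr, current_line, buffer):
    """Turn an address into a 1-based line number (hypothetical helper)."""
    if addr == ".":
        return current_line      # "." is the current line
    if addr == "$":
        return len(buffer)       # "$" is the last line in the buffer
    return int(addr)             # assume anything else is a plain line number


buffer = ["alpha", "beta", "gamma"]
last = resolve_address("$", 1, buffer)
here = resolve_address(".", 2, buffer)
print(last, here, buffer[last - 1])
```

Seen this way, going from "last line of the buffer" to "end of the line" is a small step: `$` means "the end" in both cases, just at a different scale.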
Okay, but then why did Deutsch and Lampson use `$` for "end of buffer"?
Things get tenuous
The QED paper mentions they wrote it for the SDS-930 mainframe. Wikipedia claims (without references) that the SDS-930 used a Teletype Model 35 as its input device. The only information I can find about the Model 35 is this sales brochure, which has a blurry picture of the keyboard:
I squinted at it really hard and saw that it's missing the `[]{}\|^_@~` symbols. Of the remaining symbols, `$` is by far the most "useless": up until programming it exclusively meant "dollars", whereas even something like `#` meant three different things. But also, `$` is so important in business that every typewriter has one. So it's a natural pick as the "spare symbol".
Yes, this is really tenuous and I'm not happy with it, but it's the best answer I've got.
If we're willing to stick with it, we can also use it to explain why Ken chose `^` to mean "beginning of line". `^` isn't used in American English, and the only reason QED wasn't using it was that it wasn't on the Teletype Model 35. But Ken's keyboard did have `^`, even though it wasn't standardized at the time, so he was able to use it.
(Why did it have `^`? My best guess is that ASCII-67 included it as a diacritic and keyboards were just starting to include all of the ASCII characters. The Teletype 35 brochure says "it follows ASCII", but the keyboard doesn't include many of the symbols; it just uses the encoding format.)
So there you have it, an explanation for the regex anchors that kinda makes sense. Remember, April Cools next week!