Strings do too many things
The most powerful and terrible of all basis types
No Newsletter next week
TLA+ Workshop and moving places.
Strings do too many things
In the unusual basis types email[1] I wrote this about strings:
We use strings for identifiers, human writing, structured data, and grammars. If you instead use symbols for identifiers then you can be more confident a given string isn't an identifier.
I think this is a useful concept and want to explore it a bit more. Consider the following pseudocode:
query = "INSERT INTO " + tablenames["comments"] + " VALUES ('f1rst p0st', 'john@example.com');"
In that line,
- "comments" is an identifier. We don't care about anything besides that it's distinguishable from other identifiers. This category also includes enumerations, like when the states of a light are ("red", "yellow", "green").
- "f1rst p0st" is human writing. It means nothing to the system and something to the people that interact with the system. Most text in the real world is human writing.
- "john@example.com" is structured data. We store it because this database doesn't have a special "email" type. In a programming language we'd instead store it as Email(local="john", domain="example.com").[2]
- The entire query is a grammar. It's a snippet of the SQL language. For that matter, the whole line is also a piece of grammar which we store in the program's text file.
(The line between "structured data" and "grammar" is really fuzzy; is a CSV data or grammar? Maybe making a distinction isn't useful, but they feel different to me)
When you see a string in code, you want to know what kind of string it is. We use these strings for different purposes and we want to do different things to them. We might want to upcase or downcase identifiers for normalization purposes, but we don't split or find substrings in them. But you can do those operations anyway because all operations are available to all strings. It's like how if you store user ids as integers, you can take the average of two ids. The burden of using strings properly is on the developer.
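One way to take some of that burden off the developer is to give each kind of string its own type, so the meaningless operations aren't there to call. This is a minimal sketch, not a recommendation for any particular codebase; `LightState`, `UserId`, and `Email` are names made up for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class LightState(Enum):
    """Identifiers: we only care that the members are distinguishable."""
    RED = "red"
    YELLOW = "yellow"
    GREEN = "green"


@dataclass(frozen=True)
class UserId:
    """Wraps an int so arithmetic on ids is not available by accident."""
    value: int


@dataclass(frozen=True)
class Email:
    """Structured data: the parts are stored separately, not as one string."""
    local: str
    domain: str

    def __str__(self) -> str:
        return f"{self.local}@{self.domain}"


state = LightState("red")      # look up an identifier from its string form
email = Email(local="john", domain="example.com")

# Raw strings and ints allow operations that make no sense for their role:
print("red".split("e"))        # allowed, but meaningless for an identifier
print((17 + 42) / 2)           # the "average" of two user ids stored as ints

# The wrapper types do not offer those operations at all:
print(hasattr(state, "split"))
try:
    UserId(17) + UserId(42)
except TypeError as err:
    print("TypeError:", err)
```

The string form still exists at the edges (`LightState("red")`, `str(email)`), but inside the program the type says what kind of thing you are holding.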
Almost all programming languages have an ASCII grammar but they need to support Unicode strings, because human writing needs Unicode. How do you lexicographically sort a list of records when one of them starts with "𐎄𐎐𐎛𐎍"?
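To make the sorting question concrete, here is a small sketch of what Python's default string ordering does. The list of names is made up for illustration; the point is only that the built-in sort compares code points, which is not the same thing as any human alphabetical order.

```python
# Hypothetical records; "\u00e9clair" is "eclair" with an acute accent on the e.
names = ["apple", "Zebra", "\u00e9clair", "𐎄𐎐𐎛𐎍"]

# The default sort compares code points, so uppercase ASCII sorts first.
by_code_point = sorted(names)
print(by_code_point)

# Case folding fixes "Zebra" vs "apple", but it still says nothing about
# where accented letters or an entirely different script should go.
by_casefold = sorted(names, key=str.casefold)
print(by_casefold)

# The first character of the last record is outside the Basic Multilingual
# Plane, so its code point is larger than 0xFFFF.
first = "𐎄𐎐𐎛𐎍"[0]
print(hex(ord(first)))
```

A real answer needs a collation algorithm (the Unicode Collation Algorithm is the standard one) plus a decision about which locale's rules apply, which is a lot of machinery for something that looked like "sort some strings".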
Other problems
I think the conflation of identifiers and strings is "merely" "annoying" in that it adds more cognitive burden.[3] The real problems are with data and grammars.
- There's no way to tell a string's nature without context. Is "John wears a hat" human writing, or a kind of data triple, or some kind of unusual source code? Is the triple "{wears} {a hat}" or "{wears a} {hat}"?
- Strings of one nature can contain strings of another nature.
  element=<pre>2024-02-07,hello world</pre>
  has structured data inside a grammar inside a grammar inside a grammar. And it's all inside human writing inside a markdown grammar that'll be parsed by buttondown's email generator. I will be astounded if it renders properly for everyone.
- We often want to store human writing in data/grammars, which means bolting an extra grammar on top of writing. JSON can't contain multiline strings; is the value "Thank you,\nHillel" a string with a linebreak or a string with the text \n?
- Grammars and structured data can be written incorrectly. If the program misunderstands the nature of a string, valid data can be transformed into invalid data. Lots of simple programs break on Windows because they interpret the path separator (\) as an escape sequence.
- Security: SQL injection is possible because the fragment "WHERE id="+id expects id to contain an identifier, but someone gets it to contain SQL code instead.
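Two of the points above are easy to reproduce. The sketch below uses Python's standard `json` and `sqlite3` modules; the `comments` table, its columns, and the `user_input` value are assumptions made up for the demonstration, and SQLite stands in for whatever database you actually use.

```python
import json
import sqlite3

# 1. Human writing inside JSON: the encoder bolts an escape grammar on top.
signoff = "Thank you,\nHillel"     # contains a real line break
literal = "Thank you,\\nHillel"    # contains a backslash followed by the letter n
print(json.dumps(signoff))         # the line break is written as \n
print(json.dumps(literal))         # the backslash is written as \\
assert json.loads(json.dumps(signoff)) == signoff

# 2. SQL injection: the fragment expects an identifier but receives SQL code.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (id INTEGER, body TEXT)")
db.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(1, "f1rst p0st"), (2, "second")],
)

user_input = "1 OR 1=1"            # attacker-controlled "id"

# Concatenation lets the input be read as part of the SQL grammar.
unsafe = db.execute(
    "SELECT body FROM comments WHERE id=" + user_input
).fetchall()

# A bound parameter is always treated as data, never as SQL.
safe = db.execute(
    "SELECT body FROM comments WHERE id=?", (user_input,)
).fetchall()

print(len(unsafe), len(safe))
```

The concatenated query returns every row, while the parameterized one returns none, because there the whole input is compared to `id` as a single value.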
The bigger problem: serialization
Given the problems with strings doing too much, the direct approach is to offload stuff from strings. Some languages have native types like symbols; otherwise, you can make special types for identifiers and each different kind of data. Grammars are trickier to offload; I usually see this done as composable functions or methods.
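As a rough illustration of "composable methods" for a grammar, here is a toy query builder. Everything in it (`Select`, `where`, `render`, the `TABLES` schema) is hypothetical and far simpler than a real query library; it only shows the shape of the idea, where identifiers are checked against known names and values never enter the SQL text.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical schema, used to check identifiers before they reach the SQL text.
TABLES = {"comments": {"id", "author", "body"}}


@dataclass(frozen=True)
class Select:
    table: str
    conditions: tuple = ()  # pairs of (column, value)

    def __post_init__(self) -> None:
        if self.table not in TABLES:
            raise ValueError(f"unknown table: {self.table!r}")

    def where(self, column: str, value: Any) -> "Select":
        """Return a new query with one more equality condition."""
        if column not in TABLES[self.table]:
            raise ValueError(f"unknown column: {column!r}")
        return Select(self.table, self.conditions + ((column, value),))

    def render(self) -> tuple:
        """Produce the SQL text and the parameters as separate things."""
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            clauses = " AND ".join(f"{col} = ?" for col, _ in self.conditions)
            sql += " WHERE " + clauses
        return sql, tuple(value for _, value in self.conditions)


query = Select("comments").where("id", 1).where("author", "john")
sql, params = query.render()
print(sql)
print(params)
```

The string only appears at the very end, in `render`, and by then the identifiers and the data have already been kept apart.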
Then you get the problem of getting information out of the program. Files are strings. HTTP messages are strings. Even things like booleans and numbers are now strings. Strings are the lowest common denominator because all languages understand strings. C doesn't know what an abstract base class is but it does know what a string is.
I guess for interprocess communication you've at least got stuff like protobuf and thrift. Otherwise, you could try throwing another grammar in the mix? Instead of storing the email as "john@example.com", store it as
{
"type": "email",
"value": "john@example.com"
}
See also: JSON Schema.
Anyway this is all just another example of "push errors to the boundaries of your code". The real world might be stringly typed, but your program internals don't have to be.
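Here's a sketch of what that boundary could look like for the tagged email above. The `Email` type and the `parse_email`, `decode`, and `encode` functions are hypothetical names, and the email check is deliberately simplistic (real addresses have a much richer grammar); the point is only that strings get converted to typed values on the way in and back to strings on the way out.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    local: str
    domain: str


def parse_email(text: str) -> Email:
    """Split on the last '@'. Simplified on purpose, not a real validator."""
    local, sep, domain = text.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {text!r}")
    return Email(local, domain)


def decode(message: str) -> Email:
    """Turn a tagged JSON string into a typed value, or fail right here."""
    payload = json.loads(message)
    if not isinstance(payload, dict) or payload.get("type") != "email":
        raise ValueError("expected an object with type 'email'")
    value = payload.get("value")
    if not isinstance(value, str):
        raise ValueError("expected 'value' to be a string")
    return parse_email(value)


def encode(email: Email) -> str:
    """Go back to a string only when the value leaves the program."""
    return json.dumps(
        {"type": "email", "value": f"{email.local}@{email.domain}"}
    )


incoming = '{"type": "email", "value": "john@example.com"}'
email = decode(incoming)
print(email)
print(encode(email))
```

Anything past `decode` can assume it has an `Email`, and a malformed message raises at the edge instead of somewhere deep inside the program.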
[1] Thanks to everybody who sent me other unusual basis types! And the people who pointed out that I forgot sum types as a (near) universal. Feel like an idiot for missing that one.
[2] This is just a demonstrative example; in the real world, A B <@C :"D E"@F -G-H!> is a valid email.
[3] And while writing this I thought of a really annoying case with identifiers as strings: I've worked with a few APIs that take options as a string. Some read "foo bar" as the two identifiers foo and bar, and some read "foo bar" as the single identifier foo bar.