Why You Should Read "Data and Reality"
Once more: we are not modeling reality, but the way information about reality is processed, by people. — Bill Kent
I've got this working theory that you can tell how enthusiastic someone is about a subject by how obscure their favorite book is. It's just a simple numbers game: there are more obscure books than popular books, so there are more obscure good books than popular good books, so the more books you read the more likely your favorite is an obscure one. In other words, if someone says they love fantasy novels and also their favorite novel is Harry Potter, odds are they don't actually read that much fantasy.
I'm saying this because my favorite book on software is incredibly obscure, which must mean I'm a super cool software dude. That's how logic works, right?
It's Data and Reality, by Bill Kent. Unlike most books on data modeling, Data and Reality doesn't tell you how to do anything, or give you advice on the best way to model. Instead, it's a philosophical book on the nature of information and how we represent it. Kent doesn't want you to follow his advice; he wants you to ask questions, to understand just what it is we're trying to do. He opens the book with a whirlwind of examples where our intuition of meaning breaks down:
A “book” may denote something bound together as one physical unit. Thus a single long novel may be printed in two physical parts. When we recognize the ambiguity, we sometimes try to avoid it by agreeing to use the term “volume” in a certain way, but we are not always consistent. Sometimes several “volumes” are bound into one physical “book”. We now have as plausible perceptions: the one book written by an author, the two books in the library’s title files (Vol. I and Vol. II), and the ten books on the shelf of the library which has five copies of everything. Incidentally, the converse sometimes also happens, as when several novels are published as one physical book (e.g., collected works).
[...]
Transportation schedules and vehicles offer other examples of ambiguities, in the use of such terms as “flight” and “plane” (even if we ignore the other definitions of “plane” having nothing to do with flying machines). What does “catching the same plane every Friday” really mean? It may or may not be the same physical airplane. But if a mechanic is scheduled to service the same plane every Friday, it had better be the same physical airplane. And another thing: if two passengers board a plane together in San Francisco, with one holding a ticket to New York and the other a ticket to Amsterdam, are they on the same flight?
[...]
At the beginning of a mystery, we need to think of the murderer and the butler as two distinct entities, collecting information about each of them separately. After we discover that “the butler did it”, have we established that they are “the same entity”? Shall we require the modeling system to collapse their two representatives into one? I don’t know of any modeling system which can cope with that adequately.
The whole book is great, but the first chapter is truly stellar. By the end you've been challenged on notions of identity, quantity, change, categorization, and existence. By pulling out so many different examples with so many distinct problems, Kent shows just how essentially difficult representation is. It'd be one thing if you could say "yeah modeling identity is tough, but otherwise reality is straightforward", but when faced with many different tough aspects, it's so much easier to see that reality is by its nature difficult to model. His aspects are not intended to be comprehensive, but they get you started thinking more carefully.
After the first chapter the pace of the book relaxes a little. Chapter two introduces the "information system", the target of our efforts to encode reality. Note he wrote this in 1979, so a lot of it is either now obsolete or common knowledge. But some of it is still useful, so at least give it a skim.
Things pick up again after that: the next few chapters are about specific aspects of modeling information, such as identifiers, relationships, and attributes. It's a deeper dive than chapter one, where Kent tries to understand how the nature of things gives rise to all of the difficult examples he found.
The central problem with the version concept is that we can’t decide whether we are dealing with one thing or several. “The payroll program” is a singular concept, and a command to execute it is implicitly understood to refer to “the current version”. On the other hand, one sometimes refers explicitly to an old version; for example, in order to reconstruct how a certain error occurred last month, one may want to rerun the version of the program that was current then. In this context, we are explicitly aware of the several versions as distinct entities, and have to specify the desired version as part of the naming process.
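To make the "one thing or several" tension concrete, here's a minimal Python sketch. None of it comes from the book: the ProgramRegistry class, its release and resolve methods, and the dates are all made up for illustration. The bare name resolves to "the current version", while the name plus a date resolves to whichever version was current then.

```python
from bisect import bisect_right
from datetime import date
from typing import Dict, List, Optional, Tuple


class ProgramRegistry:
    """Hypothetical registry: one program name, many dated versions."""

    def __init__(self) -> None:
        # name -> list of (release date, version label), kept sorted by date
        self._versions: Dict[str, List[Tuple[date, str]]] = {}

    def release(self, name: str, released_on: date, version: str) -> None:
        versions = self._versions.setdefault(name, [])
        versions.append((released_on, version))
        versions.sort()

    def resolve(self, name: str, as_of: Optional[date] = None) -> str:
        """Return the version label for `name`.

        With no date, the name means "the current version". With a date,
        the version becomes part of the name: we return the one that was
        current on that day.
        """
        versions = self._versions[name]
        if as_of is None:
            return versions[-1][1]
        # Count how many versions were released on or before `as_of`.
        idx = bisect_right([released for released, _ in versions], as_of)
        if idx == 0:
            raise LookupError(f"no version of {name!r} existed on {as_of}")
        return versions[idx - 1][1]


registry = ProgramRegistry()
registry.release("payroll", date(1978, 1, 1), "v1")
registry.release("payroll", date(1978, 3, 1), "v2")

current = registry.resolve("payroll")                         # one thing
last_month = registry.resolve("payroll", date(1978, 2, 15))   # one of several
```

Note that the sketch dodges the question instead of answering it: whether "payroll" is one entity or two depends entirely on which way the caller asks.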
Chapter 7 switches back to representing data in a system, with a focus on the "record model", or "a fixed linear sequence of field values". I was expecting this section to be mostly obsolete, given we now have relational databases. But I was pleasantly surprised! By "records", Kent means any kind of collection of structured data, and his problems also apply to relational models. For example:
There is no provision for (no way to represent) relationships permitting multiple entity types in one domain, especially when those entity types have very different naming conventions.
Such relationships certainly do exist. Companies, government agencies, schools, and people will usually be treated as distinct entity types ⎯ but any of these might be a person’s employer. We may treat furniture and vehicles as distinct entity types, but they share a common relationship to their manufacturers. As a general example, consider an “owns” relationship: various kinds of things (employees, departments, divisions, locations) can own various kinds of things (furniture, vehicles, supplies, machines, buildings). Potentially each kind of thing might have a different identifier syntax, in terms of length, character set, variability, etc. Even worse, their names might have different qualification structure, e.g., department names are only unique within divisions, and hence a department name must always be qualified by a division name.
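Here's a rough Python sketch of why this bites in a fixed-field record. Everything in it is an assumption for illustration (the entity types, their identifier shapes, and the to_record helper are mine, not Kent's): each kind of owner and owned thing has a differently shaped identifier, and squeezing an "owns" fact into a four-field record means adding a type tag and inventing a string encoding for each key.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class EmployeeId:
    badge: int  # hypothetical: employees are identified by a number


@dataclass(frozen=True)
class DepartmentId:
    division: str  # department names are only unique within a division
    name: str


@dataclass(frozen=True)
class VehicleId:
    vin: str


@dataclass(frozen=True)
class FurnitureId:
    asset_tag: str


Owner = Union[EmployeeId, DepartmentId]
Owned = Union[VehicleId, FurnitureId]


@dataclass(frozen=True)
class Owns:
    owner: Owner
    owned: Owned


def key_of(entity: Union[Owner, Owned]) -> str:
    """Encode an identifier as a single string field (a made-up convention)."""
    if isinstance(entity, EmployeeId):
        return str(entity.badge)
    if isinstance(entity, DepartmentId):
        # The qualified name has to be packed into one field somehow.
        return f"{entity.division}/{entity.name}"
    if isinstance(entity, VehicleId):
        return entity.vin
    if isinstance(entity, FurnitureId):
        return entity.asset_tag
    raise TypeError(f"unknown entity type: {type(entity).__name__}")


def to_record(fact: Owns) -> Tuple[str, str, str, str]:
    """Flatten to (owner_type, owner_key, owned_type, owned_key)."""
    return (
        type(fact.owner).__name__, key_of(fact.owner),
        type(fact.owned).__name__, key_of(fact.owned),
    )


facts = [
    Owns(EmployeeId(1042), VehicleId("1HGCM82633A004352")),
    Owns(DepartmentId("Research", "Payroll"), FurnitureId("DESK-17")),
]
records = [to_record(fact) for fact in facts]
```

The flattened record works, but the key fields no longer mean anything on their own: you can't interpret owner_key without first reading owner_type, and nothing in the record itself stops a key from pointing at an entity that doesn't exist.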
Just for good measure, Kent closes out this section with problems specific to relational, hierarchical, and graph models of data. Finally, Kent finishes the book with some discussion of the philosophical and cultural implications of modeling reality. 230 pages total, half the length of Clean Code.
I really, really recommend reading this book. So how do you get your hands on it? That's the tough part. It's out of print, and used copies go for over a hundred dollars. There's no chance it'll ever get a reprint, either. After Kent died in 2005, Technics Publications picked up publication rights. As part of his revisions, the owner of Technics cut out about half the book and replaced it with "Steve's Notes", which are exceptionally good at missing the point. Remember that paragraph on the murderer and the butler? Here's the "Steve's Notes":
Yes, he advertised his own book right there. It reminds me of scammers who buy old social media accounts and use the accumulated history and reputation to scam people. There are several points where Kent's writing directly contradicts Steve's shilling, so Steve just cut the offending writing out.
In short, "Data and Reality, 3rd Edition" isn't Data and Reality. I'm sure Bill Kent would find that hilarious.
If you want to read the real Data and Reality, you're gonna have to download a PDF of the second edition. Like, for example, this pdf. Louder for the people in the back:
https://github.com/jhulick/bookstuff/blob/master/Data%20and%20Reality.pdf
Go read it, it's great.
Update for the Internets
This was sent as part of an email newsletter; you can subscribe here. Common topics are software history, formal methods, the theory of software engineering, and silly research dives. Updates are usually 1x a week. I also have a website where I put my more polished and heavily-edited writing (the newsletter is more for off-the-cuff stuff).