Three approaches to edge cases in data models
Edge case poisoning
This post is somewhat a response to Hillel Wayne’s recent post on edge case poisoning. It should be understandable without reading his post, but I recommend starting there if you’ve got the time.
This newsletter is a bit of a first draft. I may try to polish it further into a full blog post, so I would love your feedback.
Hillel uses the example of modeling recipe ingredients to explore the phenomenon of data models “poisoned” by edge cases: as you try to handle more and more edge cases in a data model, you often end up increasing complexity for all users of the data model, even those that might not need to handle that particular edge case. The end result is a system that’s harder for everyone to work with, even the “common case” users.
He contrasts the “naïve” data model for an ingredient list[1]:
struct Recipe { ingredients: Vec<Ingredient> }
struct Ingredient { food: String, grams: usize }
with one that handles a large variety of edge cases he encountered in his cookbook of choice:
struct Recipe { subrecipes: Vec<Subrecipe> }
struct Subrecipe {
    required: Vec<Ingredient>,
    optional: Vec<Ingredient>,
}
enum Ingredient {
    Ingredient(Item, Measure),
    Either(Box<Ingredient>, Box<Ingredient>),
}
enum Item {
    Inedible(String),
    Food(String),
    Recipe(Box<Recipe>),
}
enum Measure {
    Mass { grams: usize },
    Quantity { count: usize },
}
Three approaches to edge cases
Hillel talks a bit about solutions to this problem, but he’s mostly focused on defining and describing the problem. I want to take his definition and example as a jumping-off point to talk about three broad approaches or philosophies to addressing this problem. I’m going to use his example to provide concrete data types, but I’m much more interested in the philosophies that underlie these solutions than the particular types I present.
I agree with Hillel that the core problem here — dealing with a domain that “feels” simple but is “poisoned” in practice with edge cases that “leak” out of appropriate proportion to their commonality — is ubiquitous in software engineering. I similarly feel that the three philosophies I outline here have fairly broad applicability as lenses for viewing software design.
Ignore the edge cases
The first approach is just to ignore (some of) the edge cases. In this example, perhaps we just adopt Hillel’s initial model as-is, or maybe we pick just the one or two most relevant extensions and define the rest out of existence. Perhaps we decide that it’s important to support ingredients that come in discrete quantities, but that we don’t support optional ingredients, inedible ingredients, or other weird stuff:
struct Recipe { ingredients: Vec<Ingredient> }
struct Ingredient {
    food: String,
    quantity: Measure,
}
enum Measure {
    Mass { grams: usize },
    Quantity { count: usize },
}
I want to emphasize that this approach isn’t a straw man, and isn’t even necessarily a cop out. The easiest way to solve a problem is to not have it in the first place, and if you can get away with ignoring some unusual edge cases, your system will be simpler for it forever, often in compounding ways.
This fact is, I think, an under-observed part of why building applications for large user bases is so hard: if you want to target a narrow userbase, you often have more freedom to just straight-up ignore edge cases that make your life hard. If you want to build an application that is everything to everyone, though, you have to handle them, even if their complexity leaks everywhere.
By ignoring an edge case, you aren’t necessarily making the decision that it doesn’t exist or that handling it doesn’t have value; you’re just deciding that the cost of poisoning your data model with it is higher than the cost of leaving that feature out. This is a value judgment that can be right or wrong in different contexts for different features.
One of the biggest risks of ignoring edge cases is that it really sucks to be one of the users whose edge cases got defined out of existence. I will gesture vaguely in the directions of Falsehoods Programmers Believe About Names and Seeing Like a State for (what I contend are) two very different pieces on this phenomenon.
Lift edge cases into your types
In writing his recipe example, Hillel made a deliberate choice: for each new edge case he encountered, he modified his data types to explicitly model it, sometimes by drastically changing the data model. He did this, I presume, to make his point about how the mere existence of these edge cases can “poison” the data model and make it more complex for everyone.
However, we can view this choice through a different lens. If we believe that these edge cases are important, and that it’s very important for our system to consider them, we can view this “poisoning” as a feature instead. Yes, the argument goes, this data model does force every user to think about optional ingredients, whether or not they want to. But optional ingredients are real, and any correct implementation will need to deal with them eventually. Isn’t it better that we front-load that thinking, instead of only finding it after we ship?
We can see this dynamic in an example question Hillel poses: For a given data model, how would I find out if a recipe has ingredient X? He (reasonably) presents the complexity of answering this question in our second data model as a cost of the “edge-case poisoned” model. However, on closer inspection, we can view this very complexity as a feature of the data model:
Why do we want to know if a recipe has ingredient X? If we have an allergy to X, it’s probably fine to ignore optional ingredients, or cases where the recipe has substitutions available — we can leave X out and still make the recipe. However, if we happen to have a bunch of X on hand and we’re looking to get rid of it, we might want the opposite semantics! Just because the naïve user thinks that “Does the recipe contain X?” is a simple question doesn’t make it so, and our data model has helped to surface that fact. It has, in fact, gone even further: thinking carefully about our types even nudges us in the direction of a better-phrased question! This ability to prod us towards rigorous thinking is, in the best case, a very strong argument for rich data types and careful data modeling.
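To make that concrete, here is a minimal sketch of both readings of the question against the rich model above. The Sense enum and the function names are my own invention, not Hillel’s; the point is that the types force us to decide which question we’re actually asking:

#[derive(Clone, Copy, PartialEq)]
enum Sense {
    // "I'm allergic to X": does X *unavoidably* appear?
    Unavoidable,
    // "I have a surplus of X": *could* this recipe use X?
    Possible,
}

fn recipe_contains(recipe: &Recipe, food: &str, sense: Sense) -> bool {
    recipe.subrecipes.iter().any(|sub| {
        sub.required.iter().any(|i| ingredient_contains(i, food, sense))
            // Optional ingredients can always be left out, so they only
            // matter when asking what the recipe *could* use.
            || (sense == Sense::Possible
                && sub.optional.iter().any(|i| ingredient_contains(i, food, sense)))
    })
}

fn ingredient_contains(ingredient: &Ingredient, food: &str, sense: Sense) -> bool {
    match ingredient {
        Ingredient::Ingredient(item, _measure) => match item {
            Item::Food(name) => name == food,
            Item::Inedible(_) => false,
            Item::Recipe(sub) => recipe_contains(sub, food, sense),
        },
        // A substitution only makes X unavoidable if *both* alternatives
        // contain it; either alternative suffices if we want to use X up.
        Ingredient::Either(a, b) => match sense {
            Sense::Unavoidable => {
                ingredient_contains(a, food, sense) && ingredient_contains(b, food, sense)
            }
            Sense::Possible => {
                ingredient_contains(a, food, sense) || ingredient_contains(b, food, sense)
            }
        },
    }
}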
I also want to call out one downside of this approach that Hillel didn’t mention, although I think it is somewhat implicit in his journey: Representing edge cases explicitly in your types can be brittle. If every new edge case or complexity you encounter means a new type — sometimes, as in the case of adding subrecipes, a drastically new type — adding features can require rewriting large swaths of your code. A good type system and good tools can help, but rarely eliminate that work. And if your data model isn’t just internal, but is, say, exposed over a network API, you may have an even harsher migration in your future. These kinds of rich types, in my experience, work best if you are fortunate enough to be able to get them close-to-right very early on in your development cycle, and then mostly leave them alone.
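As a small illustration of that brittleness, consider a hypothetical shopping_list helper (my example, not Hillel’s):

// Before subrecipes existed, a caller might plausibly have written:
//
//     fn shopping_list(r: &Recipe) -> Vec<&Ingredient> {
//         r.ingredients.iter().collect()
//     }
//
// Once Recipe becomes { subrecipes: Vec<Subrecipe> }, that code no longer
// compiles, and every such call site needs a rewrite along these lines:
fn shopping_list(r: &Recipe) -> Vec<&Ingredient> {
    r.subrecipes
        .iter()
        .flat_map(|sub| sub.required.iter().chain(sub.optional.iter()))
        .collect()
}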
Incremental data models
The final approach is a bit of a compromise between the two positions. We try to make a data model that “looks like” the simple case, and lets most users mostly treat it that way, but simultaneously makes it possible to represent and work with the weird cases. We might summarize this as the “easy things should be easy, and hard things should be possible” approach[2].
For our recipe example, we might compromise on something like the following:
struct Recipe {
    name: RecipeName,
    ingredients: Vec<Entry>,
}
struct Entry {
    ingredient: Ingredient,
    // Most recipes will ignore these fields:
    optional: bool,
    alternatives: Vec<Ingredient>,
    recipe: Option<RecipeName>,
}
struct Ingredient(Item, Measure);
// Item and Measure elided, pick your favorite definitions.
// A key point here is that we can mix and match all three approaches
// in the same system.
Hillel makes the argument that the simple data model is “philosophically” correct, and we try to hew to that here: An ingredient list is a single list of entries, each of which has just one ingredient. A consumer who doesn’t want to think about optional ingredients and all those other pesky edge cases can just ignore all fields in Entry other than ingredient, and just blindly look at the Item::name field without first checking the kind.
However, at the same time, this data model is capable of representing (nearly) all the nuance and complexity of the richly typed model[3], and a consumer that is careful to always check the appropriate fields before making assumptions can handle them accordingly.
Like all good compromises, this choice gives us either the best of both worlds or the worst of both worlds depending on our tastes:
- When writing new code that only worries about the common cases, it’s very nearly as simple as the naïve data model. If I want to write r.ingredients.iter().any(|i| i.ingredient.0.name == food), I can. It’s even right for the case of recipes with no “funny business” going on.
  - But it’s not quite as simple, and it’s not necessarily clear from the type which fields you can safely “mostly-ignore.” Absent clear documentation or other affordances, our types can be a mess and we might fixate on the wrong things.
- It is possible to write code just as correctly as in the type-rich model — as mentioned, we can represent all the same weird cases, so we can also act on them accordingly.
  - But it’s harder, because the types are “flatter” and the type system gives us less help. This approach is also less likely to make illegal states unrepresentable, so we risk ending up with incoherent combinations of fields that we didn’t in approach (2).
- It is easier to evolve — when we added new fields to Entry, we hardly had to update any existing code, which just ignored them.
  - But when we did, we made vast swaths of existing code potentially subtly wrong — if it should care about those new fields — and the compiler didn’t help us at all.
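To make the second of those tradeoffs concrete, here is a sketch of the “careful” version of the contains-X check against the flat model, reusing the hypothetical Sense enum from earlier. Since Item was elided above, its name field here is an assumption of mine:

fn entry_uses(entry: &Entry, food: &str, sense: Sense) -> bool {
    // The primary ingredient, plus any listed alternatives.
    let mut choices = std::iter::once(&entry.ingredient).chain(entry.alternatives.iter());
    match sense {
        // X is unavoidable only if the entry is required and *every*
        // choice contains it.
        // (Assumes Item has a `name: String` field; its definition
        // was elided above.)
        Sense::Unavoidable => !entry.optional && choices.all(|i| i.0.name == food),
        // X is usable if *any* choice contains it, optional or not.
        Sense::Possible => choices.any(|i| i.0.name == food),
    }
}

fn recipe_contains(recipe: &Recipe, food: &str, sense: Sense) -> bool {
    // Note: entries with a `recipe` reference point at other recipes by
    // name; resolving those would require a recipe database we don't
    // model here.
    recipe.ingredients.iter().any(|e| entry_uses(e, food, sense))
}

Nothing in the types forces us to write this version instead of the one-liner; the burden of remembering to check the extra fields falls entirely on the author.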
Is this the right tradeoff? I think, as with so many questions, there isn’t a one-size-fits-all answer. The answer depends on your particular domain, and demands you weigh the concrete costs and benefits for your particular use case. That said, I think one key question (not the only one!) to ask is:
Will the “naive” code that ignores the “edge-case” configurations have somewhat sensible default behavior if handed one of those data models?
In this case, I think the answer is mostly yes — we will essentially operate on a projected version of the recipe where all ingredients are required, we are forced to the first alternate, and we have no knowledge of other recipes in this database. That’s not ideal, but it’s probably fine, and so I would at least consider this approach in designing a recipe system.
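Concretely, the projection that naive code implicitly operates on might look like this sketch. NaiveRecipe and project are illustrative names of mine, and I assume Ingredient derives Clone:

// The naive view: every entry is kept whether or not it's optional, the
// primary ingredient wins over any alternatives, and cross-recipe
// references are forgotten entirely.
struct NaiveRecipe { ingredients: Vec<Ingredient> }

fn project(recipe: &Recipe) -> NaiveRecipe {
    NaiveRecipe {
        ingredients: recipe
            .ingredients
            .iter()
            .map(|entry| entry.ingredient.clone())
            .collect(),
    }
}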
Conclusion
This question, I think, is really core to the problem of data modeling in software. I truly think that all three of the approaches I’ve mentioned are appropriate in different contexts. I especially think that the distinction between the three gets at a place where engineers sometimes talk past each other: Option (2) in some sense optimizes for correctness or completeness, whereas (3) and especially (1) prefer to optimize for “ease of getting shit done.” Those are different — and both valuable! — goals. If two engineers agree on the characteristics of two designs but weigh those goals differently, they aren’t having a purely technical disagreement, but are rather arguing about values and priorities. And no amount of purely technical argumentation can resolve an underlying values disagreement.
Finally, to close on a perhaps-inflammatory note, I will assert that, as a generalization, (2) is the philosophy of Rust, and (3) is the philosophy of Go. Rust will damn well force you to think about as many edge cases as it can manage up front, and in return your code is likely to be far more robust. Go, on the flip side, will make it possible to handle those edge cases, but if you want to just ignore all those error return types and blithely assume everything’s going to be fine, it’s not going to stop you.
[1] I’m translating his pseudocode into Rust for the purposes of this article. I like to use concrete languages for explicitness, and I find Rust’s type syntax pleasingly expressive.

[2] If I had to give the other approaches similar taglines, they might be “Everything should be easy” for (1), and “The easy-thing solution should usually handle the hard things” for (2).

[3] The astute reader may ask: “But what about subrecipes? You’ve left those out!” I haven’t — I’ve just adopted the (undocumented!) convention that any recipe where all ingredients have the recipe field set should be treated as a “meta-recipe,” in the way the old data model treated subrecipes. Is that a good idea? Well, it has the virtue that most users can ignore it; but also, it’s easy to accidentally forget about. Those are precisely the common characteristics of this approach.