Columnar Storage is Normalization

was

        April 20, 2026

Columnar Storage is Normalization

        Something I didn't understand for a while is that the process of turning row-oriented data into column-oriented data isn't a totally bespoke, foreign concept in the realm of databases. It's still of the relational abstraction. Or can be.
As an example, say we have this data:
data = [
    { "name": "Smudge", "colour": "black" },
    { "name": "Sissel", "colour": "grey" },
    { "name": "Hamlet", "colour": "black" }
]

This represents a table in a relational database. Let's assume this was a table in a relational database and we had to do all sorts of disk-access, whatever, to access any particular part of the data. This representation has some nice properties.
It's easy to add a new row: we can just construct a row:
{ "name": "Petee", "colour": "black" }

and add it to the end of our already-existing list. On disk, we probably only have to touch a couple pages to do it. And if our row were really wide, in that it had a whole bunch of columns, that wouldn't really change. It would still have that nice property.
This is also true of looking up a row. Since all of a row's columns are stored next to each other, it's very fast to just pull that row out from wherever its stored.
Conversely, if we were to want to, say, compute a histogram of the different pet colours, we have to read quite a lot of data we don't care about in order to do so. 
This is a row-oriented representation of the data. A column-oriented representation would look something like this:
data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

This has all the opposite tradeoffs of the row-oriented design: if we only care about colour, we can very effectively read only that data. We don't have to read their names at all. But modifying the data, or reading a specific row, becomes harder. We have to go all over the place to do them both. If we want the second row, we have to go to the second index in each column to reconstruct the original row.
So, one way to think about this shaping of the data is that it's encoding level. That it lives at a level of abstraction firmly beneath that of the data model: a SQL engine on top of it can't distinguish between the two, except via the performance characteristics of various queries.
A different way to think about columnarization like this is that it's akin to very extreme type of database normalization.
Instead of one wide table that's represented by a bunch of vectors of data, you might think of columnar data as a set of tables which all have a primary key plus one additional attribute:
Denormalized table:
+----+------+-----+
| id | name | age |
+----+------+-----+
| 12 | Bob  |  30 |
| 93 | Tom  |  35 |
| 27 | Kim  |  28 |
+----+------+-----+

Normalized tables:
Name
+----+------+
| id | name |
+----+------+
| 12 | Bob  |
| 93 | Tom  |
| 27 | Kim  |
+----+------+

Age
+----+-----+
| id | age |
+----+-----+
| 12 |  30 |
| 93 |  35 |
| 27 |  28 |
+----+-----+

We can easily reconstruct the original table with a join on the id column.
In the context of a columnar-stored table, you can think of the primary key as the ordinal position of a given piece of data.
Our original data:
data = {
    "name": [
        "Smudge",
        "Sissel",
        "Hamlet"
    ],
    "colour": [
        "black",
        "grey",
        "black"
    ],
}

Looks like this:
+----+--------+
| id |   name |
+----+--------+
|  0 | Smudge |
|  1 | Sissel |
|  2 | Hamlet |
+----+--------+

+----+--------+
| id | colour |
+----+--------+
|  0 |  black |
|  1 |   grey |
|  2 |  black |
+----+--------+

However, the id column is just implied by the position in the arrays:
+--------+
|   name |
+--------+
| Smudge |
| Sissel |
| Hamlet |
+--------+

+--------+
| colour |
+--------+
|  black |
|   grey |
|  black |
+--------+

I think the value of this perspective is that it unifies a lot of traditional query-processing operations, like projections, and joins, with manipulation of data formats. Some, many, most times, you probably should think about data formats like this as an implementation detail that queries are logically blind to, but it's a useful mental model to realize that "reconstructing a row from columnar storage" doesn't just look like performing a join, it is a join.

                                Don't miss what's next. Subscribe to NULL BITMAP by Justin Jaffray:

            Email address (required)

                    ← Newer

                Where Optimizations Come From

                    Older →

                More tools for testing SQL dialects

          Add a comment:

    You're not signed in. Posting this comment will subscribe you to this newsletter with the email address you enter below.