2000 words about arrays and tables
THEY'RE JUST FUNCTIONS
I'm way too discombobulated from getting next month's release of Logic for Programmers ready, so I'm pulling a idea from the slush pile. Basically I wanted to come up with a mental model of arrays as a concept that explained APL-style multidimensional arrays and tables but also why there weren't multitables.
So, arrays. In all languages they are basically the same: they map a sequence of numbers (I'll use 1..N
)1 to homogeneous values (values of a single type). This is in contrast to the other two foundational types, associative arrays (which map an arbitrary type to homogeneous values) and structs (which map a fixed set of keys to heterogeneous values). Arrays appear in PLs earlier than the other two, possibly because they have the simplest implementation and the most obvious application to scientific computing. The OG FORTRAN had arrays.
I'm interested in two structural extensions to arrays. The first, found in languages like nushell and frameworks like Pandas, is the table. Tables have string keys like a struct and indexes like an array. Each row is a struct, so you can get "all values in this column" or "all values for this row". They're heavily used in databases and data science.
The other extension is the N-dimensional array, mostly seen in APLs like Dyalog and J. Think of this like arrays-of-arrays(-of-arrays), except all arrays at the same depth have the same length. So [[1,2,3],[4]]
is not a 2D array, but [[1,2,3],[4,5,6]]
is. This means that N-arrays can be queried on any axis.
]x =: i. 3 3
0 1 2
3 4 5
6 7 8
0 { x NB. first row
0 1 2
0 {"1 x NB. first column
0 3 6
So, I've had some ideas on a conceptual model of arrays that explains all of these variations and possibly predicts new variations. I wrote up my notes and did the bare minimum of editing and polishing. Somehow it ended up being 2000 words.
1-dimensional arrays
A one-dimensional array is a function over 1..N
for some N.
To be clear this is math functions, not programming functions. Programming functions take values of a type and perform computations on them. Math functions take values of a fixed set and return values of another set. So the array [a, b, c, d]
can be represented by the function (1 -> a ++ 2 -> b ++ 3 -> c ++ 4 -> d)
. Let's write the set of all four element character arrays as 1..4 -> char
. 1..4
is the function's domain.
The set of all character arrays is the empty array + the functions with domain 1..1
+ the functions with domain 1..2
+ ... Let's call this set Array[Char]
. Our compilers can enforce that a type belongs to Array[Char]
, but some operations care about the more specific type, like matrix multiplication. This is either checked with the runtime type or, in exotic enough languages, with static dependent types.
(This is actually how TLA+ does things: the basic collection types are functions and sets, and a function with domain 1..N is a sequence.)
2-dimensional arrays
Now take the 3x4 matrix
i. 3 4
0 1 2 3
4 5 6 7
8 9 10 11
There are two equally valid ways to represent the array function:
- A function that takes a row and a column and returns the value at that index, so it would look like
f(r: 1..3, c: 1..4) -> Int
. - A function that takes a row and returns that column as an array, aka another function:
f(r: 1..3) -> g(c: 1..4) -> Int
.2
Man, (2) looks a lot like currying! In Haskell, functions can only have one parameter. If you write (+) 6 10
, (+) 6
first returns a new function f y = y + 6
, and then applies f 10
to get 16. So (+)
has the type signature Int -> Int -> Int
: it's a function that takes an Int
and returns a function of type Int -> Int
.3
Similarly, our 2D array can be represented as an array function that returns array functions: it has type 1..3 -> 1..4 -> Int
, meaning it takes a row index and returns 1..4 -> Int
, aka a single array.
(This differs from conventional array-of-arrays because it forces all of the subarrays to have the same domain, aka the same length. If we wanted to permit ragged arrays, we would instead have the type 1..3 -> Array[Int]
.)
Why is this useful? A couple of reasons. First of all, we can apply function transformations to arrays, like "combinators". For example, we can flip any function of type a -> b -> c
into a function of type b -> a -> c
. So given a function that takes rows and returns columns, we can produce one that takes columns and returns rows. That's just a matrix transposition!
Second, we can extend this to any number of dimensions: a three-dimensional array is one with type 1..M -> 1..N -> 1..O -> V
. We can still use function transformations to rearrange the array along any ordering of axes.
Speaking of dimensions:
What are dimensions, anyway
Okay, so now imagine we have a Row
× Col
grid of pixels, where each pixel is a struct of type Pixel(R: int, G: int, B: int)
. So the array is
Row -> Col -> Pixel
But we can also represent the Pixel struct with a function: Pixel(R: 0, G: 0, B: 255)
is the function where f(R) = 0
, f(G) = 0
, f(B) = 255
, making it a function of type {R, G, B} -> Int
. So the array is actually the function
Row -> Col -> {R, G, B} -> Int
And then we can rearrange the parameters of the function like this:
{R, G, B} -> Row -> Col -> Int
Even though the set {R, G, B}
is not of form 1..N, this clearly has a real meaning: f[R]
is the function mapping each coordinate to that coordinate's red value. What about Row -> {R, G, B} -> Col -> Int
? That's for each row, the 3 × Col array mapping each color to that row's intensities.
Really any finite set can be a "dimension". Recording the monitor over a span of time? Frame -> Row -> Col -> Color -> Int
. Recording a bunch of computers over some time? Computer -> Frame -> Row …
.
This is pretty common in constraint satisfaction! Like if you're conference trying to assign talks to talk slots, your array might be type (Day, Time, Room) -> Talk
, where Day/Time/Room are enumerations.
An implementation constraint is that most programming languages only allow integer indexes, so we have to replace Rooms and Colors with numerical enumerations over the set. As long as the set is finite, this is always possible, and for struct-functions, we can always choose the indexing on the lexicographic ordering of the keys. But we lose type safety.
Why tables are different
One more example: Day -> Hour -> Airport(name: str, flights: int, revenue: USD)
. Can we turn the struct into a dimension like before?
In this case, no. We were able to make Color
an axis because we could turn Pixel
into a Color -> Int
function, and we could only do that because all of the fields of the struct had the same type. This time, the fields are different types. So we can't convert {name, flights, revenue}
into an axis. 4 One thing we can do is convert it to three separate functions:
airport: Day -> Hour -> Str
flights: Day -> Hour -> Int
revenue: Day -> Hour -> USD
But we want to keep all of the data in one place. That's where tables come in: an array-of-structs is isomorphic to a struct-of-arrays:
AirportColumns(
airport: Day -> Hour -> Str,
flights: Day -> Hour -> Int,
revenue: Day -> Hour -> USD,
)
The table is a sort of both representations simultaneously. If this was a pandas dataframe, df["airport"]
would get the airport column, while df.loc[day1]
would get the first day's data. I don't think many table implementations support more than one axis dimension but there's no reason they couldn't.
These are also possible transforms:
Hour -> NamesAreHard(
airport: Day -> Str,
flights: Day -> Int,
revenue: Day -> USD,
)
Day -> Whatever(
airport: Hour -> Str,
flights: Hour -> Int,
revenue: Hour -> USD,
)
In my mental model, the heterogeneous struct acts as a "block" in the array. We can't remove it, we can only push an index into the fields or pull a shared column out. But there's no way to convert a heterogeneous table into an array.
Actually there is a terrible way
Most languages have unions or product types that let us say "this is a string OR integer". So we can make our airport data Day -> Hour -> AirportKey -> Int | Str | USD
. Heck, might as well just say it's Day -> Hour -> AirportKey -> Any
. But would anybody really be mad enough to use that in practice?
Oh wait J does exactly that. J has an opaque datatype called a "box". A "table" is a function Dim1 -> Dim2 -> Box
. You can see some examples of what that looks like here
Misc Thoughts and Questions
The heterogeneity barrier seems like it explains why we don't see multiple axes of table columns, while we do see multiple axes of array dimensions. But is that actually why? Is there a system out there that does have multiple columnar axes?
The array x = [[a, b, a], [b, b, b]]
has type 1..2 -> 1..3 -> {a, b}
. Can we rearrange it to 1..2 -> {a, b} -> 1..3
? No. But we can rearrange it to 1..2 -> {a, b} -> PowerSet(1..3)
, which maps rows and characters to columns with that character. [(a -> {1, 3} ++ b -> {2}), (a -> {} ++ b -> {1, 2, 3}]
.
We can also transform Row -> PowerSet(Col)
into Row -> Col -> Bool
, aka a boolean matrix. This makes sense to me as both forms are means of representing directed graphs.
Are other function combinators useful for thinking about arrays?
Does this model cover pivot tables? Can we extend it to relational data with multiple tables?
Systems Distributed Talk (will be) Online
The premier will be August 6 at 12 CST, here! I'll be there to answer questions / mock my own performance / generally make a fool of myself.
-
Sacrilege! But it turns out in this context, it's easier to use 1-indexing than 0-indexing. In the years since I wrote that article I've settled on "each indexing choice matches different kinds of mathematical work", so mathematicians and computer scientists are best served by being able to choose their index. But software engineers need consistency, and 0-indexing is overall a net better consistency pick. ↩
-
This is right-associative:
a -> b -> c
meansa -> (b -> c)
, not(a -> b) -> c
.(1..3 -> 1..4) -> Int
would be the associative array that maps length-3 arrays to integers. ↩ -
Technically it has type
Num a => a -> a -> a
, since(+)
works on floats too. ↩ -
Notice that if each
Airport
had a unique name, we could pull it out intoAirportName -> Airport(flights, revenue)
, but we still are stuck with two different values. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.
My new book, Logic for Programmers, is now in early access! Get it here.