Why do arrays start at 0?
It's not the reason you think. No, it's not that reason either.
I was at my wits end for this newsletter after my first two ideas hit research barriers. Then someone linked me this story about why arrays start at 0 and bam I had my topic. Specifically, arguing that said link is wrong and does not, in fact, fully explain why arrays start at 0.
You should read it for full context, but the gist of Hoye's argument is this:
- Back in the old days, all languages were either 1-indexed or "arbitrarily-indexed", meaning you could pick an "index range" like
0:10
or-1:25
. - Dr. Martin Richards was tasked to write a language for the IBM 7094. That computer had a special job deck to calculate yacht handicaps for the president of IBM, which took priority over all other jobs.
- To keep people from having their jobs cut short by yacht-racing, Richards designed the language to compile as fast as possible. One optimization was setting arrays to start at 0.
- The language, BCPL, went on to inspire B, which went on to inspire C. C was so popular it made everybody switch from 1-indexing to 0-indexing.
I have a lot of problems with this article, like the author quoting Richards and then immediately turning around and claiming he said the opposite, as well as mocking anybody who thinks that 0-indexing could possibly be good:
So if your answer started with “because in C…”, you’ve been repeating a good story you heard one time, without ever asking yourself if it’s true. [...] And that’s the most coherent argument I can find; there are dozens of other arguments for zero-indexing involving “natural numbers” or “elegance” or some other unresearched hippie voodoo nonsense that are either wrong or too dumb to rise to the level of wrong.
I may think that counting 0 as a natural number makes a lot of math more elegant, but clearly I'm just too dumb to rise to the level of wrong.
Regardless, Hoye's core point is it doesn't matter what the practical benefits are, the historical context is that barely anybody used 0-indexing before BCPL came along and ruined everything. He gives two example languages:
Algol 60 uses one-indexed arrays, and arrays in Fortran are arbitrarily indexed – they’re just a range from X to Y, and X and Y don’t even need to be positive integers.
He got these backwards. Algol 60 had index ranges while early FORTRAN was 1-indexed and later moved to index ranges. Both assumed you'd pick the range you'd use for your particular problem.
This roughly matches how it works in the sciences. I skimmed a few math and physics books I have and they regularly switch between 1-indexing and 0-indexing depending on the problem. Analysis, polynomials, and sums seem to always be 0-indexed, while matrix operations and enumerating membership is always done 1-indexed. Since a lot of early computer development was done to support the hard sciences, I'm not surprised they tried to be flexible.
What about other early languages?
Hoye claims that both 1-indexing and index ranges were arbitrarily widespread, while 0-indexing was nonexistent:
The fact of it is this: before pointers, structs, C and Unix existed, at a time when other languages with a lot of resources and (by the standard of the day) user populations behind them were one- or arbitrarily-indexed, somebody decided that the right thing was for arrays to start at zero.
I went through a lot of different pre-C languages, and here's what I found:
- O-indexed: LISP 1.5, APL
- 1-indexed: APL, BASIC, early FORTRAN
- Index ranges: ALGOL-60, later FORTRAN, CPL, SIMULA, CLU, PL/1, COBOL (I think?), Pascal, ALGOL-68, JOVIAL
(APL is a special case: it did not have index ranges, rather you could pick just the starting index, which must be 0 or 1.)
So fixed 1-indexing was not common. Rather, all the languages with user populations had index ranges. Now this isn't exactly a stunning defense of 0-indexing, because almost all of the index ranged languages had syntactic sugar for 1-indexing. Like in ALGOL array V[0:3]
started at 0, but array V[3]
started at one. Nevertheless, it shows that you can't simply say "0-indexing was never used until BCPL". And in fact if you check contemporaneous papers, at least some of them explicitly choose ranging from 0.
The 0's take over
Now for the interesting question: why did we move from index ranges to 0-indexing? Despite my overal dislike of the article, I can't deny that BCPL likely had a massive influence on this. Most modern languages have at least some C influence in them. But I don't think that's the whole story. Unrelated to BCPL, there are signs that people were seriously considering 0-indexing.
First of all, in 1982, Dijkstra disavowed both index ranges and 1-indexing. Dijkstra isn't someone who's going to be influenced by the trends of C. He even said that ALGOL 60 (his baby) and Pascal got it wrong!
Second, APL can have 0- or 1-indexing, Iverson exclusively uses 0-indexing in his book1 to discuss microcomputers:
Third, wire formats like EBDIC and ASCII "started from 0", since they considered the 0-byte a valid unit.
These all point to one reason why 0-indexing might be preferred: it matches machine semantics more. Hoye casually dismisses this argument by saying
The usual arguments involving pointer arithmetic and incrementing by sizeof(struct) and so forth describe features that are nice enough once you’ve got the hang of them, but they’re also post-facto justifications. This is obvious if you take the most cursory look at the history of programming languages; C inherited its array semantics from B, which inherited them in turn from BCPL, and though BCPL arrays are zero-origin, the language doesn’t support pointer arithmetic, much less data structures.
However, right after that he quotes the inventor of BCPL:
BCPL was essentially designed as typeless language close to machine code. Just as in machine code registers are typically all the same size and contain values that represent almost anything, such as integers, machine addresses, truth values, characters, etc. BCPL has typeless variables just like machine registers capable of representing anything. If a BCPL variable represents a pointer, it points to one or more consecutive words of memory. These words are the same size as BCPL variables. Just as machine code allows address arithmetic so does BCPL, so if p is a pointer p+1 is a pointer to the next word after the one p points to. Naturally p+0 has the same value as p.
This directly contradicts what Richards is saying, so Hoye just handwaves it away as really being about compilation efficiency, not about mechanical sympathy.
Unfounded speculation
Originally languages used index ranges because that let people pick the appropriate abstraction for their problem. They defaulted to 1 because that's what humans are used to. Over time, languages moved away from index ranges maybe because it was too complicated to implement, and/or made collaboration harder (you don't know your imported library's style), and/or made resizing arrays harder. Some languages like BASIC went for 1-indexing because of familiarity. Other languages, like BCPL, went for 0-indexing because that's closer to the machine level, as we see with APL and wire formats.
(And also it represents the natural numbers better, if you're willing to accept some hippie voodoo nonsense.)
If this theory is correct, we'd have seen 0-indexing gradually overtake 1-indexing even without BCPL, but BCPL dramatically accelerated the timeline. And IMO that's the right choice, if you have to pick one, I'd much rather have 0-indexing than 1-indexing.2
I want to reiterate that this is speculation. I don't know if it's possible to find a definite answer, but if it is it would take a lot more research time then I'm willing to put into this newsletter.
Lessons
- Things can have more than one cause. Don't trust easy explanations.
- Always look for information that refutes your theory, not just information that supports it. Hoye stopped as soon as he had a satisfying answer and didn't keep researching.
- Don't be a dick.
- It is tragically easy to trick me into doing free research.
Update for the Internets
This was sent as part of an email newsletter; you can subscribe here. Common topics are software history, formal methods, the theory of software engineering, and silly research dives. Updates are usually 1x a week. I also have a website where I put my polished writing (the newsletter is more for off-the-cuff stuff).
-
Just one example: TLA+ is 1-indexed, meaning that you can't wrap around a list with
l[x % Len(l)]
. You have to dol[IF x % Len(l) = 0 THEN Len(l) ELSE x % Len(l)]
, which is dumb. ↩
If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.