Are Efficiency and Horizontal Scalability at odds?
Why are scalable systems locally-inefficient, and locally-efficient systems unscalable? Plus, new book release!
Sorry for missing the newsletter last week! I started writing on Monday as normal, and by Wednesday the piece (about the hierarchy of controls) was 2000 words and not close to done. So now it'll be a blog post sometime later this month.
I also just released a new version of Logic for Programmers! 0.7 adds a bunch of new content (type invariants, modeling access policies, rewrites of the first chapters) but more importantly has new fonts that are more legible than the old ones. Go check it out!
For this week's newsletter I want to brainstorm an idea I've been noodling over for a while. Say we have a computational task, like running a simulation or searching a very large graph, and it's taking too long to complete on a computer. There are generally three things we can do to make it faster:
- Buy a faster computer ("vertical scaling")
- Modify the software to use the computer's resources better ("efficiency")
- Modify the software to use multiple computers ("horizontal scaling")
(Splitting single-threaded software across multiple threads/processes is sort of a blend of (2) and (3).)
The big benefit of (1) is that we (usually) don't have to make any changes to the software to get a speedup. The downside is that for the past couple of decades computers haven't gotten much faster, except in ways that require recoding (like GPUs and multicore). This means we rely on (2) and (3), and we can do both to a point. I've noticed, though, that horizontal scaling seems to conflict with efficiency. Software optimized to scale well tends to be worse for the N=1 case than software optimized to, um, be optimized.
Are there reasons to expect this? It seems reasonable that design goals of software are generally in conflict, purely because exclusively optimizing for one property means making decisions that impede other properties. But is there something in the nature of "efficiency" and "horizontal scalability" that make them especially disjoint?
This isn't me trying to explain a fully coherent idea, more me trying to figure this all out for myself. Also, I'm probably getting some hardware stuff wrong.
Amdahl's Law
According to Amdahl's Law, the maximum speedup by parallelization is constrained by the proportion of the work that can be parallelized. If 80% of algorithm X is parallelizable, the maximum speedup from horizontal scaling is 5x. If algorithm Y is 25% parallelizable, the maximum speedup is only 1.3x.
If you need horizontal scalability, you want to use algorithm X, even if Y is naturally 3x faster. But if Y was 4x faster, you'd prefer it to X. Maximal scalability means finding the optimal balance between baseline speed and parallelizability. Maximal efficiency means just optimizing baseline speed.
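To make that tradeoff concrete, here's a quick back-of-the-envelope sketch in Python. The 80%/25% fractions and the 3x/4x baselines are the made-up numbers from above, and 64 workers is an arbitrary choice of mine:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Maximum speedup predicted by Amdahl's law."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / workers)

def runtime(baseline_time, parallel_fraction, workers):
    """Wall-clock time, relative to algorithm X's single-worker baseline of 1.0."""
    return baseline_time / amdahl_speedup(parallel_fraction, workers)

workers = 64
print(runtime(1.0, 0.80, workers))    # X:             ~0.21
print(runtime(1 / 3, 0.25, workers))  # Y, 3x faster:  ~0.25 -> X still wins
print(runtime(1 / 4, 0.25, workers))  # Y, 4x faster:  ~0.19 -> now Y wins
```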
Coordination Overhead
Distributed algorithms require more coordination. To add a list of numbers in parallel via fork-join, we'd do something like this:
- Split the list into N sublists
- Fork a new thread/process for each sublist
- Wait for each thread/process to finish
- Add the sums together.
(1), (2), and (3) all add overhead to the algorithm. At the very least, it's extra lines of code to execute, but it can also mean inter-process communication or network hops. Distribution also means you have fewer natural correctness guarantees, so you need more administrative overhead to avoid race conditions.
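Here's roughly what that fork-join sum looks like in Python with a process pool (a sketch; the chunk count and input list are placeholders). Everything other than the two calls to sum is the coordination overhead in question:

```python
from concurrent.futures import ProcessPoolExecutor

def parallel_sum(numbers, n_chunks=4):
    # Split the list into N sublists.
    chunk_size = -(-len(numbers) // n_chunks)  # ceiling division
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]

    # Fork a worker per sublist and wait for each one to finish.
    with ProcessPoolExecutor(max_workers=n_chunks) as pool:
        partial_sums = list(pool.map(sum, chunks))

    # Add the partial sums together.
    return sum(partial_sums)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # same answer as plain sum(), more machinery
```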
Real world example: Historically, CPython has had a "global interpreter lock" (GIL). In multithreaded code, only one thread could execute Python code at a time (others could execute C code). The newest version supports disabling the GIL, which comes at a 40% overhead for single-threaded programs. Supposedly the difference is because the specializing adaptive interpreter isn't thread-safe yet. The Python team is hoping to get it down to "only" 10%.
Scaling loses shared resources
I'd say that intra-machine scaling (multiple threads/processes) feels qualitatively different than inter-machine scaling. Part of that is that intra-machine scaling is "capped" while inter-machine is not. But there's also a difference in what assumptions you can make about shared resources. Starting from the baseline of a single-threaded program:
- Threads have a much harder time sharing CPU caches (you have to manually mess with affinities)
- Processes have a much harder time sharing RAM (I think you have to use mmap?)
- Machines can't share cache, RAM, or disk, period.
It's a lot easier to solve a problem when the whole thing fits in RAM. But if you split a 50 GB problem across three machines, no single machine has the whole thing in RAM by default, even though each machine has 64 GB. Scaling also means that separate machines can't reuse resources like database connections.
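As a tiny illustration of the second bullet above: threads can just read the same object, but once you split into processes you need explicit plumbing. A minimal sketch with Python's multiprocessing.shared_memory (the payload and sizes are just for illustration):

```python
from multiprocessing import shared_memory

# Parent process: put some bytes where other processes can find them.
data = b"fits-in-RAM payload"
shm = shared_memory.SharedMemory(create=True, size=len(data))
shm.buf[:len(data)] = data

# A child process has to attach by name instead of getting the object for free:
# other = shared_memory.SharedMemory(name=shm.name)
# payload = bytes(other.buf[:19])  # and it has to know the size, too

shm.close()
shm.unlink()  # and someone has to remember to release the block
```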
Efficiency comes from limits
I think the two previous points tie together in the idea that maximal efficiency comes from being able to make assumptions about the system. If we know the exact sequence of computations, we can aim to minimize cache misses. If we don't have to worry about thread-safety, tracking references is dramatically simpler. If we have all of the data in a single database, our query planner has more room to work with. At various tiers of scaling these assumptions are no longer guaranteed and we lose the corresponding optimizations.
Sometimes these assumptions are implicit and crop up in odd places. Like if you're working at a scale where you need multiple synced databases, you might want to use UUIDs instead of numbers for keys. But then you lose the assumption "recently inserted rows are close together in the index", which I've read can lead to significant slowdowns.
This suggests that if you can find a limit somewhere else, you can get both high horizontal scaling and high efficiency. Supposedly the TigerBeetle database has both, but that could be because they limit all records to accounts and transfers. This means every record fits in exactly 128 bytes.
Does this mean that "assumptions" could be both "assumptions about the computing environment" and "assumptions about the problem"? In the famous essay Scalability! But at what COST?, Frank McSherry shows that his single-threaded laptop could outperform 128-node "big data systems" on PageRank and graph connectivity (via label propagation). Afterwards, he discusses how a different algorithm solves graph connectivity even faster:
[Union find] is more lines of code than label propagation, but it is 10x faster and 100x less embarrassing. … The union-find algorithm is fundamentally incompatible with the graph computation approaches Giraph, GraphLab, and GraphX put forward (the so-called “think like a vertex” model).
The interesting thing to me is that his alternative makes more "assumptions" than the systems he's comparing to. He can "assume" a fixed goal and optimize the code for that goal. The "big data systems" are trying to be general-purpose compute platforms and have to pick a model that supports the widest range of possible problems.
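For a sense of how little machinery the specialized approach needs, here's a minimal single-threaded union-find sketch in Python (McSherry's actual code is Rust; this is just the shape of the algorithm):

```python
def connected_components(num_nodes, edges):
    """Label each node with a representative of its connected component."""
    parent = list(range(num_nodes))

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # One pass over the edge list: merge the endpoints' components.
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    return [find(x) for x in range(num_nodes)]
```

The whole thing assumes the parent array fits in one machine's memory and that edges can be streamed in any order, which is exactly the kind of assumption the "think like a vertex" systems can't make.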
A few years back I wrote clever vs insightful code. I think what I'm trying to say here is that efficiency comes from having insight into your problem and environment.
(Last thought to shove in here: to exploit assumptions, you need control. Carefully arranging your data to fit in L1 doesn't matter if your programming language doesn't let you control where things are stored!)
Is there a cultural aspect?
Maybe there's also a cultural element to this conflict. What if the engineers interested in "efficiency" are different from the engineers interested in "horizontal scaling"?
At my first job the data scientists set up a Hadoop cluster for their relatively small dataset, only a few dozen gigabytes or so. One of the senior software engineers saw this and said "big data is stupid." To prove it, he took one of their example queries, wrote a script in Go to compute the same thing, and optimized it to run faster on his machine.
At the time I was like "yeah, you're right, big data IS stupid!" But I think now that we both missed something obvious: with the "scalable" solution, the data scientists didn't have to write an optimized script for every single query. Optimizing code is hard, adding more machines is easy!
The highest tier of horizontal scaling is usually something large businesses want, and large businesses like problems that can be solved purely with money. Maximizing efficiency requires a lot of knowledge-intensive human labour, so it's less appealing as an investment. Then again, I've seen a lot of work on making scalable systems more efficient, such as evenly balancing heterogeneous workloads. Maybe in the largest systems intra-machine efficiency is just too small-scale a problem.
I'm not sure where this fits in, but scaling a volume of tasks conflicts less than scaling individual tasks
If you have 1,000 machines and need to crunch one big graph, you probably want the most scalable algorithm. If you instead have 50,000 small graphs, you probably want the most efficient algorithm, which you then run on all 1,000 machines. When we call a problem embarrassingly parallel, we usually mean it's easy to horizontally scale. But it's also one that's easy to make more efficient, because local optimizations don't affect the scaling!
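Sketching that out with the union-find function from earlier (the toy graphs stand in for the "50,000 small graphs"):

```python
from multiprocessing import Pool

# Toy stand-in for a pile of small graphs: (num_nodes, edge_list) pairs.
small_graphs = [(4, [(0, 1), (2, 3)]), (3, [(0, 1), (1, 2)])] * 25_000

def label_one(graph):
    num_nodes, edges = graph
    return connected_components(num_nodes, edges)  # the union-find sketch above

if __name__ == "__main__":
    with Pool() as pool:
        # Each task runs the most efficient single-machine algorithm;
        # the "scaling" is just one small graph per task, with no coordination between them.
        results = pool.map(label_one, small_graphs)
```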
Okay that's enough brainstorming for one week.
Blog Rec
Whenever I think about optimization as a skill, the first article that comes to mind is matklad's Push Ifs Up And Fors Down. I'd never have considered on my own that inlining loops into functions could be such a huge performance win. The blog has a lot of other posts on the nuts-and-bolts of systems languages, optimization, and concurrency.