Welcome to my newsletter! And: Performance as hardware utilization
Hello! Welcome to the newsletter.
I started this year with a goal of writing weekly posts on my blog, but ever since COVID-19 lockdowns started, I've been really struggling to get ideas into a form where I feel confident writing about them. I'm hoping that a switch to a newsletter format will help motivate me to get back into it, and also help me feel more comfortable getting out less-fully-developed ideas, and create a space to share things I’m thinking about without feeling like I need to have concrete answers or frameworks.
I’ve seeded this mailing list with my existing blog’s mailing list. I'm inclined to deprecate that one and also start forwarding blog posts to this list, but if you feel strongly about wanting separate lists, please let me know.
I’d love these posts to start conversations, so if you disagree with something I say or have something to add, I’d love to hear from you! Use the Reply button, or find me on IRC or Twitter.
Performance as hardware utilization
For this inaugural post, I want to talk through an idea I’ve been noodling on for a while. I don’t have a concrete takeaway or conclusion, but I think this is an interesting lens and I’d love to hear any thoughts you have.
I want to talk about viewing software performance in terms of the system’s utilization of some underlying resource, typically the underlying hardware. Most concretely, this means finding ways to report performance, not in terms of a wall-clock time or absolute resources consumed, but as a ratio between “hardware resources used” and some higher-level measure of “work performed.” However, I also find it valuable more abstractly as a frame of thought for looking at systems.
Let’s look at some concrete examples, and then I’ll muse about some of the more general or abstracted places I see this idea.
Write amplification in RocksDB
When I helped run and operate a lot of MongoDB, I paid close attention to Facebook’s RocksDB and to the work of Mark Callaghan, who works on RocksDB and was working on MongoRocks at Facebook.
The RocksDB developers — like most database teams — think about performance in a lot of ways, but a consistent metric they think and talk about a lot is “write amplification”, often alongside its cousins, read amplification and space amplification. Write amplification is the ratio between the write traffic a storage engine issues to the underlying disk and the write traffic the application sends to the engine itself. If an application is writing 10MB/s of records to the database, and the database is writing 30MB/s to its underlying disk, that configuration is seeing 3x write amplification.
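As a quick back-of-the-envelope sketch (using the same hypothetical numbers as above), the metric is nothing more than a ratio:

```python
# Hypothetical numbers, just to illustrate the ratio.
app_write_rate_mb_s = 10.0   # what the application writes to the database
disk_write_rate_mb_s = 30.0  # what the engine writes to the underlying disk

write_amplification = disk_write_rate_mb_s / app_write_rate_mb_s
print(f"write amplification: {write_amplification:.1f}x")  # -> 3.0x
```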
Write amplification is one of the purest examples of the perspective I’m talking about; instead of reporting transactions per second or a latency histogram, you’re measuring performance as “how much hardware capacity do I need to serve a given amount of application traffic?”
I like write amplification — among other reasons — because it gives you a single metric that is useful in all three of these domains:
- It’s useful at a very high level, for capacity planning and modeling, since it relates your application’s needs fairly directly to hardware needs.
- It’s concrete and measurable; you can run a benchmark with appropriate instrumentation and record it in practice.
- It’s comparatively easy to reason about theoretically; you can directly compare, for instance, LSMs and B-Tree storage engines, in terms of their expected amplification.
I think it’s really cool that by choosing the right metric, you can bridge fairly smoothly between such disparate domains and start to create a common language for reasoning about performance and capacity.
Throughput in cycles per byte
I’ve written before about the performance of Sorbet, the Ruby typechecker I helped build at Stripe. We tend to quote Sorbet’s performance as 100,000 lines per second per core, although of course the details vary with hardware and codebase.
Once, when talking to a coworker (a former database engineer, as it happens) about our performance, the engineer performed a translation that surprised me. After hearing our 100,000 lines/s/core headline number, they did a quick mental division, and replied “So, about 40,000 cycles per line.” With the knowledge that modern CPUs run at (rounding to a nice number) 4GHz, this is a simple unit conversion, but it’s one that had never occurred to me, and which I loved. For me, this conversion translates the impressive-sounding but context-less “lines per second” number into something that directly grounds out performance in our usage of the CPU, and helps evaluate whether or not it really is impressive. This conversion, for instance, makes it clear to me that achieving a 10x improvement to Sorbet’s batch throughput would be really hard; parsing, transforming, and typechecking a codebase in an average of only 4,000 cycles per line sounds extremely challenging.
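Written out, that conversion is just a change of units, assuming a nominal 4GHz clock (real clock speeds of course vary):

```python
clock_hz = 4e9              # assume a nominal 4 GHz core
lines_per_second = 100_000  # Sorbet's headline per-core throughput

cycles_per_line = clock_hz / lines_per_second
print(f"{cycles_per_line:,.0f} cycles per line")  # -> 40,000 cycles per line
```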
While this perspective was surprising to me in this context, after some reflection I realized it’s very familiar in others; in cryptography, for instance, the performance of hashing or encryption algorithms is very often reported in cycles per byte. In that case, we often really want to be able to encrypt or hash data just as fast as we can pull it off of a disk or the network, and so this measurement is used to analyze just how far we can push the hardware, and how well the fundamental operations of some algorithm fit the hardware’s capabilities.
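To get a rough feel for those budgets: if we wanted to hash data as fast as a hypothetical 1 GB/s disk or network link could deliver it, a 4GHz core has only a handful of cycles to spend on each byte. A minimal sketch, with made-up I/O numbers:

```python
clock_hz = 4e9             # again assuming a nominal 4 GHz core
io_bytes_per_second = 1e9  # a hypothetical 1 GB/s disk or network link

cycle_budget_per_byte = clock_hz / io_bytes_per_second
print(f"{cycle_budget_per_byte:.0f} cycles per byte")  # -> 4 cycles per byte
```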
Hash tables as percentage of fleetwide CPU
At CppCon 2017, Matt Kulukundis presented Google’s new C++ hash table, since released in Abseil as flat_hash_map and friends.
It’s a great talk — I recommend it if high-performance C++ or just low-level performance with a heavy dose of mechanical sympathy is your thing — but one slide in particular stuck with me. Matt presents various benchmarks of the new hash tables, but near the end he shows a slide breaking down the percentage of fleet-wide CPU spent on hash table operations at Google.
This measurement doesn’t have quite the same nature as the previous ones we’ve looked at, but I think there’s something very similar here; we’re zooming way out, taking some throughput-oriented averages, and looking at how much of our underlying hardware (in this case, the fraction of all computers at Google) we’re allocating to some operation. We’re also using this perspective to identify sources of overhead that are ripe for optimization. We know Google thinks routinely about performance in this way; their paper Profiling a warehouse-scale computer is another great example of this kind of analysis.
Some thoughts about this perspective
Ok, those are the concrete instances that stick in my mind, but now I want to muse about this idea of viewing performance in terms of efficient usage of underlying resources. Here’s a few thoughts in this direction.
An antidote to performance bloat
It’s a common observation that as computers get faster, software gets slower to compensate, and human-visible performance tends not to improve and often even degrades. When we view performance in terms of the underlying hardware, we treat every computer as having the same performance: 100% of its potential capacity. Measurements in this form are thus somewhat more robust and comparable over time as hardware improves, and help us think about making sure we’re actually getting more efficient over time, not just getting faster because our hardware has improved. SQLite, one of the few software projects that seems to get consistently faster with new releases, measures performance in precisely this way, by benchmarking “cycles per test case” on a simulated reference processor using Cachegrind.
Performance analysis for low-level libraries and APIs
We’re often taught to work on performance exclusively by profiling an application, and optimizing hotspots that are a significant percentage of total runtime. It was when working on the Linux kernel that I started to ask the question: How do you apply this advice to a low-level API like the Linux VFS?
The VFS, in some sense, has to aspire to be everything to everyone — any program accessing the filesystem on Linux has to go through it. So which application(s) do we benchmark to determine if the VFS is fast enough? The answer to “fraction of time spent in the kernel” will vary wildly depending on the application. Furthermore, there’s a Jevons paradox effect: if we make the VFS faster, developers will write more applications that directly use the filesystem (vs a database or other solution) as their data storage, and so we might see more time spent in the VFS, not less.
It’s not a complete answer, but one partial answer to this conundrum is to throw out the “percentage of application runtime” question for where to focus on optimization, and switch to “hardware utilization”; we can redefine our goal from “be a negligible fraction of our users’ runtime” to “run absolutely as efficiently as we can on the underlying CPU,” and build tools and approaches towards that goal.
I think this observation also helps explain why RocksDB talks about write amplification so much. As a low-level database library, Rocks doesn’t have a single application it can optimize for. Write amplification gives its developers that same kind of absolute target to focus on, one that makes sense regardless of what application is using Rocks, or how.
Speed limits
In a way that I think is common to the previous two points, thinking about performance in terms of hardware usage can be productive for finding “speed limits” — limits that articulate the absolute fastest something could hypothetically run — and then either working to approach them, or removing them by making different architectural choices.
For instance, for write amplification it’s clear that — barring compression — the best write amplification we could hope for is 1x: every byte must be written to disk. Looking at write amplification thus leads us to ask the question: could we write a useful database that achieves 1.0 write amplification in the limit? What tradeoffs would that entail? I’m not enough of a storage person to know exactly where this question leads, but I suspect it tends towards log-structured systems (including, not coincidentally, the LSM tree that is the basis of RocksDB).
For really micro-scale CPU optimization, Travis Downs has an amazing writeup of all the low-level speed limits that exist on modern superscalar x86 processors, and advice on how to observe which ones you’re hitting and optimize around them.
Capacity planning
If you’re doing capacity planning of any sort — from figuring out how many new servers to provision, to just estimating how close to capacity that one ancient database server in the corner is — you need to understand your system in terms of “percent utilization” and how that will scale as your use case scales.
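As a toy sketch of what I mean by “percent utilization” (all of the numbers here are invented):

```python
import math

# Invented numbers: a server that can sustain 50k queries/sec at 100% utilization.
server_capacity_qps = 50_000
current_traffic_qps = 30_000
expected_growth_per_year = 1.5  # say traffic grows 50% per year

utilization = current_traffic_qps / server_capacity_qps
print(f"current utilization: {utilization:.0%}")  # -> 60%

# At that growth rate, when does this server hit 100% of capacity?
years_until_full = math.log(1 / utilization) / math.log(expected_growth_per_year)
print(f"headroom runs out in about {years_until_full:.1f} years")
```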
As a younger engineer, for a long time I had a kind of half-formed idea that the goal for a system should be to completely use the available hardware to serve traffic — surely, otherwise you’re leaving performance or money on the table, right?
Simultaneously, I kept noticing that every time the mysqldump cron job ran on my database server, things got slow and we started dropping traffic. Surely we could just tune the I/O scheduler with ionice or something to make this work? Computers are fast; surely we have enough bandwidth to both serve traffic and take periodic backups.
Eventually, learning to think about systems in terms of a fixed bucket of “100% capacity,” which we can allocate to different processes but fundamentally can’t grow beyond, gave me, I think, a much better perspective for thinking about this kind of problem. Fundamentally, it can’t be the case — over sufficiently long intervals — that you both allocate 100% of resources to serving production traffic, and also leave capacity for background tasks (or serving unexpected traffic surges). It just doesn’t work that way.
(That said, we might be able to allocate multiple tasks — at different priorities — to machines such that we approach 100% utilization, but we have to accept that some of them may be temporarily descheduled if we need that capacity for something else).
Conclusion
That was a lot of words, and I’m still unsure if there’s a “there” there. Do you have any favorite examples of insights reached by zooming out and thinking about performance in terms of hardware efficiency? Do you think these examples and bullet points are all completely unrelated and I’m overgeneralizing in an unhelpful way? I’d love to hear from you either way!