Benchmarks That Aren't Your Friends
We’ve now talked twice about an important dimension of a benchmark: the openness of the loop. While there’s more subtlety to it, if what you take away is:
- open-loop is better for measuring latency, and
- closed-loop is better for measuring throughput,
then you’re not going to be in such a bad place.
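As a rough sketch of the difference (nothing here comes from any particular benchmark; `request()` just stands in for a query against the system under test, and the rates are arbitrary):

```go
package main

import (
	"sync"
	"time"
)

// Stand-in for a real query against the system under test.
func request() { time.Sleep(5 * time.Millisecond) }

// Open loop: requests arrive on a fixed schedule whether or not earlier
// requests have finished, so queueing delay shows up in measured latency.
func openLoop(interval time.Duration, total int) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		<-ticker.C
		wg.Add(1)
		go func() {
			defer wg.Done()
			request()
		}()
	}
	wg.Wait()
}

// Closed loop: a fixed pool of workers, each issuing its next request only
// after the previous one completes, which is a natural way to push throughput.
func closedLoop(workers, perWorker int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				request()
			}
		}()
	}
	wg.Wait()
}

func main() {
	openLoop(10*time.Millisecond, 100)
	closedLoop(8, 100)
}
```

The behavioural difference is the point: when the system slows down, the open-loop generator keeps issuing requests on schedule and they pile up, which is what real users do to you, while the closed-loop workers quietly back off.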
There are maybe three main roles in which someone might find themselves authoring a benchmark:
- they maintain a system and want to be able to track its performance over time, or to be able to state with some objectivity what its “performance” is,
- they are in the market for a system and want to compare several different possible choices, or,
- they are a member of a committee and are tasked with producing a standard benchmark that vendors can use and advertise.
The last position is what gave us benchmarks like TPC-C.
Designing a benchmark with the last goal in mind is pretty different from the other ones! There's a completely different set of constraints and requirements that don't exist at all in the first two.
What are the different design considerations in a standard benchmark like TPC-C? I can think of a few.
Definition
An internal benchmark can just be specified in code. The code is, in fact, the definition. And this is fine, because it's the only actual instantiation of that benchmark that will take place.
TPC-C, on the other hand, has a 132-page document carefully describing its specification, since each vendor will likely have to reimplement it themselves (at least if they want to tune it to match their own system's requirements for the best performance).
The specificity of the document is a consequence of this next point.
Collaboration Orientation
An internal benchmark is collaborative: its goal is to teach the person running it something about the performance of the system under test. It should have lots of knobs that allow the person running it to change the workload around to make things more or less difficult for the system. If the person running the benchmark "cheats" by doing something against its spirit, they're only really cheating themselves.
A standard benchmark is more adversarial: its goal is that external consumers of the benchmark have a high degree of confidence that a good result on the benchmark suggests actual good real-world performance.
This changes a lot. For example, the internal benchmark doesn't have to define much in terms of correctness criteria, because it isn't concerned with correctness: the correct behaviour is whatever the database already does. But an external benchmark needs to be as resilient as possible to gaming, so that readers who see a result on it know the vendor didn't sneak that result in through some back door.
Anecdotally, YCSB, another popular benchmark, lets you change the distribution from which keys are drawn. I heard a story once about a customer of a database vendor who was getting bad performance using the Zipfian distribution setting in YCSB. That distribution is sort of the point of the benchmark: it simulates hotspots and creates a lot of contention, compared to something like a uniform distribution, which spreads queries out across the dataset. A very smart sales engineer walked in and said "oh, here's your problem, you've got this configured wrong, you're using Zipfian—let's just change this to uniform." Performance got better, and the deal closed.
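For a sense of why that one toggle matters so much, here's a tiny sketch comparing how the two distributions spread requests across keys. It uses Go's standard-library Zipf generator with made-up parameters, not YCSB's own generator:

```go
package main

import (
	"fmt"
	"math/rand"
)

// hottest returns the request count of the most-hit key.
func hottest(counts []int) int {
	m := 0
	for _, c := range counts {
		if c > m {
			m = c
		}
	}
	return m
}

func main() {
	r := rand.New(rand.NewSource(1))
	const keys = 1000
	const requests = 100000

	// Skew parameters here are illustrative, not YCSB's defaults.
	zipf := rand.NewZipf(r, 1.1, 1, keys-1)

	zipfCounts := make([]int, keys)
	uniformCounts := make([]int, keys)
	for i := 0; i < requests; i++ {
		zipfCounts[zipf.Uint64()]++   // a few hot keys take most of the traffic
		uniformCounts[r.Intn(keys)]++ // load spread evenly across keys
	}

	fmt.Println("requests to the hottest key (zipfian):", hottest(zipfCounts))
	fmt.Println("requests to the hottest key (uniform):", hottest(uniformCounts))
}
```

Under the Zipfian setting, a handful of hot keys absorb a large share of all requests, which is exactly the contention the benchmark is supposed to exercise; under uniform, the load is spread thinly and the hotspot problem quietly disappears.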
Simplicity
When designing an internal benchmark, it's maybe not ideal, but it's okay if the outcome is complex to parse, because the author of the benchmark has the knowledge of the system under test needed to interpret it.
When at all possible, a standardized benchmark should reduce its result to a single number, so that consumers can do an apples-to-apples comparison. In TPC-C, this is the "tpmC," which stands for "transactions per minute (C)." It's a little more complicated than that, though.
Providing an easily interpretable number is hard, though! Deceptively so. Workloads are complicated and are often bound by different things for different systems. We don't want someone to be able to get a good result just by scaling up compute but not storage, or vice versa, for example. To solve this, TPC-C has one parameter, the number of warehouses, and it bounds the tpmC you can achieve as a function of that number. Each warehouse provides some number of closed-loop workers, and the math works out such that each warehouse allows you to achieve at most 12.8 tpmC. This is one of the most frequently misunderstood aspects of TPC-C, and it has led to embarrassing blog posts that required retraction ("look how much we beat this other database by! They only get 12.8 tpmC/warehouse!").
The point is that since each warehouse comes with some amount of data as well, you can't easily scale up the number of transactions you're running without also scaling up the amount of data you're managing. This is very clever! But it does require planting a stake in the ground about what a reasonable scaling ratio of compute vs. storage is.
Anyway. Because of this dynamic (once you fix the number of warehouses, tpmC is bounded above), TPC-C results are typically reported by picking a number of warehouses at which you can achieve the maximum tpmC, and then also reporting the cost of achieving it ("$/tpmC," or the dollar cost per tpmC), including all the hardware and software costs.
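Here's the arithmetic as a back-of-the-envelope sketch; the target tpmC and system cost below are made up purely for illustration:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// The per-warehouse ceiling described above.
	const tpmCPerWarehouse = 12.8

	// Hypothetical numbers, purely for illustration.
	targetTpmC := 1_000_000.0
	totalSystemCost := 5_000_000.0 // hardware + software, in dollars

	// To report the target tpmC you need at least this many warehouses,
	// each of which also brings its share of data to manage.
	warehouses := math.Ceil(targetTpmC / tpmCPerWarehouse)
	fmt.Printf("warehouses needed: %.0f\n", warehouses) // 78125

	// Price/performance: total cost divided by the reported tpmC.
	fmt.Printf("price/performance: $%.2f per tpmC\n", totalSystemCost/targetTpmC) // $5.00
}
```

So to advertise a bigger tpmC you have to take on more warehouses' worth of data, and the $/tpmC figure keeps everyone honest about what that costs.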
Buy-In
The TPC was rightly concerned with ensuring that all of the major database vendors at the time would be on board with TPC-C as a benchmark. If Microsoft or Oracle isn't using your benchmark, that sort of damages your credibility as any kind of "industry standard." As a result, some political considerations went into its construction.
End
This isn't even getting into some of the more nitty-gritty technical decisions made in TPC-C that make it a smart benchmark, but I think it's interesting to consider what kind of work goes into defining this sort of thing.