NULL BITMAP by Justin Jaffray
July 28, 2025

Giving Benchmarks a Boat


I've been thinking lately about this piece from Frank Lantz about the Thielian pitch of "the Olympics, but you take performance-enhancing drugs." The idea being "take the rate-limiters off and see how fast this baby (the human body) can go."

Lantz identifies this whole premise as arbitrary and morally suspect:

The point of swimming is not to go as fast as possible. It’s not even to go as fast as possible through the water. If you wanted to use the power of science and innovation to enhance a human’s ability to go as fast as possible through the water, you wouldn’t give them steroids, you would give them a boat.

"Sports," or more specifically, the rules of sports, things like "you are not allowed to use performance-enhancing drugs" define criteria that allows for competition. There are of course considerations for how you might design those criteria, but one major one is "do we, as observers, agree that these skills are important things to be measuring." Part of why we care about the results of an Olympic swimming competition is that we think it's cool or important for people to swim as fast as possible. In his piece, Frank lays out why a more "juiced" competition does not lead to a better competition.

TPC-C is a benchmark for transactional databases that models an order system with goods spanning many warehouses.

TPC-C is a particular kind of benchmark, a standard benchmark, constructed with the goal of letting vendors advertise their results in a standardized way. This gives potential customers confidence that the benchmark being used has been vetted, and hopefully means that good performance on the benchmark implies good performance on their own workload.

Many dimensions matter for workloads, which makes them hard to compare apples-to-apples, for example:

  • what's the proportion of hot vs. cold data,
  • what's the ratio of transactions-per-minute to bytes stored,
  • what's the proportion of read-only queries vs. simple point writes vs. complex read-write transactions,
  • what proportion of transactions go cross-shard,
  • and many more things.

It would be impossible, or at least inadvisable, to make a benchmark that had knobs for all of these things. Interpreting the results would be a mess, and if vendors wanted their results to be comparable, they'd have to agree on a particular configuration anyway.

TPC-C solves this by planting a stake in the ground and having a single parameter that determines all of these things, which is the number of warehouses. Each warehouse comes with some set amount of data to be managed and some amount of query traffic that adheres to some specific distribution laid out in the spec.

It's easy to gloss over this, so let's make sure we understand it: each warehouse comes with data and query load. You cannot get more data without getting more query load, and you cannot get more query load without getting more data. The space of legal (as defined by TPC-C) combinations of data and query load is a line:

[figure: the line of legal (data stored, query load) pairs]

This curve is a decision made by the TPC to specify what a "reasonable" trade-off between these things is, and this curve is part of what makes TPC-C "TPC-C." TPC-C is "well-respected" because people generally like this decision (well, at least, database vendors do), and by standardizing it, the TPC gives said vendors the ability to borrow its authority by saying "we followed all the rules and got this number." This is a big part of the value in standard benchmarks.

If you live below this line, you are not keeping up with the query load, and if you live above it you are turning knobs that TPC-C does not permit you to turn. Points above the curve are not "TPC-C." What they measure might in fact be quite different from TPC-C, because you are throwing away a large number of the decisions made by TPC-C. This is inventing a new sport that is not endorsed by anyone.
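
To make the shape of that line concrete, here is a minimal sketch in Python. It is not from the spec text itself: the 12.86 tpmC-per-warehouse ceiling is the spec's limit on New-Order throughput, and the ~100 MB of initial data per warehouse is a commonly quoted ballpark rather than an exact figure.

```python
# A rough sketch of the TPC-C "line": the warehouse count alone determines
# both how much data you manage and the maximum throughput you may report.
# 12.86 tpmC per warehouse follows from the spec's required keying and think
# times; ~100 MB of initial data per warehouse is an approximate ballpark.

MAX_TPMC_PER_WAREHOUSE = 12.86   # max New-Order transactions per minute, per warehouse
APPROX_MB_PER_WAREHOUSE = 100    # rough initial data volume per warehouse

def tpcc_point(warehouses: int) -> tuple[float, float]:
    """Return (approximate data volume in GB, maximum reportable tpmC)."""
    data_gb = warehouses * APPROX_MB_PER_WAREHOUSE / 1024
    max_tpmc = warehouses * MAX_TPMC_PER_WAREHOUSE
    return data_gb, max_tpmc

print(tpcc_point(1_000))   # roughly (98 GB, 12,860 tpmC)
print(tpcc_point(5_000))   # roughly (488 GB, 64,300 tpmC)
```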

You could say, well, they're doing more work, we're taking the limiters off, seeing what this baby can really do. But that's not the point. The point is that the designers of TPC-C decided it was important to maintain a particular ratio between query load and data stored: it was important to require vendors to scale up the amount of data they were managing if they wanted to report a higher transactional load. Once you're sufficiently far from the space of legal workloads, you're stressing different parts of the system than TPC-C was designed to stress. There is a long and storied history of vendors doing this by accident, sometimes even with comparisons to benchmark runs that do follow the spec.

The metric that is sensible to report with TPC-C is basically "we sustained the workload at X warehouses with Y hardware at Z cost."

But maybe this is fine: maybe your new benchmark actually is representative of real workloads. Maybe it's even more representative; that could be true. It seems likely to me that the design of the original benchmark was heavily biased by Oracle putting its thumb on the scale. But you are borrowing the TPC's authority while using a benchmark that they do not, and never have, endorsed.

Anyway:

[images]

An unspecified vendor has once again claimed to be using TPC-C when they were definitely not:

[image]

(they have since changed this language, but the original is still visible on the Wayback Machine, and in my opinion the current verbiage is still misleading:)

[image]

I do not attribute any malice or intent to mislead here: it's easy to tell that their departure from TPC-C was accidental, because they are using the Percona Sysbench TPC-C scripts, which are notorious for misleading people in this way (the README has since been updated after prodding from Alex Miller, so maybe this will happen less often going forward). I'm not writing this to call them out, because this mistake is SO common and easy to make that it's hard to hold someone accountable for it. Database enthusiasts need to know this stuff so the mistake can stop being made.

Even if they hadn't cited their source, it would be easy to tell that they were not doing TPC-C correctly, because they use 5,000 warehouses (which, for some reason, the Percona scripts let you shard to prevent cross-warehouse transactions) and claim around 19,000 transactions per second, which translates to about 1,140,000 transactions per minute. The maximum permitted number of transactions per minute per warehouse in TPC-C is around 12.8, which means that for 5,000 warehouses the maximum throughput that can be claimed is around 64,000, and this vendor's numbers are firmly outside of that.
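
Spelling the arithmetic out as a quick sanity check in Python (using 12.86 tpmC per warehouse, the spec's ceiling, which rounds to the 12.8 above):

```python
# The back-of-the-envelope check from the paragraph above, spelled out.
warehouses = 5_000
reported_tps = 19_000              # the claimed transactions per second

reported_tpm = reported_tps * 60   # -> 1,140,000 transactions per minute
max_tpm = warehouses * 12.86       # -> ~64,300 tpmC permitted by the spec

print(reported_tpm, max_tpm)       # 1140000 64300.0
print(reported_tpm / max_tpm)      # ~17.7x over the ceiling
```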

I've seen this benchmark defended on the grounds that they ran the same load on all of the databases they tested, but I think this is irrelevant if you're going to suggest that you are using some vetted, agreed-upon ruleset when you are actually making up your own rules.

To return to the Lantz post:

Sports thrive when their edges are clear, when the space of competition is well-defined, like a well-tuned scientific instrument whose results can be trusted to accurately reflect the property being measured.
