
NULL BITMAP by Justin Jaffray

October 20, 2025

The Death of Thread Per Core


Programming language async runtimes are focused on handling asynchronous, possibly long-running tasks that might yield for a variety of reasons and that might themselves spawn future work.

In an async runtime like async Rust, the model is that a task can yield, which, conceptually, creates a new piece of work that gets shoved onto the work queues (namely, "resume that task"). You might not think of it as "this task is suspended and will be resumed later" so much as "this piece of work is done and has spawned a new piece of work." This new piece of work gets pushed onto a local queue for later processing by the same thread. The primary distinction between thread-per-core approaches and work-stealing approaches is that in work-stealing models, if one thread doesn't have enough work to do, it can "steal" a task from another thread's queue and move it over to its own.
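As a toy illustration of that distinction (a minimal std-only sketch, not any real runtime's implementation), here's the shape of a work-stealing scheduler's dispatch, with one queue per worker:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// A piece of work. Note the `Send` bound: anything that might be
// stolen has to be movable across threads.
type Task = Box<dyn FnOnce() + Send>;

struct Scheduler {
    // One queue per worker thread.
    queues: Vec<Arc<Mutex<VecDeque<Task>>>>,
}

impl Scheduler {
    fn next_task(&self, me: usize) -> Option<Task> {
        // Prefer local work: this is the path both models share.
        if let Some(t) = self.queues[me].lock().unwrap().pop_front() {
            return Some(t);
        }
        // Work stealing: scan other workers' queues. In a strict
        // thread-per-core runtime this loop doesn't exist, and an
        // empty local queue just means an idle core.
        for (i, q) in self.queues.iter().enumerate() {
            if i != me {
                if let Some(t) = q.lock().unwrap().pop_back() {
                    return Some(t);
                }
            }
        }
        None
    }
}
```

Real runtimes use lock-free deques (e.g. the Chase-Lev deque in crossbeam-deque) rather than a mutex per queue, but the structure is the same.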

This has several immediate consequences:

  • It has to be okay to move those pieces of work across thread boundaries. This is the source of Rust programmers' frustration with their futures having to be Send (made concrete in the sketch after this list).
  • Work can be more evenly balanced. If stealing isn't allowed, then there might be a thread, or a handful of threads, with a long work queue while all the others (and their associated CPU cores) sit idle. Stealing is an elegant solution to that problem.
  • If any task can be stolen by any other thread, you lose certain locality guarantees: if you know stealing isn't allowed, and two tasks both operate on similar data, you might hope that they can benefit from sharing cache lines.
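To make the first consequence concrete: on tokio's default multi-threaded runtime (which is work-stealing), spawning a future that holds a non-Send value across an .await point fails to compile. A minimal sketch, assuming the tokio crate:

```rust
use std::rc::Rc;

// `Rc` is not `Send`, so a future that holds one across an `.await`
// point is itself not `Send`.
async fn uses_rc() {
    let data = Rc::new(5);
    tokio::task::yield_now().await; // `data` is live across this yield
    println!("{data}");
}

#[tokio::main]
async fn main() {
    // On the default multi-threaded runtime this line fails to compile,
    // because `tokio::spawn` requires `Send` futures: the task might be
    // stolen and resumed on a different worker thread.
    // tokio::spawn(uses_rc());

    // Awaiting the future in place never crosses a thread boundary,
    // so no `Send` bound applies.
    uses_rc().await;
}
```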

In the data processing world, for a couple of years there it seemed like the needle had firmly swung in the direction of thread-per-core. Yes, of course you should partition your data across threads: cross-core data movement is the only enemy! Of course skewed data is a problem to be solved at a higher level; the data processing layer is optimized to scream through all the data you give it, so needing to be friendly in how you dish that work out is a small price to pay.

If your keys are basically random, this is great: the benefits are real, data tends to stay in cache, you don't have slow MESI traffic creating contention, and implementation is often dramatically simplified by restricting parallelism to very specific points in the code.
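The static partitioning itself is usually just a hash on the key. A minimal sketch (the function name is mine):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Static partitioning: every key deterministically belongs to one
// worker, so that worker's core keeps the relevant state in its cache.
fn worker_for(key: &str, n_workers: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % n_workers
}

fn main() {
    // With roughly uniform keys, work spreads evenly across workers...
    println!("{}", worker_for("user:42", 8));
    // ...but every occurrence of a hot key lands on the same worker,
    // which is exactly the skew problem discussed below.
    println!("{}", worker_for("user:hot", 8));
}
```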

Except it seems like there's been an increase in dissenters over the last several years: actually, maybe the data processing layer can be the cleanest place to put dynamic reshuffling work. The paper on Morsel-Driven Parallelism proposes some reasons why this kind of exchange-focused parallelism might no longer be the best model:

  • Increasing core counts on high-end machines mean that improperly handling skewed data distributions is more painful.
  • Many traditional bottlenecks, like IO latency, have improved massively since the days when Exchange was state-of-the-art. At that time, ensuring maximum CPU utilization was not so important, since you'd typically be bound by other things, but things like disk speed have improved dramatically in the last 10 years while CPU speeds have not.
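For contrast, the morsel-driven idea is to hand out work dynamically: input gets carved into small fixed-size ranges ("morsels") that worker threads claim as they go, so no up-front partitioning decision has to be right. A minimal sketch of the dispatch side (the names and morsel size here are mine, not the paper's):

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};

// Rows are handed out in small fixed-size chunks.
const MORSEL_SIZE: usize = 10_000;

struct MorselSource {
    next: AtomicUsize,
    len: usize,
}

impl MorselSource {
    // Workers call this in a loop until it returns None. Claiming a
    // morsel is a single atomic add, so a skewed or slow morsel delays
    // only the worker that claimed it; the rest keep pulling work.
    fn claim(&self) -> Option<Range<usize>> {
        let start = self.next.fetch_add(MORSEL_SIZE, Ordering::Relaxed);
        if start >= self.len {
            return None;
        }
        Some(start..(start + MORSEL_SIZE).min(self.len))
    }
}
```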

The debate over concurrency models for data processing is a bit different from the one for programming language async runtimes, I think. We have far fewer heterogeneous types of work; we, as the scheduler, can do a lot more predictive introspection about what data a piece of work is likely to need; and we can manipulate tasks algebraically, even merging them or splitting them up.
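That last point is the interesting one: a query engine's "task" is often just a description of data, like a key range, which the scheduler is free to rewrite. A toy sketch (the names are mine, for illustration):

```rust
#[derive(Debug)]
struct ScanTask {
    lo: u64, // inclusive start of the key range
    hi: u64, // exclusive end of the key range
}

impl ScanTask {
    // Rebalancing: split one oversized task into two smaller ones.
    // An async runtime can't do this to an opaque future.
    fn split(self) -> (ScanTask, ScanTask) {
        let mid = self.lo + (self.hi - self.lo) / 2;
        (
            ScanTask { lo: self.lo, hi: mid },
            ScanTask { lo: mid, hi: self.hi },
        )
    }

    // Merging is just as easy when two tasks cover adjacent ranges.
    fn merge(self, other: ScanTask) -> ScanTask {
        assert_eq!(self.hi, other.lo, "ranges must be adjacent");
        ScanTask { lo: self.lo, hi: other.hi }
    }
}
```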

This sort of freedom is, I think, another big reason why shared-state concurrency has once again become popular. If you're implementing a query engine, you simply have more insight into the type of work you're going to do, which lets your scheduler make smarter decisions.

On top of all those things, I think another big reason is cultural: the more your data systems scale and need to handle things like multitenancy effectively, the more prone you are to skew that you have very little control over. "Solve the skew problem a layer up" is not a particularly effective strategy for certain levels of scale, and you need to just bite the bullet and have systems that have that kind of elasticity built into them directly.
