Tracing JITs and coverage-guided fuzzers
By happenstance, I am friends with a handful of engineers who have spent substantial amounts of time working both on the PyPy project and on fuzzing a lot of C and C++ software. This post is an attempt to capture an observation we’ve made about a surprising similarity between those systems.
Coverage-guided fuzzers
Coverage-guided fuzzers, of a lineage pioneered by AFL and now widely branching out into reimplementations and variants, attempt to mostly-automatically generate “interesting” inputs to a program, most typically with the goal of finding potentially-exploitable bugs in memory-unsafe languages. Given zero or more “seed” files, they randomly mutate the files, and preserve inputs that generate new behaviors in the system under test, measuring “new behaviors” using something akin to branch coverage. This causes them to evolve towards inputs that exercise more and more code, increasing the likelihood of triggering bugs.
In order to keep the state space tractable and to avoid rabbit-holing, these tools “flatten” branch coverage into some small summary that they evaluate; e.g. instead of looking at a complete program trace, AFL only records bucketized counts of how many times each branch was taken.
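To make that flattening concrete, here is a minimal sketch of the kind of coverage summary an AFL-style fuzzer maintains. This is illustrative C++ written in the spirit of AFL’s design, not its actual code; the map size, bucket boundaries, and function names are all invented:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    constexpr std::size_t kMapSize = 1 << 16;
    std::array<uint8_t, kMapSize> coverage_map{};
    uint32_t prev_location = 0;

    // Instrumentation at each basic block: hash the (previous, current) block
    // pair into a slot and bump its hit counter. The shift keeps the edge
    // A->B distinct from B->A.
    void on_basic_block(uint32_t cur_location) {
        coverage_map[(cur_location ^ prev_location) % kMapSize]++;
        prev_location = cur_location >> 1;
    }

    // Hit counts are flattened into coarse buckets (1, 2, 3, 4-7, 8-15, ...), so
    // "this branch ran 1000 times" and "this branch ran 2000 times" look identical.
    uint8_t bucket_bit(uint8_t hits) {
        if (hits == 0) return 0;
        if (hits == 1) return 1 << 0;
        if (hits == 2) return 1 << 1;
        if (hits == 3) return 1 << 2;
        if (hits <= 7) return 1 << 3;
        if (hits <= 15) return 1 << 4;
        if (hits <= 31) return 1 << 5;
        if (hits <= 127) return 1 << 6;
        return 1 << 7;
    }

    // The fuzzer compares the map from one run against the cumulative map of
    // everything seen so far, and keeps the input only if some edge landed in
    // a hit-count bucket it has never observed before.
    bool is_interesting(const std::array<uint8_t, kMapSize>& run,
                        std::array<uint8_t, kMapSize>& seen) {
        bool new_behavior = false;
        for (std::size_t i = 0; i < kMapSize; i++) {
            uint8_t b = bucket_bit(run[i]);
            if (b != 0 && (seen[i] & b) == 0) {
                seen[i] |= b;
                new_behavior = true;
            }
        }
        return new_behavior;
    }

The key property is lossiness: an input is only saved if it flips some edge into a bucket the fuzzer has never seen, so most raw detail about the execution is deliberately thrown away.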
Tracing JITs
Tracing just-in-time compilers are a family of JIT compilers that focus primarily on loop optimization. The idea of a tracing JIT is to record a “trace” of the VM instructions that are executed as the virtual machine interprets a program. When a backwards jump (a jump to a previously-executed program counter) is reached some threshold number of times, the instructions between the previous instance of that PC value and the present moment are presumed to be a loop, which is then compiled based on the trace. In the most basic version, the traced instructions are compiled linearly, and any conditionals or type comparisons that were performed are transformed into “guards,” which will abort the trace back into the interpreter if they produce different outcomes on future executions.
Tracing JITs can be fairly simple, since (at their most basic) they only have to compile a concrete linear chain of instructions and guards, and don’t need complex flow-control analysis or other optimizations. They can also be fairly effective for broad classes of programs; LuaJIT, among others, is a widely deployed tracing JIT.
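For a sense of the mechanics, here is a minimal sketch of the hot-loop detection described above. It is a hand-written illustration, not any particular JIT’s implementation; the threshold, the names, and the stubbed-out Trace type are all assumptions:

    #include <cstdint>
    #include <unordered_map>

    // The recorded trace and its compilation are stubbed out here; a real
    // tracing JIT would record the executed operations and emit machine code.
    struct Trace {};
    Trace* record_and_compile(uint32_t loop_start_pc) { return new Trace{}; }

    constexpr int kHotThreshold = 1000;  // arbitrary "hotness" threshold
    std::unordered_map<uint32_t, int> backward_jump_counts;
    std::unordered_map<uint32_t, Trace*> compiled_traces;

    // Called by the interpreter whenever it takes a jump to an earlier pc.
    // Returns a compiled trace to run, or nullptr to keep interpreting.
    Trace* on_backward_jump(uint32_t target_pc) {
        if (Trace* existing = compiled_traces[target_pc]) {
            return existing;  // we already compiled a loop that starts here
        }
        if (++backward_jump_counts[target_pc] >= kHotThreshold) {
            // Presume everything from target_pc back to this jump is a loop:
            // record one iteration, turning each conditional into a guard,
            // and compile the result.
            compiled_traces[target_pc] = record_and_compile(target_pc);
            return compiled_traces[target_pc];
        }
        return nullptr;
    }

The interpreter would call on_backward_jump from its jump handler and, whenever it gets a trace back, run the compiled code until a guard fails.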
The problem with interpreters
These systems are pretty different at first glance, both in purpose and in implementation. It was therefore a bit surprising to continually find ourselves remarking on a core similarity between them! In particular, there is a rich class of programs on which both systems struggle to achieve their goals (effective exploration for fuzzers, performance improvements for tracing JITs): interpreters, and similar programs that “look like” interpreters.
Let’s take a quick look at how and why they fail. Consider a prototypical interpreter loop, which looks something like the following:
while (true) {
    // Fetch and decode the next instruction, then dispatch on its opcode.
    auto instruction = decode(memory[pc]);
    pc++;
    switch (instruction.opcode) {
    case OP_ADD:
        registers[instruction.dest] = registers[instruction.src1] + registers[instruction.src2];
        break;
    // …
    case OP_LOAD:
        registers[instruction.dest] = memory[instruction.src1];
        break;
    case OP_JMP:
        pc += instruction.immediate;
        break;
    // Other instructions
    }
}
The core observation here is that the interpreter consists of a continually-executed loop, which has a “dispatch” (the switch statement, here) to a different opcode implementation depending on the instruction to be executed.
Both tracing JITs and AFL-family fuzzers look at the program counter as an indication of progress through the program under analysis. However, for an interpreter, the patterns of the interpreter’s program counter are relatively unrelated to what the interpreted program is doing. AFL — looking solely at branches — will see a set of branches from the switch statement dispatch into the individual opcode implementations, and branches from the opcodes back to the top of the dispatch loop. It will therefore end up generating inputs with a de facto goal of something like “executing as many different VM opcodes as possible” — not totally useless, but also unlikely to generate particularly interesting sequences of bytecode, or to generate interesting inputs to a fixed bytecode program.
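Here is a toy model of that collapse, ignoring hit counts and abstracting the interpreter’s basic blocks into names rather than using real instrumentation: two rather different bytecode programs that touch the same opcodes produce exactly the same set of edges through the dispatch loop, and so look identical to the fuzzer.

    #include <cstdio>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // Model each of the interpreter's blocks by name and collect the (from, to)
    // edges that AFL-style coverage would observe while running a guest program.
    using Edge = std::pair<std::string, std::string>;

    std::set<Edge> edges_for(const std::vector<std::string>& bytecode) {
        std::set<Edge> edges;
        for (const auto& opcode : bytecode) {
            edges.insert({"dispatch", "handler_" + opcode});  // switch -> opcode body
            edges.insert({"handler_" + opcode, "dispatch"});  // back to top of loop
        }
        return edges;
    }

    int main() {
        // Two very different guest programs that happen to use the same opcodes...
        std::vector<std::string> program_a = {"LOAD", "ADD", "JMP"};
        std::vector<std::string> program_b = {"ADD", "ADD", "LOAD", "JMP", "LOAD", "ADD"};
        // ...produce exactly the same edge set, so the fuzzer sees no difference.
        printf("same coverage: %s\n",
               edges_for(program_a) == edges_for(program_b) ? "yes" : "no");
        return 0;
    }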
Similarly, a basic tracing JIT will see the interpreter loop become hot and will attempt to compile a single trace through it. However, traces through the loop vary wildly from one iteration to the next, depending on which bytecode instruction is encountered. Thus, a trace of one bytecode operation will abort very quickly when presented with another. Depending on the details of the implementation, a tracing JIT may end up giving up entirely, continually compiling and aborting, or otherwise thrashing in various ways.
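As an illustration, here is a rough sketch of what a trace compiled from a single OP_ADD iteration of the loop above might look like, written as ordinary C++ rather than machine code. The types are hypothetical stand-ins for the interpreter sketched earlier, simplified so that memory holds pre-decoded instructions:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal stand-ins for the interpreter state sketched above (all hypothetical).
    enum Opcode { OP_ADD, OP_LOAD, OP_JMP };
    struct Instruction { Opcode opcode; int dest = 0, src1 = 0, src2 = 0; };
    std::vector<Instruction> memory;     // pretend memory holds pre-decoded instructions
    std::vector<int64_t> registers(16);
    std::size_t pc = 0;

    enum TraceExit { CONTINUE_TRACE, ABORT_TO_INTERPRETER };

    // What a trace recorded from one OP_ADD iteration of the dispatch loop might
    // look like: the comparison the tracer observed becomes a guard.
    TraceExit run_add_trace() {
        Instruction instruction = memory[pc];
        if (instruction.opcode != OP_ADD) {
            return ABORT_TO_INTERPRETER;  // guard failed: a different opcode showed up
        }
        pc++;
        registers[instruction.dest] =
            registers[instruction.src1] + registers[instruction.src2];
        return CONTINUE_TRACE;            // loop back to the top of the trace
    }

As soon as the next dispatched instruction is anything other than OP_ADD, the guard fails and control returns to the interpreter, which is exactly what makes interpreter loops such poor tracing targets.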
In both cases, the basic problem is that, for an interpreter, the “real” state of interest is stored in the interpreted program counter, but these program analysis tools are stuck looking at the physical program counter, which contains relatively little actual information.
Other data-driven applications
Interpreters are perhaps the most classic or most extreme flavor of systems that are hard for these kinds of program analysis tools to handle, but they aren’t the only ones. Any system that is sufficiently data-centric can have a similar failure mode where the physical PC doesn’t contain enough information about the “real” state. These can include:
- State-machine-based decoders or parsers (e.g. this UTF-8 decoder)
- Parsers that consume input a character or token at a time and dispatch in some way (e.g. most regular expression engines pose a challenge for these systems)
- Tree walkers that recursively walk a tree and switch on the node type
- Event loops that process incoming events and dispatch them to various handlers
I tend to say that these kinds of systems are “essentially interpreters” or “look like interpreters” because they have a similar nature of a dispatch loop that is heavily data-driven. Any of these can pose difficulties for AFL to explore effectively, or for a tracing JIT to effectively optimize.
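For concreteness, here is a small, hypothetical example of one of the flavors above: a tree walker that evaluates an arithmetic expression tree. Structurally it is an interpreter; the dispatch is a switch on the node type, and the interesting state lives in the tree rather than in the walker’s control flow.

    #include <memory>
    #include <vector>

    // A tiny expression-tree evaluator: the "program" is the tree, and the
    // walker just dispatches on the node type, like an opcode dispatch.
    enum NodeType { CONSTANT, ADD, MULTIPLY, NEGATE };

    struct Node {
        NodeType type;
        long value = 0;                                // used by CONSTANT
        std::vector<std::unique_ptr<Node>> children;   // used by the operators
    };

    long evaluate(const Node& node) {
        switch (node.type) {  // the "dispatch": same shape as the opcode switch
        case CONSTANT:
            return node.value;
        case ADD: {
            long sum = 0;
            for (const auto& child : node.children) sum += evaluate(*child);
            return sum;
        }
        case MULTIPLY: {
            long product = 1;
            for (const auto& child : node.children) product *= evaluate(*child);
            return product;
        }
        case NEGATE:
            return -evaluate(*node.children.at(0));
        }
        return 0;  // unreachable for well-formed nodes
    }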
Promoting the interpreted PC
Once we recognize that the problem is one of looking at the physical PC even though all the “interesting” state is stored elsewhere, one obvious line of attack is to somehow surface the “logical” PC to the analysis system. And indeed, this turns out to be a fruitful area of research in both systems:
- PyPy (inspired in part by DynamoRIO before them) implements annotations that allow an interpreter to tell the PyPy JIT about the logical program counter and other state of the interpreted program. This allows them to trace loops in the interpreted language, spanning multiple iterations of the opcode dispatch loop. You can read one of their earliest blog posts about the technique for further explanation.
- The IJON fuzzer adds an API for annotations to the program being analyzed, which allows developers to surface additional pieces of state into the coverage map, allowing the fuzzer visibility into the logical state. They apply this interface successfully to numerous examples, including a maze-solver and an event-based protocol handler, both instances of the broader class of “interpreter-like” programs I cited above.
While the details of how you lift state from the interpreted program into the fuzzer or JIT are quite different, I find the core ideas here remarkably similar.
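To make the fuzzing side of that lifting concrete, here is a hand-rolled sketch of the underlying trick: mix the interpreted program counter into the coverage location, so that the same physical dispatch branch taken at two different virtual PCs registers as two different behaviors. This is illustrative code of my own, not IJON’s actual annotation API (IJON exposes the idea through annotation macros):

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // The same sort of AFL-style map as in the earlier sketch; defined here so
    // this snippet stands alone.
    constexpr std::size_t kCoverageMapSize = 1 << 16;
    std::array<uint8_t, kCoverageMapSize> coverage_map{};

    // Instead of only recording "the dispatch branch fired" (which it does on
    // every iteration), record "the dispatch branch fired while the *guest*
    // program was at virtual_pc, executing opcode". The interpreter would call
    // this right before its switch, e.g. record_dispatch(pc, instruction.opcode).
    void record_dispatch(uint32_t virtual_pc, uint32_t opcode) {
        uint32_t location = (virtual_pc * 2654435761u) ^ opcode;  // cheap mixing hash
        coverage_map[location % kCoverageMapSize]++;
    }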
Threaded interpreters
As I was writing this up and thinking about it, I realized there’s even a bit of an analogy to another challenge that interpreters face. The results of instruction dispatch vary wildly between executions, and so hardware branch predictors struggle to predict the destination. That unpredictable branch can be mispredicted on most passes through the interpreter loop, leading to frequent pipeline stalls and low IPC.
One classic interpreter optimization — used by CPython, Ruby, and many other popular interpreters — is to use “threaded interpretation.” In this model, the outer loop and the switch statement are replaced with a small epilogue at the end of each instruction’s implementation, which decodes the following opcode and jumps directly to the handler for that opcode (typically using a jump table). This results in a separate dispatch jump at the end of each opcode handler, giving the branch predictor greater visibility into the structure of the interpreter. In particular, if opcodes often appear in the same sequences (and they often will in real code, since the compiler will tend to emit similar patterns), the branch predictor has a much better shot at predicting those jumps.
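Here is a minimal sketch of what threaded dispatch can look like, using the “labels as values” (computed goto) extension supported by GCC and Clang rather than standard C++. The three-opcode machine is a toy of my own invention, not any real interpreter’s code:

    #include <cstdint>
    #include <cstdio>

    int64_t run(const uint8_t* code) {
        // One entry per opcode: OP_HALT = 0, OP_INC = 1, OP_DOUBLE = 2 (hypothetical ISA).
        static void* dispatch_table[] = { &&do_halt, &&do_inc, &&do_double };
        int64_t acc = 0;
        const uint8_t* pc = code;

        #define DISPATCH() goto *dispatch_table[*pc++]
        DISPATCH();  // jump to the first opcode's handler

    do_inc:
        acc += 1;
        DISPATCH();  // one indirect jump per handler: more context for the predictor
    do_double:
        acc *= 2;
        DISPATCH();
    do_halt:
        return acc;
        #undef DISPATCH
    }

    int main() {
        const uint8_t program[] = {1, 1, 2, 0};  // inc, inc, double, halt => 4
        printf("%lld\n", (long long)run(program));
    }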
If we squint, this feels like a problem similar to the issues above with tracing JITs and coverage-guided fuzzers; the hardware makes certain assumptions about how the instruction pointer behaves (namely, that it generally moves in repeated or at least predictable patterns), and interpreters violate that assumption by being so data-driven. Threaded code, among other advantages, surfaces a bit more of the structure of the logical program to the hardware, somewhat alleviating the problem. I haven’t run any experiments, but it seems likely to me that threaded interpreters also fare a bit better under AFL or inside of a tracing JIT.
Going further, we can consider the most basic version of a JIT, one which just inlines the body of each opcode implementation one after another, with no further optimization passes (e.g. no type speculation or persistently keeping interpreter state in machine registers). One way to view such a JIT is as precisely the kind of “lifting” we described earlier: it does not execute enormously fewer instructions than the interpreter, but it does “lift” the interpreted PC into the hardware PC, in a way that allows branch prediction and the hardware’s related predictors and optimizations to “see” the emulated PC and thereby work more effectively.
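As a toy illustration of that lifting, here is what the output of such a JIT could look like for one small, hypothetical guest program, written out by hand as C++ instead of emitted machine code:

    #include <cstdint>
    #include <cstdio>

    // Hand-written stand-in for what a simple inlining JIT might emit for one
    // specific, hypothetical guest program:
    //   0: r = 0
    //   1: i = 10
    //   2: r = r + i
    //   3: i = i - 1
    //   4: if i != 0, jump to 2
    //   5: halt, returning r
    // Each guest instruction now sits at its own place in the host code, so the
    // guest's backward jump becomes one ordinary, predictable host branch instead
    // of another trip through a shared, unpredictable dispatch.
    int64_t compiled_program() {
        int64_t r = 0;                    // guest pc 0
        int64_t i = 10;                   // guest pc 1
    guest_pc_2:
        r += i;                           // guest pc 2
        i -= 1;                           // guest pc 3
        if (i != 0) goto guest_pc_2;      // guest pc 4
        return r;                         // guest pc 5
    }

    int main() {
        printf("%lld\n", (long long)compiled_program());  // prints 55
    }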
Closing thoughts
As is often the case with these newsletters, I don’t have a specific conclusion; I just found the parallels here interesting and wanted to write them up somewhere. If this is a phenomenon with a well-known name or pre-existing work, I’d love any pointers.
One thought is that I find this an interesting instance of leaky abstractions and special-casing. Hardware CPUs and language VMs at some level “claim” to be general-purpose abstractions that consume streams of instructions and execute them, without caring too much what that stream of instructions is. However, for performance and other reasons, we impose assumptions about the nature of those instruction streams, and build in special cases to optimize well-known patterns, heuristics to guide program analysis tools, and so on. Code that violates these assumptions still executes, but suffers poor performance or other downsides.
We tend to design these assumptions so that they work well for a broad base of existing software. However, once they get built and entrenched, our reliance on them grows deeper, and software that doesn’t fit the “normal” or common case grows increasingly disadvantaged. Over time, this sort of process can produce a “herding” phenomenon, in which we prune off once-promising avenues of exploration or system design: even though they might still function at some level, we have so many baked-in assumptions favoring systems that look like the dominant ones that the alternate paths become unviable.
Finally, I’m fascinated by how consistently challenging interpreters are for analysis and other tools to work with. They’re clearly “different” from “normal” code in deep ways, and I’m curious whether there are theoretical models or frameworks that articulate that different-ness crisply. Ideally, such models would also give a neat way to characterize the “interpreter-like” classes of programs I alluded to earlier. I’d love any pointers to prior art or other work in this space!