Profiling llama
I wrote a few months back about llama, my experimental project for executing shell commands in Amazon Lambda. I briefly previewed llamacc, the GCC-compatible wrapper that allows for building C and C++ software using Lambda to outsource the compute, in the style of distcc.
Since then, I’ve done a bunch of work on llamacc performance, and have reached the point where it can achieve at least a 5x speedup (10 minutes → 2 minutes) building the core of LLVM compared to my Ryzen 3900 desktop — already quite a beefy machine. Today I want to share some notes on how I went about even understanding the performance of this system, since it was (and continues to be) a pretty interesting problem.
The complexity of llamacc
A build using llamacc invokes what is, frankly, an overwhelmingly complex distributed system.
You run make -j <big number> to run many compile processes at once. Each compile process invokes llamacc, which does a local RPC to the llama daemon¹. That daemon uploads the source .c file and any headers it relies on to S3, and then calls Amazon Lambda to invoke the compilation function. Behind the scenes, Lambda keeps some number of function images “warm” (initialized but suspended, able to quickly handle new requests), and also launches new functions on-demand as needed. Your request will be scheduled to one of those functions, which will itself talk to S3 to download all the required source files, invoke gcc locally to compile it, upload the resulting object file to S3, and then return a reply to the client.
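To make that flow a bit more concrete, here’s a rough sketch of the shape of the request and response that travel between the daemon and the Lambda worker. The type and field names here are my own illustration, not llama’s actual wire format:

    // Hypothetical types illustrating the daemon <-> Lambda protocol described
    // above; llama's real wire format differs.
    package sketch

    // InvokeRequest tells the Lambda worker which command to run and where in
    // S3 to find the files it needs.
    type InvokeRequest struct {
        Args    []string          `json:"args"`    // e.g. ["gcc", "-c", "foo.c", "-o", "foo.o"]
        Inputs  map[string]string `json:"inputs"`  // local path -> S3 key of the uploaded source or header
        Outputs []string          `json:"outputs"` // paths the worker should upload when the command finishes
    }

    // InvokeResponse reports the command's exit status and where the outputs
    // (here, the compiled object file) ended up in S3.
    type InvokeResponse struct {
        ExitStatus int               `json:"exit_status"`
        Outputs    map[string]string `json:"outputs"` // local path -> S3 key of the uploaded result
    }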
Understanding the performance of a system like this is a nightmare. There are so many things that might bottleneck and tank your latency. Maybe you’re saturating the local CPU with all the make jobs and the dependency discovery; maybe shuttling data to and from S3 is your problem. If you add more parallelism, maybe that means AWS has to start more Lambda workers instead of reusing them, and the cold-start latency ends up losing all of the gains from concurrency. Maybe the Makefile doesn’t have as much available parallelism as you’d like. Maybe it does, but there’s one compile in particular that’s much slower than the rest and so you end up waiting for it. Or any combination of the above!
There’s no “off-the-shelf” profiler for systems like this; you inevitably end up having to build some level of custom tooling and instrumentation to gather data. This post is about the tools I have built and used so far in my work on Llama.
Tracing Llama
Honeycomb
The first tool I attempted to integrate was Honeycomb. I am a huge fan of their product, and it’s specifically designed for observing gnarly distributed systems. I used their beeline package to add tracing spans and instrumentation to the llama daemon, sending a single trace tree for each “upload/invoke/download” cycle. I also added some simple timers to the remote executor and attached their results to the Honeycomb spans, which let me observe how long the Lambda process spent downloading vs executing the compiler vs uploading data.
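For flavor, instrumenting one of those operations looked roughly like the sketch below. It assumes the beeline-go API; the span and field names (and the uploadToS3 helper) are made up for illustration:

    // Sketch of wrapping an S3 upload in a Honeycomb span via
    // github.com/honeycombio/beeline-go. Illustrative only.
    package sketch

    import (
        "context"

        beeline "github.com/honeycombio/beeline-go"
    )

    func uploadWithSpan(ctx context.Context, key string, data []byte) error {
        // Start a child span of whatever span is already in ctx, and make
        // sure it gets sent when we return.
        ctx, span := beeline.StartSpan(ctx, "s3_upload")
        defer span.Send()

        span.AddField("s3.key", key)
        span.AddField("s3.bytes", len(data))

        err := uploadToS3(ctx, key, data)
        if err != nil {
            span.AddField("error", err.Error())
        }
        return err
    }

    // uploadToS3 stands in for the real S3 upload, which isn't shown here.
    func uploadToS3(ctx context.Context, key string, data []byte) error { return nil }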
Overall, I was pretty pleased with Honeycomb as a first go. I was able to query over all the invocations in a build to see how long various operations were taking, and use their tracing to zoom into individual compiler invocations. I could see pretty clearly, for instance, that my bottleneck was uploading to S3, with individual compiles spending multiple seconds sending data!
However, I ran into two frustrating limitations with Honeycomb:
- There was no way (as far as I can tell?) to do horizontal sums. I could look at “upload latency” over time, but if I wanted to, say, sum the total time (or get the average or p50 time) over an entire time window, Honeycomb does not seem to support that operation. Being unable to look at aggregate statistics over an entire build fed into my second complaint:
- It was difficult to compare builds. Honeycomb is (understandably) very focused on time, wanting to show results over time. There was no way to easily bookmark or name a “build” (i.e. an entire make invocation), and especially not to compare two side-by-side. If I made a change and wanted to ask if it, say, spent less time in S3 upload, I mostly had to just do two queries in separate tabs and eyeball.
Ultimately, these traces are small data, or at most medium data, and I kind of just wanted to download them so I could do arbitrary analysis locally; but I could not find a way to export full trace data from Honeycomb.
The experiment did sell me on the virtues of distributed tracing as a model for understanding performance. Being able to zoom into individual runs, and to break down metrics on arbitrary axes in order to understand patterns, was just too useful, and attempting to pre-aggregate metrics just seemed way too inflexible.
Roll-your-own
I briefly looked at AWS X-Ray and OpenTracing, but ultimately decided that this use case was (a) odd enough and (b) small enough that I would just build my own.
I coded up my own small tracing framework, with a pretty standard Span type:

    type Span struct {
        TraceId  string                 `json:"trace_id"`
        SpanId   string                 `json:"span_id"`
        ParentId string                 `json:"parent_id"`
        Name     string                 `json:"name"`
        Start    time.Time              `json:"start"`
        Duration time.Duration          `json:"duration"`
        Fields   map[string]interface{} `json:"fields"`
    }
I also built a small Go API for using package context to store and propagate traces. I built a simple exporter to newline-delimited JSON, and ported all my Honeycomb code over to the new framework. I could now run builds and get detailed tracing and timing data for each build in a local JSON file, ready for analysis.
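The context plumbing is nothing fancy. Here’s a minimal sketch of the idea, building on the Span type above (the names are illustrative; the real API differs in its details):

    // Minimal sketch of storing the current span in a context.Context and
    // starting children from it. Lives alongside the Span type above; uses
    // the standard context, crypto/rand, encoding/hex, and time packages.
    type spanKey struct{}

    // newId generates a random hex ID for traces and spans.
    func newId() string {
        var b [8]byte
        rand.Read(b[:])
        return hex.EncodeToString(b[:])
    }

    // StartSpan starts a new span as a child of whatever span ctx carries
    // (or a new root span if there is none), and returns a context carrying
    // the new span so that callees become its children.
    func StartSpan(ctx context.Context, name string) (context.Context, *Span) {
        span := &Span{
            SpanId: newId(),
            Name:   name,
            Start:  time.Now(),
            Fields: map[string]interface{}{},
        }
        if parent, ok := ctx.Value(spanKey{}).(*Span); ok {
            span.TraceId = parent.TraceId
            span.ParentId = parent.SpanId
        } else {
            span.TraceId = newId()
        }
        return context.WithValue(ctx, spanKey{}, span), span
    }

    // End records the span's duration; an exporter then writes it out as a
    // line of JSON.
    func (s *Span) End() { s.Duration = time.Since(s.Start) }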
Propagating traces
I knew I wanted my tracing to produce a single unified trace encompassing the code running in:
- The llamacc binary
- The llama daemon, which llamacc RPCs to and which actually talks to S3 and Lambda
- The remote code running inside Amazon Lambda
To handle this, I built an in-memory trace exporter, which just captures a set of Spans into a buffer that we can then pass around. I then (as sketched after the list below):
- Augmented the Lambda communication protocol to pass trace and span IDs, and to return a full set of spans recorded during the remote execution, which the local client then adds into its local trace recorder, so they end up in the JSON output, and
- Added an API to the llama daemon to accept traces, and had llamacc perform an additional RPC to log its own spans to the local daemon.
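Here’s a sketch of the client side of that first point, again with illustrative names: the daemon keeps an in-memory exporter, and after each Lambda invocation it folds the spans returned by the worker into the local trace.

    // Illustrative sketch, building on the Span type above (uses the standard
    // sync package): an in-memory exporter, plus the step that merges
    // remotely-recorded spans into the local trace.
    type MemoryExporter struct {
        mu    sync.Mutex
        spans []Span
    }

    func (e *MemoryExporter) Export(spans ...Span) {
        e.mu.Lock()
        defer e.mu.Unlock()
        e.spans = append(e.spans, spans...)
    }

    // mergeRemoteSpans re-tags the spans returned by the Lambda worker with
    // the local trace ID and records them alongside the locally-recorded
    // spans, so the whole upload/invoke/download cycle shows up as one tree.
    func mergeRemoteSpans(local *MemoryExporter, traceId string, remote []Span) {
        for i := range remote {
            remote[i].TraceId = traceId
        }
        local.Export(remote...)
    }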
This gave me unified tracing all the way from llamacc invocation through the local daemon and into AWS.
Using traces
Now that I had these traces, what to do with them? A single defconfig build of the Linux kernel produces about 400M of JSON, certainly far too much to look at by hand. I ended up building/using three tools to help make sense of these files:
A Chrome trace viewer export
Chrome ships with a powerful trace-viewing tool, available at chrome://tracing/. This viewer, originally designed for use by the Chrome developers, has become a de facto standard for viewing trace data. It has a semi-standard format for accepting trace data, so I built a converter from my JSON traces to that format. This let me view entire builds, looking something like this:
Being able to visualize and zoom into builds was useful, but I discovered that the trace viewer seems more designed for stack-trace-style data, where events neatly nest, than for asynchronous trace trees. I did not find an easy way to display traces where concurrent spans overlap and don’t nest, other than using lots of different “processes”, which becomes hard to navigate in the trace viewer.
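For reference, the conversion itself is simple: each span becomes one of the viewer’s “complete” events ("ph": "X", with microsecond timestamps), and the pid/tid assignment is exactly where the “lots of processes” awkwardness comes from. A rough sketch of the per-span mapping (the caller decides how to assign pid and tid, e.g. one pid per trace):

    // Sketch of converting a Span into a Chrome trace-viewer "complete" event
    // (the Trace Event Format). Lives alongside the Span type above; uses the
    // standard time package.
    type chromeEvent struct {
        Name string                 `json:"name"`
        Ph   string                 `json:"ph"`
        Ts   int64                  `json:"ts"`  // start time, microseconds
        Dur  int64                  `json:"dur"` // duration, microseconds
        Pid  int                    `json:"pid"`
        Tid  int                    `json:"tid"`
        Args map[string]interface{} `json:"args,omitempty"`
    }

    func toChromeEvent(s Span, pid, tid int) chromeEvent {
        return chromeEvent{
            Name: s.Name,
            Ph:   "X", // a "complete" event: has both a start time and a duration
            Ts:   s.Start.UnixNano() / int64(time.Microsecond),
            Dur:  int64(s.Duration / time.Microsecond),
            Pid:  pid,
            Tid:  tid,
            Args: s.Fields,
        }
    }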
Jaeger
Searching around, I discovered Jaeger, an open-source UI for viewing and exploring trace trees. It was pleasingly easy to deploy a local instance in Docker. I did not bother with any of its official APIs, because it has an “import from JSON” option right in the UI that accepts a semi-documented JSON format. This gave me a lovely UI for exploring traces.
SQLite
I mentioned earlier that I wanted to aggregate across builds and also compare builds against each other. For this purpose, I built a SQLite exporter for my trace data. Using SQLite’s JSON1 extension, I was able to keep the Fields metadata in my traces as a JSON blob, and import into SQLite in a one-span-per-row format. This made it incredibly easy to query traces in a very flexible way. I could easily import multiple traces — tagging them with a build_id field so they could be distinguished — and compare them interactively using SQL queries. For example, here’s a summary of a recent test build I did:
The visualization tools have been very helpful for doing deep dives into individual issues, but these SQL analyses have been the main driver of my performance work!
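To give a flavor of those analyses, here’s roughly the kind of per-build aggregate I mean, run from Go via the mattn/go-sqlite3 driver. The schema in this sketch is an assumption (one row per span in a spans table, columns named after the Span struct’s JSON tags, durations stored as nanoseconds, and the build_id tag living inside the fields JSON blob); the real exporter differs in the details:

    // Sketch of comparing builds with SQL via github.com/mattn/go-sqlite3.
    // The schema here is assumed, not llama's actual exporter output.
    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    const query = `
    SELECT json_extract(fields, '$.build_id') AS build,
           name,
           COUNT(*)            AS n,
           SUM(duration) / 1e9 AS total_seconds
    FROM spans
    GROUP BY build, name
    ORDER BY total_seconds DESC;
    `

    func main() {
        db, err := sql.Open("sqlite3", "traces.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query(query)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        // Print total time per (build, span name), most expensive first.
        for rows.Next() {
            var build, name string
            var n int
            var seconds float64
            if err := rows.Scan(&build, &name, &n, &seconds); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%-12s %-24s %6d %8.1fs\n", build, name, n, seconds)
        }
    }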
SQLite is a really incredible piece of software. I was able to import multiple traces — totaling a few gigabytes of data — and run multi-stage aggregates and comparisons with a few seconds of latency. I don’t even need to do any indexing or careful schema design; my one bit of sophistication is creating a view that pre-extracts a number of common fields from the JSON fields map, for slightly improved ergonomics.
If the data were even bigger, this is a use case where a column store would really shine — csv2parquet shrinks my CSV files by 6x, and I suspect it would do even better if I flattened the JSON into real columns. I briefly looked into sqlite-parquet-vtable, but was unable to get it to build. I gave up because SQLite is Just Fine for now.
Lessons Learned
This post is mostly a “what have I been up to” and an attempt to publish-as-I-go on llama, but I also want to reflect on some of the broader themes I take away here.
Understanding performance in distributed systems is really hard
Working on llama has really made me appreciate all over again just how hard performance work is in distributed systems. I’ve had somewhat similar experiences before, especially when attempting to optimize Stripe’s distributed CI system, but this one is hairier than anything I’ve worked with before. As our systems keep getting more complex, we are lagging behind at creating tools to understand them; I think there’s still an enormous opportunity for observability tooling for understanding and debugging modern systems. As mentioned earlier, I was a huge Honeycomb fan before starting this project; this experience really redoubles my belief in the problem that they are trying to solve, while also driving home for me that they have a lot more work to do to really crack it!
As someone who cares a lot about performance, it frustrates me just how hard it is to understand performance here, because it seems clear to me that as our systems get more and more complex, we’re just going to be continually leaving even more performance on the ground for simple lack of even noticing the problems.
Distributed tracing is a powerful data model
I’ve long been a little bit skeptical of distributed tracing tools, I think in part because I mostly haven’t worked on the kinds of systems where they really shine. But it’s also the case that I think the tools are still, as an ecosystem, a bit new, and are still figuring out exactly the right interfaces (both technical and user-facing), and especially how to best turn the sheer amount of data traces produce into actionable insights, reasonably efficiently.
Aside from the tools, though, this project definitely sold me on the data model of traces built up of linked spans as the right model to organize performance and metrics (and likely even debugging) data for distributed systems. It is very flexible and powerful and maps well to the ways we build systems. And, importantly, it supports arbitrary read-time aggregation, as well as focusing on individual requests or specific pools of requests. I initially considered trying to profile llama using simple profiling counters and metrics (using, perhaps, something like Prometheus), and it would have been very frustrating to be unable to do arbitrary post-hoc analyses of behavior patterns.
The tracing ecosystem is still young
It’s far more mature than it was when Twitter released Zipkin in 2012, but it’s still a confusing mess of options with no clear winners and too many standards. There’s a work-in-progress effort (which even seems to be succeeding) to unify a lot of the tools around the OpenTelemetry standards, but when I briefly looked into using the OpenTelemetry APIs or trace formats directly for my own tracing, I was overwhelmed by the complexity and the sheer number of options and settings. For actually consuming trace data, again, there’s a whole bunch of tools with apparently various strengths and weaknesses, and it’s hard to understand which ones you might want.
Thinking about resource utilization in the cloud is weird
I’m a big fan of Brendan Gregg’s USE Method for analyzing performance, and it’s on my list to try to apply it more rigorously to this system; but at the same time, when I think about all the different resources involved, both software and hardware, on all the different machines, I despair of actually enumerating and tracking them all in a coherent way.
Especially when autoscaling is involved, for something like Lambda, it’s not even clear to me that talking about hardware utilization is the right metric anymore; you can spin up near-infinitely many concurrent processes, but new processes will have a cold-start penalty, both from Lambda startup time and from having cold caches internally. I’ve seen cases where fixing a bug to get greater concurrency actually reduces performance, because of the cold-start penalties. I’m really not even sure what the right models are to think through these issues systematically.
It’s also hard to get good observability of resources in the cloud. There are a lot of software and hardware resources hidden inside the Lambda environment that you just don’t have direct visibility into. For a single system, essentially all work passes through or is mediated by the kernel in some form, so tools like DTrace or BPF can, with expertise and care, let you observe very nearly any resource you might care to. For distributed systems, our points of leverage aren’t nearly as good or as universal, and it’s a bit hard to imagine how they ever could be.
¹ Having a persistent daemon allows llama to keep persistent TLS connections to S3 and Lambda, as well as persist assorted in-memory caches and otherwise share state between invocations. For instance, the daemon is responsible for ensuring that only one process uploads a given C header file; the rest can just refer to the already-uploaded object in S3.