Decision Logs

c'est la vie

                August 16, 2024

I haven't been writing much here recently, mostly because (a) I have a new infant at home, and (b) I've been spending much of my free time migrating my blog to Hugo (which is now live), setting up POSSE-style content syndication on social media (see here), and pushing a handful of technical blog articles past the finish line, including:

An FHE overview
A catalog of FHE uses in production
A tutorial on using PDLL in MLIR

So the softer material that I like to write about here hasn't been flowing much.
That said, Justin Duke, who runs Buttondown which powers this newsletter, recently wrote on Bluesky about the idea of a decision log: keep track of the decisions you make and why.
I independently had this idea when I worked in Google's supply chain world. I called it the "policy change log," since it was primarily there to keep track of requests for business logic changes that came from outside our org. Small stuff like, "make sure this type of RAM is preferred over this other type of RAM for these kinds of machines."
These requests often came over private chats or in small meetings between my team and one stakeholder. They were somewhat necessary, in the sense that the automated policy management systems we had in place were not expressive enough to support their requirement.
After a year or two of fielding these requests, I started to see that the people asking for the features were sometimes wrong about their motivation (they believed something that was not true, e.g., something cost more than another thing) or they simply didn't have all the information, and that the person who did wasn't around to set them straight. It might not be discovered until months later that our automation was doing the wrong thing, and the policy would have to be reversed.
So I instituted an informal policy for our team, and made sure my engineers were on board: don't make random out-of-band business rule changes without a 1-page doc explaining the change and why it would be made. I.e., these are changes requested without the sorts of docs and prioritization that accompanied month+ long projects.
This was a relatively small amount of extra fuss. It took maybe one hour to write up the doc and get it approved by the person asking for it. Then we'd send out a short email to common stakeholders saying "by the way, we're going to do this if nobody tells us not to." And then I'd post a link to the doc with a 1-line summary in a markdown table on our team's internal website (the "policy change log").
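For illustration only, here is a minimal Python sketch of how a log like this could be kept as structured entries and rendered into a markdown table. It is not the tooling described above: the field names, the example rows, the dates, and the example.com URLs are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyChange:
    """One row of the log. All field names here are hypothetical."""
    approved_on: date
    summary: str       # the 1-line summary
    requested_by: str  # who asked for and approved the change
    doc_url: str       # link to the 1-page doc


def _cell(text):
    # Escape pipes so free-text fields cannot break the table layout.
    return text.replace("|", "\\|")


def render_log(entries):
    """Render the entries as a markdown table, newest first."""
    lines = [
        "| Date | Summary | Requested by | Doc |",
        "| --- | --- | --- | --- |",
    ]
    for e in sorted(entries, key=lambda e: e.approved_on, reverse=True):
        lines.append(
            f"| {e.approved_on.isoformat()} | {_cell(e.summary)} "
            f"| {_cell(e.requested_by)} | [doc]({e.doc_url}) |"
        )
    return "\n".join(lines)


# Placeholder entries; the dates, names, and URLs are made up.
entries = [
    PolicyChange(date(2020, 1, 15),
                 "Prefer RAM type A over RAM type B for machine class X",
                 "stakeholder-1", "https://example.com/docs/ram-preference"),
    PolicyChange(date(2021, 3, 2),
                 "Revert the RAM preference after a postmortem",
                 "stakeholder-2", "https://example.com/docs/ram-revert"),
]

print(render_log(entries))
```

The part that matters is that every row names a requester and links to a doc, since that is what lets you point back at who asked for a change and why.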
This gave us a few superpowers relative to the added fuss. It gave us proof and receipts when the requested policies led to bad outcomes. As a consequence of being hidden behind a complex automated system, our software was blamed by leadership as unreliable. With the log, when someone complained, we could point to the doc and say, "Go talk to so and so who asked for and approved this quirky behavior." 
While saying "we need to write a 1 page doc" was sometimes enough to make the person drop the request entirely, it also gave us a lot of ammo to say "no" to repeated requests—new management comes in and doesn't remember the failures of the past—or near-duplicate requests that repeat similar mistakes. It's extremely powerful to point to two docs, written a year apart, the first of which has a poorly-justified policy change, and the second of which links to a postmortem or list of bugs, and say, "your proposal smells like this, figure out a way to avoid it, or else you better quantify how much money it's going to save us." While they did that, we had more time to do what we saw as our more important work.
The policy change log also helped PMs see what the hell was going on with our software. It shone light on a chaotic organizational system, in which our team's software was at the center of many competing priorities. In an internal talk, I used an image of a sack of money in the center of a 4-way tug of war, with each department labeled at the end of each rope. They all claimed their changes were necessary to save Google money, but it turned out most of them were mainly optimizing for their own metrics, and sometimes doing that gave net negative savings. This is not a new revelation by any means, but having the policy change log gave me sufficient evidence to justify broader projects that tried to navigate the competing incentives. Perhaps I can write about the engineering work I did on that in a future newsletter.
I had about 25 entries in the log—most of which I wrote myself—before leaving the org to work on cryptography compilers. I checked today, and it seems my team did not continue adding to the policy change log after I left. As I try to get my fussy toddler to say whenever something doesn't go his way, c'est la vie.
