Quis custodiet ipsos custodes? (Who watches the watchmen?)
[Hi, there’s a bug in the Pilot integration with Buttondown, so you get a copy-and-paste version of the article on the site.]
Sometimes you don't realize you've discovered something until you're hip-deep in it. We're working on a new, pretty ambitious project, and we found ourselves using the same structural solution to three different problems.
And we were like, hey.
The "structural solution" is adversarial architecture. And now that we see it, we can't unsee it.
Here's the short version. AI-first systems have a failure mode that traditional software doesn't: they're confidently wrong. When a database query fails, you get an error. When an LLM misinterprets a document, you get a perfectly formatted, completely plausible, subtly incorrect answer. Nothing in the system tells you it's wrong. The output looks exactly the same whether the system understood the input or hallucinated its way through it. If you've built anything serious on top of a language model, you know this feeling: the demo works beautifully, then real data arrives and you start finding errors that nobody caught because the errors look like answers. Your users don't report bugs; they just quietly stop trusting the system. And once that trust is gone, it's almost impossible to get back.
We kept running into this. Our extraction layer would over-classify things: a suggestion tagged as a commitment, a conditional statement treated as a firm assertion.
There's a technique we've been reading about called "Generative Adversarial Networks." GANs. They're a way to train generative models: a generator proposes outputs, a discriminator tries to catch the fakes, and each gets better by fighting the other. Deep science stuff. The basic idea, as we understand it, is "build a system to argue with your system." For an overexuberant extraction layer, the solution is a second, adversarial, hypercritical model with different instructions: prove the extraction layer wrong. The first model's incentive is to find meaning; the second model's incentive is to find errors. The tension between those two objectives produces output that's better than either model alone. More importantly, it produces output with honest confidence scores: "I'm sure about this one" vs. "this is probably right but the language was hedged" vs. "this is an inference and you should verify it."
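The shape of the pattern doesn't depend on any particular model API. In the sketch below, `proposer` and `challenger` are rule-based stand-ins for what would be two LLM calls in a real system, and the resolution step maps their disagreement onto the three confidence tiers described above. Every name here is illustrative, not from our codebase.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    text: str        # source sentence
    label: str       # what the proposer thinks it is
    confidence: str  # set by the resolution step

def proposer(sentence: str) -> str:
    """Stand-in for the extraction model: eager to find meaning,
    so it tags anything future-oriented as a commitment."""
    return "commitment" if "will" in sentence or "might" in sentence else "statement"

def challenger(sentence: str, label: str) -> list[str]:
    """Stand-in for the critic model: its only job is to prove the label wrong."""
    objections = []
    if label == "commitment" and any(h in sentence for h in ("might", "maybe", "if ")):
        objections.append("hedged or conditional language")
    return objections

def resolve(sentence: str) -> Extraction:
    """Resolution: turn the tension between the two models into a confidence tier."""
    label = proposer(sentence)
    objections = challenger(sentence, label)
    if not objections:
        confidence = "sure"
    elif len(objections) == 1:
        confidence = "probably right, but hedged"
    else:
        confidence = "inference - verify"
    return Extraction(sentence, label, confidence)

print(resolve("We will ship the report Friday."))
print(resolve("We might ship the report Friday."))
```

The point of the toy rules is the structure: the challenger never proposes its own answer, it only attacks the proposer's, and the confidence score falls out of how well the answer survives the attack.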
Have you noticed, when you're working with an LLM conversationally, that it is overwhelmingly supportive and positive? Claude will agree with an assertion and then agree just as emphatically when you change your mind, right? So you learn that supportive isn't the same as complete, that the question you ask is really important, that you need to corroborate, that you need to force tension into a process that's trying to eliminate tension at every turn.
Same thing at the system architecture level.
We used the same pattern with test data: build an independent system whose job is to produce realistic inputs for the application, where neither project knows about the other. We used it again with migration: when you ingest data from an existing system, the data arrives with labels (this is a "Customer," this deal is in "Negotiation," this health score is "Green"). Those labels were assigned by the old system's logic, which is exactly the logic you're trying to transcend. So you treat the imported labels as hypotheses to be tested, not facts.
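Treating imported labels as hypotheses can be as mechanical as re-deriving each label from the raw signals and flagging disagreements for review. The field names and the re-scoring rule below are invented for illustration; the pattern, not the thresholds, is the point.

```python
def rescore_health(account: dict) -> str:
    """Independently re-derive the health score from raw signals,
    deliberately ignoring the label the old system assigned.
    (Thresholds are made up for illustration.)"""
    if account["days_since_contact"] > 60 or account["open_tickets"] > 5:
        return "Red"
    if account["days_since_contact"] > 30:
        return "Yellow"
    return "Green"

def audit(accounts: list[dict]) -> list[dict]:
    """Keep labels that survive the challenge; flag the rest as hypotheses."""
    flagged = []
    for a in accounts:
        rederived = rescore_health(a)
        if rederived != a["imported_health"]:
            flagged.append({**a, "rederived": rederived})
    return flagged

accounts = [
    {"name": "Acme", "imported_health": "Green",
     "days_since_contact": 75, "open_tickets": 1},
    {"name": "Globex", "imported_health": "Green",
     "days_since_contact": 10, "open_tickets": 0},
]
for a in audit(accounts):
    print(f'{a["name"]}: imported {a["imported_health"]}, re-derived {a["rederived"]}')
```

An account the old system called "Green" but that hasn't been contacted in 75 days gets surfaced for a human to adjudicate, rather than silently inheriting the old system's optimism.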
Three different problems. Three different layers of the architecture. The same underlying insight every time: a single model's output, no matter how capable, isn't trustworthy on its own. You need structured opposition. A proposer and a challenger. A thesis and a critique. Quality from tension, not from trust.
That's when we wrote the paper.
"The Confident Wrong Answer" lays out the pattern formally: where it comes from (the lineage runs from peer review through red teaming through GANs), how it works at the application layer (Proposer, Challenger, Resolution), why it's economically viable (the Challenger is cheaper than the Proposer; the adversarial pass adds 25-30% to extraction cost, which is the cost of not shipping garbage), and what it means for anyone building AI-first products.
We didn't set out to write a paper about adversarial architecture. We set out to build a product that works. The pattern emerged because the problems demanded it; the paper exists because we think once you see it, you realize it applies everywhere: any system where an LLM produces output that humans will act on. Which, increasingly, is every system.
If you're building something like this, or thinking about it, or arguing with your team about how to make your AI features trustworthy, the paper might save you a few months of arriving at the same place we did. [Read the full whitepaper here.]