The Observation Post

The Agent Era Has Arrived But It Looks Nothing Like the Demos

Tech · AI · Cyber · Defence

27 May 2026 · 8 min read

Every demo video I have seen in the last year shows the same thing. A digital employee that books your flights, writes your code, does your shopping, sends your emails. You sit back and sip coffee. The room applauds. The VCs have already written the checks.

I have been building with these things for almost a year now and the gap between those demos and reality keeps getting wider. The agents that actually work are boring. The ones that go viral fall apart the minute you give them a real task. The teams I respect most are quietly walking away from general purpose agents in favor of something that looks like regular software with an LLM bolted onto it.

I watched a demo last month from a startup with serious funding. Their agent opened Chrome, navigated to a government portal, filled out a form across multiple pages, uploaded a document, and submitted it. Flawless. Clean network, known layout, no surprises.

I asked the CEO what happens when the portal changes its HTML. He laughed nervously.

The web is a minefield for these agents. Captchas block them. Auth flows change without warning. A button moves three pixels in a CSS update and the agent's locator logic breaks. A popup appears and it clicks the wrong thing. The failures cascade. One wrong click compounds into something that looks like a toddler mashing keys, except the agent will tell you everything went fine and you do not find out otherwise until hours later.

I ran these agents against real tasks as part of my work. Booking a flight works maybe four out of ten times if the site is simple. Filing an expense report works one or two out of ten. The failures are silent. The agent thinks it succeeded, it reports success, and you only discover the truth when no confirmation arrives.

WebArena, the web agent benchmark that came out of Carnegie Mellon, tells the same story. The best web agents score around 35 to 40 percent task completion on its realistic scenarios. That is not production ready. It does not matter how good the curated demo looks.

So what actually works?

Coding assistants are the success story everyone points to and they earn it. Copilot, Cursor, the terminal agents. They work because the feedback loop is tight and the environment is constrained. You are editing a file in a known project structure. The linter catches bad function calls in seconds. The compiler fails fast. When an agent hallucinates, the system corrects itself before the error propagates.

But even here the limits show up. I have watched agents confidently generate an entire test suite that passed lint and compilation and tested the wrong thing. The tests were green, the coverage looked great, and the application still crashed in production because the agent tested its own assumptions rather than the actual behavior. The feedback loop catches syntax errors, not semantic ones. Semantic errors need a human, always have.

Customer support triage is another one that works. I know a company routing about 40 percent of inbound tickets through an agent that tags, prioritizes, and drafts responses. A human reviews every draft before it reaches a customer. Errors get caught before they matter. The agent handles the volume, the human handles the judgment.

Data extraction pipelines. PDF to JSON. Monitoring dashboards. Scraping structured data from layouts you already understand. These all work because they are narrow and repetitive and you can batch verify the results. Nobody is asking the agent to think creatively. They are asking it to do a defined task within defined parameters, and if it drifts, the next item in the batch catches it.

The agent is never the product. It is a component inside a system with human guardrails. Take the human out and reliability drops off a cliff.

There is a structural reason open ended agents fail that I think about a lot. Every LLM call has a nonzero error rate. Say each step succeeds 95 percent of the time, which is generous. Then a ten step agent has about a 60 percent chance of finishing clean. A twenty step task drops to 36 percent. At fifty steps it is under 8 percent, so you are basically guaranteed a failure somewhere.

A better model does not fix this. Even a 99 percent per step model that does not exist yet gives you only 90 percent at ten steps and 60 percent at fifty. The math does not forgive.
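The arithmetic here is just the per step success rate raised to the number of steps. That assumes every step fails independently of the others, which is a simplification, since real failures tend to cluster. A quick sketch, with `chain_success` as a name made up for this post:

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps


# Print the grid of per step rates and chain lengths discussed above.
for per_step in (0.95, 0.99):
    for steps in (10, 20, 50):
        rate = chain_success(per_step, steps)
        print(f"{per_step:.2f} per step, {steps} steps: {rate:.3f}")
```

Plug in your own step counts. Most real workflows have more steps than you think once you count every click, fetch, and parse.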

The fix is architecture. Shorter chains. More verification steps. Tighter feedback loops. A willingness to say "I don't know" and hand off to a person. But building that is hard engineering. Not prompt tweaking. Most agent startups skip it because it makes the demos slower.
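One way to picture that architecture, as a minimal sketch rather than anyone's production code. Every name here (`Step`, `NeedsHuman`, `run_chain`) is hypothetical. The idea is that each step carries its own independent check, gets a bounded number of attempts, and the chain stops and escalates instead of reporting success it cannot prove.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]     # does the work and returns the updated state
    verify: Callable[[dict], bool]  # independent check that the work really happened


class NeedsHuman(Exception):
    """Raised when the agent should stop and hand the task to a person."""


def run_chain(steps: list[Step], state: dict, max_attempts: int = 2) -> dict:
    for step in steps:
        for _ in range(max_attempts):
            candidate = step.run(dict(state))  # work on a copy so a bad attempt is discarded
            if step.verify(candidate):
                state = candidate
                break
        else:
            # Every attempt failed verification: escalate instead of claiming success.
            raise NeedsHuman(f"step {step.name!r} failed verification")
    return state


steps = [
    Step("fill_form", lambda s: {**s, "form": "filled"}, lambda s: s.get("form") == "filled"),
    # This step never produces a confirmation, so the chain hands off.
    Step("submit", lambda s: s, lambda s: "confirmation_id" in s),
]

try:
    run_chain(steps, {})
except NeedsHuman as err:
    print(f"handing off: {err}")
```

The verify functions are where the real engineering goes. A check that only asks the model whether it succeeded is not a check.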

The teams I know shipping agents in production share nothing with the sales decks. They use tiny models for specific things. A 7B parameter model for classification. A fine tuned 1B model for entity extraction. The big model only fires when the small ones disagree. Cheaper, faster, more reliable than routing everything through GPT-4.

I have seen one team build what they call a router. A small classifier model looks at an incoming request and decides which specialized subagent should handle it. The subagent is another small model fine tuned for exactly one thing. The big expensive model only ever gets called when the router can't decide, which happens maybe 5 percent of the time. Their costs dropped by 80 percent and their accuracy went up because each small model is excellent at its one job instead of mediocre at everything.
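I have not seen that team's code, so treat this as a rough sketch of the shape. The keyword classifier stands in for a small model, the handlers stand in for fine tuned subagents, and the 0.8 confidence threshold is an assumption.

```python
from typing import Callable

# Hypothetical stand-ins: in a real system each of these would wrap a model call.
Classifier = Callable[[str], tuple[str, float]]  # returns (route label, confidence)
Handler = Callable[[str], str]


def make_router(classify: Classifier, subagents: dict[str, Handler],
                fallback: Handler, min_confidence: float = 0.8) -> Handler:
    def route(request: str) -> str:
        label, confidence = classify(request)
        handler = subagents.get(label)
        if handler is None or confidence < min_confidence:
            return fallback(request)  # the big expensive model, ideally rare
        return handler(request)
    return route


def toy_classifier(request: str) -> tuple[str, float]:
    # Keyword matching in place of a small classifier model.
    text = request.lower()
    if "invoice" in text:
        return "billing", 0.95
    if "password" in text:
        return "account", 0.90
    return "unknown", 0.30


route = make_router(
    toy_classifier,
    {"billing": lambda r: "billing-agent", "account": lambda r: "account-agent"},
    fallback=lambda r: "large-model",
)

print(route("Where is my invoice?"))
print(route("Something strange happened"))
```

The number to watch is how often the fallback fires. If it creeps up, the classifier or the set of subagents has drifted away from the traffic.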

Tools over reasoning. Give the agent an API call instead of asking it to navigate a website. A structured extraction template instead of asking it to parse an email naturally. Every unstructured surface is another opportunity for the agent to fail in a creative new way. The teams that succeed treat the model as the worst part of their system and design around its weaknesses, not its strengths.
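A structured extraction template can be as plain as a fixed set of fields plus a parser that refuses anything else. The sketch below is illustrative: the `ExpenseClaim` fields and the currency list are made up, and it assumes the model has been asked to reply with JSON only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpenseClaim:  # hypothetical template
    vendor: str
    amount_cents: int
    currency: str


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed list


def parse_claim(model_output: str) -> ExpenseClaim:
    """Turn raw model output into an ExpenseClaim, or raise ValueError."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict) or set(data) != {"vendor", "amount_cents", "currency"}:
        raise ValueError("unexpected or missing fields")
    vendor, amount, currency = data["vendor"], data["amount_cents"], data["currency"]
    if not isinstance(vendor, str) or not vendor.strip():
        raise ValueError("vendor must be a non-empty string")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValueError("unsupported currency")
    return ExpenseClaim(vendor.strip(), amount, currency)


claim = parse_claim('{"vendor": "Acme", "amount_cents": 1250, "currency": "USD"}')
print(claim)
```

Anything that raises goes to a retry or a person. The model never gets to invent a field or a format.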

Humans actually in the loop. Not a checkbox. A real workflow where the agent drafts and the human confirms, and every correction trains the next version. The best teams treat this as a continuous training pipeline, not a QA gate. I know a company that collects every human correction, clusters them by type, and fine tunes a small model on the most common corrections every two weeks. The error rate drops measurably with each cycle.
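The collection side of that loop might look something like this. It is a sketch under assumptions: the reviewer tags each correction with a `kind`, which stands in for whatever clustering the company actually runs, and the output is just draft and final pairs grouped for the next fine tuning run. All names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Correction:
    kind: str   # reviewer-assigned label, e.g. "wrong_priority" (hypothetical)
    draft: str  # what the agent produced
    final: str  # what the human approved


def training_batches(corrections: list[Correction],
                     top_n: int = 3) -> dict[str, list[tuple[str, str]]]:
    """Group (draft, final) pairs by kind and keep the most common kinds."""
    by_kind: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for c in corrections:
        if c.draft != c.final:  # drafts approved unchanged are not corrections
            by_kind[c.kind].append((c.draft, c.final))
    ranked = sorted(by_kind.items(), key=lambda kv: len(kv[1]), reverse=True)
    return dict(ranked[:top_n])
```

Tracking the count per kind from one cycle to the next is also the cheapest way to see whether the last fine tune moved anything.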

Obsess over observability. Every agent action gets recorded, replayed, scored. Failures get categorized. The team fixes the top three failure modes every sprint. Reliable agents are not designed in a lab. You build them by watching them break in production and patching the breaks one at a time.
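The bare minimum version of that is an append only log of actions plus a count of failure modes. The sketch below keeps everything in memory and uses invented names (`ActionLog`, `failure_mode`). A real setup would write to durable storage and record far more context per action.

```python
import time
from collections import Counter
from typing import Optional


class ActionLog:
    def __init__(self) -> None:
        self.events: list[dict] = []

    def record(self, run_id: str, action: str, ok: bool,
               failure_mode: Optional[str] = None) -> None:
        self.events.append({
            "ts": time.time(),
            "run_id": run_id,
            "action": action,
            "ok": ok,
            "failure_mode": failure_mode,
        })

    def replay(self, run_id: str) -> list[dict]:
        """Return one run's actions in the order they were recorded."""
        return [e for e in self.events if e["run_id"] == run_id]

    def top_failures(self, n: int = 3) -> list[tuple[str, int]]:
        """Most common failure modes, e.g. the three to fix this sprint."""
        counts = Counter(
            e["failure_mode"] or "uncategorized"
            for e in self.events
            if not e["ok"]
        )
        return counts.most_common(n)
```

A large "uncategorized" bucket is a finding in itself. It means nobody has looked at those failures yet.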

The reliability tradeoff is hidden in plain sight. The more reliable you make an agent, the more engineering effort you invest, and the narrower its scope becomes. An agent that processes invoices with 99.9 percent reliability took someone six months to build and will never do anything else. An agent that tries to do everything will always be wrong in unpredictable ways. You can have broad or you can have reliable. You cannot have both with current technology.

A lot of the current hype wave reminds me of the early chatbot era in 2023. Everyone was sure that LLMs would replace customer service entirely. What actually happened is that companies deployed chatbots, customers hated them, and the ones that survived ended up as triage layers that hand off to humans fast. The agent wave is going through the same arc, just compressed. We are in the phase where the hype exceeds the capability by a wide margin. The useful stuff will emerge after the disappointment settles.

I think specialization wins over generalization here. Not the agent that does everything. The one that processes invoices really well. The one that triages support tickets. The one that catches bugs in CI. Each one is narrow and limited, and each one saves real money by automating something boring a human was doing.

The agent startups I am interested in sell into specific verticals with specific workflows. Healthcare billing. Logistics routing. Insurance claims. The inputs are structured, the error modes are known, the ROI is calculable. Nobody needs an agent that thinks. They need something that does this one stupid task reliably so their humans can work on the stuff that actually matters.

I keep coming back to something a founder told me after two years of building agent infrastructure. His customers kept saying the same thing. We do not need an agent that thinks. We need an agent that reliably does this one boring task.

That is where we are. Agents in 2026 are not your new coworker. They are slightly smarter cron jobs with guardrails. The useful ones are narrow, invisible, and boring. And once you stop expecting them to be more than that, they actually start being useful.

The practical takeaway for anyone trying to build with these things: start with the smallest possible version of the task. Do not try to build a general agent. Build something that does exactly one thing, put a human reviewer in the loop, measure the error rate, and only expand scope when you have the failure modes under control. That is not as exciting as the demo videos. But it ships.

The Observation Post — daily posts on tech, AI, and what matters.
