How We Run 16 AI Agents on $200/Month Without Going Broke
Issue #27 — The Rocky Relay Architecture Series
Last week I showed you how we bridged 50+ tools to Claude Code subprocesses via an MCP-over-HTTP shim. This week: the money problem.
Running 16 AI agents sounds expensive. It isn't — if you build the right budget system. We spend ~$200/month total (Claude Max subscription) and route overflow to providers that cost fractions of a penny per request. Here's exactly how.
The Problem Nobody Talks About
Every "multi-agent tutorial" shows you how to wire up API calls. None of them show you what happens at 3 AM when your research agent enters a recursive loop and burns through $47 in DeepSeek tokens before you wake up.
We learned this the hard way.
Without budget controls, AI agents are open spending taps. They don't know what things cost. They don't know they've been running for 4 hours. They just keep going until you pull the plug or your credit card screams.
The Architecture: Three Tiers, Three Budgets
Our LLM orchestrator classifies every request into one of three routing tiers, each with independent daily token limits:
```
┌─────────────────────────────────────────────┐
│ API_REASON  │ 200K tokens/day │ ~$2.19/M    │
│ DeepSeek R1, Gemini Pro                     │
│ Complex reasoning, multi-step debugging     │
├─────────────────────────────────────────────┤
│ API_CHEAP   │ 500K tokens/day │ ~$0.27/M    │
│ DeepSeek V3, Gemini Flash                   │
│ Content generation, code, analysis          │
├─────────────────────────────────────────────┤
│ LOCAL_FAST  │ Unlimited       │ Free        │
│ Ollama qwen2.5:14b (on-device)              │
│ Greetings, simple Q&A, extraction           │
└─────────────────────────────────────────────┘
```
The primary engine is Claude via Claude Max ($200/month, usage-based within the subscription). Everything that overflows or doesn't need Claude's capabilities gets routed to these three tiers.
The Router: A 3B Model That Saves Us Thousands
Here's the part that surprises people: we use a tiny local model to classify requests before they hit any API.
A qwen2.5:3b Ollama model (runs in ~50ms on our Mac Mini M4 Pro) reads every incoming prompt and classifies it:
- LOCAL_FAST: "What time is it in Manila?" → No API call needed
- API_CHEAP: "Write a 500-word blog post about MCP" → DeepSeek V3 handles this fine
- API_REASON: "Debug this race condition across three interacting TypeScript modules" → Needs real reasoning
The router outputs a confidence score (0-100%). If confidence is below 70%, the system escalates to the next tier up. Better to overspend slightly on a hard question than give a garbage answer from a cheap model.
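The escalate-on-low-confidence rule can be sketched roughly like this. Tier names come from the diagram above; `Classification` and `resolveTier` are illustrative names, not our actual code:

```typescript
type Tier = 'LOCAL_FAST' | 'API_CHEAP' | 'API_REASON';

// Cheapest to most capable, in escalation order.
const TIER_ORDER: Tier[] = ['LOCAL_FAST', 'API_CHEAP', 'API_REASON'];

interface Classification {
  tier: Tier;
  confidence: number; // 0-100, as reported by the 3B router
}

// Below the threshold, bump the request one tier up.
// API_REASON has nowhere to escalate to, so it stays put.
function resolveTier(c: Classification, threshold = 70): Tier {
  if (c.confidence >= threshold) return c.tier;
  const next = TIER_ORDER.indexOf(c.tier) + 1;
  return TIER_ORDER[Math.min(next, TIER_ORDER.length - 1)];
}
```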
Heuristic Fast-Path (Skip the Router Entirely)
For obvious cases, we don't even call the 3B router:
```typescript
// Greetings under 100 tokens → LOCAL_FAST (95% confidence)
if (tokenCount < 100 && isGreeting(prompt)) return LOCAL_FAST;

// Extract/rewrite/summarize under 800 tokens → LOCAL_FAST (90% confidence)
if (tokenCount < 800 && isSimpleTask(prompt)) return LOCAL_FAST;
```
This saves 1-3 seconds per trivial message. When you're processing 57+ cron jobs per day, those seconds add up.
Risk Flag Escalation
One override we added after a close call: medical, legal, finance, or security topics always route to API_CHEAP minimum, even if the router says LOCAL_FAST.
We run a pharma distribution company. We cannot have a local 14B model answering drug interaction questions with hallucinated confidence. The $0.001 per query is worth it.
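A minimal sketch of that override, with a hypothetical keyword list (the real trigger set is broader than this and not shown in the post):

```typescript
type Tier = 'LOCAL_FAST' | 'API_CHEAP' | 'API_REASON';

// Illustrative risk terms only: the production list covers
// medical, legal, finance, and security vocabulary.
const RISK_TERMS = /\b(drug|dosage|interaction|lawsuit|contract|invest|exploit|vulnerability)\b/i;

// Risky topics never stay on the local model: enforce API_CHEAP as a floor.
function applyRiskFloor(tier: Tier, prompt: string): Tier {
  if (tier === 'LOCAL_FAST' && RISK_TERMS.test(prompt)) return 'API_CHEAP';
  return tier;
}
```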
The Budget Tracker: 24-Hour Rolling Windows
The TokenBudget class is deceptively simple:
```typescript
class TokenBudget {
  private usage: Record<RouteType, { tokens: number; windowStart: number }>;
  private limits: Record<RouteType, number>;

  recordUsage(route: RouteType, tokens: number): boolean {
    this.maybeResetWindow(route);
    this.usage[route].tokens += tokens;

    const pct = this.getUsagePct(route);
    if (pct >= 80) emit('budget_warning', { route, pct });
    if (pct >= 100) emit('budget_exhausted', { route });

    return pct < 100;
  }

  hasBudget(route: RouteType): boolean {
    this.maybeResetWindow(route);
    return this.usage[route].tokens < this.limits[route];
  }
}
```
- 24-hour rolling window: Usage resets every 24 hours from the first request, not at midnight. This prevents the "dump everything at 11:59 PM" pattern.
- Warning at 80%: Agents get notified they're running hot. They can choose to defer non-urgent work.
- Hard stop at 100%: No more requests on that tier. Automatic downgrade kicks in.
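The rolling-window reset might look something like this; it's a standalone sketch of the `maybeResetWindow` method referenced in the class above, not the actual implementation:

```typescript
const WINDOW_MS = 24 * 60 * 60 * 1000;

interface UsageWindow {
  tokens: number;
  windowStart: number; // ms epoch of the FIRST request in this window
}

// Reset once 24h have elapsed since the first request in the window,
// rather than at a fixed midnight boundary.
function maybeResetWindow(w: UsageWindow, now = Date.now()): UsageWindow {
  if (now - w.windowStart >= WINDOW_MS) {
    return { tokens: 0, windowStart: now };
  }
  return w;
}
```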
The Automatic Downgrade Chain
This is where it gets interesting. When a tier's budget is exhausted, the system doesn't fail — it downgrades gracefully:
```
API_REASON (exhausted) → API_CHEAP (if available) → LOCAL_FAST
API_CHEAP  (exhausted) → LOCAL_FAST
LOCAL_FAST → always available (it's on-device)
```
```typescript
downgradeIfNeeded(route: RouteType): RouteType {
  if (this.hasBudget(route)) return route; // Budget available, proceed

  // Downgrade chain
  if (route === 'API_REASON' && this.hasBudget('API_CHEAP')) return 'API_CHEAP';
  return 'LOCAL_FAST'; // Ultimate fallback — always available
}
```
The fail-safe: If anything in the budget logic throws an error, it returns the original route rather than blocking. Better to overspend slightly than block all agent responses. A silent overspend is fixable; a bricked agent at 3 AM is not.
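Sketched, that fail-open behavior is just a try/catch around the routing call (names here are illustrative):

```typescript
type Tier = 'LOCAL_FAST' | 'API_CHEAP' | 'API_REASON';

// Fail open: if budget accounting itself throws, keep the original route
// instead of blocking the agent. Overspend is recoverable; a stuck agent isn't.
function safeRoute(route: Tier, downgrade: (r: Tier) => Tier): Tier {
  try {
    return downgrade(route);
  } catch {
    return route;
  }
}
```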
Per-Agent Cost Tracking
Every LLM call records:
```typescript
{
  agent: 'drucker',
  provider: 'deepseek',
  route: 'API_CHEAP',
  model: 'deepseek-chat',
  inputTokens: 1_247,
  outputTokens: 892,
  costUSD: 0.00058,
  duration: 2_340, // ms
  sessionId: 'abc123'
}
```
This goes into a SQLite token_usage table. We can query:
- By agent: Drucker (research) is our most expensive non-Claude agent. He loves long web research chains.
- By provider: DeepSeek V3 handles 60% of our overflow at ~$0.27/M tokens.
- By time: Usage spikes during business hours (Manila timezone) when RJ is actively delegating.
- Weekly Claude budget: Track Claude Max usage against subscription limits.
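The per-agent rollup is essentially a GROUP BY over that table. An in-memory equivalent, using a trimmed version of the record shape above (field names assumed from the log entry, not from our schema):

```typescript
interface UsageRecord {
  agent: string;
  provider: string;
  costUSD: number;
}

// Sum cost per agent: roughly equivalent to
//   SELECT agent, SUM(costUSD) FROM token_usage GROUP BY agent;
function costByAgent(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.agent, (totals.get(r.agent) ?? 0) + r.costUSD);
  }
  return totals;
}
```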
What This Looks Like in Practice
A typical day (real numbers from last week):
| Agent | Claude | DeepSeek | Ollama | Daily Cost |
|---|---|---|---|---|
| Rocky | 45 msgs | 3 overflow | 12 greetings | ~$0 (Claude Max) |
| Drucker | 5 research | 15 analysis | 2 trivial | ~$0.08 |
| Draper | 8 content | 22 emails | 5 parsing | ~$0.04 |
| Warhol | 12 drafts | 8 research | 3 simple | ~$0.05 |
| Others (12) | 20 total | 10 total | 30 total | ~$0.06 |
| Total | 90 | 58 | 52 | ~$0.23/day |
That's ~$7/month in overflow API costs on top of the $200 Claude Max subscription. 16 agents, 200+ daily LLM calls, for $207/month total.
The Confidence-Based Escalation Trick
The local Ollama model is surprisingly capable for simple tasks. But it knows when it's out of its depth — sort of.
We added a confidence threshold to the local worker: if the model's self-assessed confidence drops below 70%, it escalates to the API tier automatically.
```typescript
const localResult = await ollama.generate(prompt);

if (localResult.confidence < 0.7) {
  // "I'm not sure about this" → escalate
  return escalateToAPI(prompt, 'API_CHEAP');
}
```
This catches ~15% of local requests that would have produced bad answers. The cost is tiny ($0.001-0.01 per escalation). The quality improvement is significant.
The Provider Fallback Chain
Budget routing is one axis. Provider reliability is another. APIs go down. Rate limits hit. Timeouts happen.
Our fallback chain for the API tiers:
API_CHEAP: Gemini Flash → DeepSeek V3 → OpenRouter → Grok
API_REASON: Gemini Pro → DeepSeek R1 → OpenRouter
A health checker pings each provider every 60 seconds. If a provider is down or latency exceeds thresholds, it's temporarily removed from the chain. When it recovers, it's re-added.
This means our agents have never experienced an LLM outage since we built this system. Individual providers go down weekly. Our system doesn't notice.
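A sketch of the health-gated provider pick, with an illustrative latency ceiling (the real threshold isn't stated here):

```typescript
interface ProviderHealth {
  name: string;
  healthy: boolean;       // last 60s health-check ping succeeded
  lastLatencyMs: number;  // measured by the health checker
}

const LATENCY_CEILING_MS = 10_000; // illustrative threshold

// Walk the fallback chain in order; take the first provider that is
// up and within latency bounds.
function pickProvider(chain: ProviderHealth[]): string | null {
  for (const p of chain) {
    if (p.healthy && p.lastLatencyMs <= LATENCY_CEILING_MS) return p.name;
  }
  return null; // caller falls through to LOCAL_FAST
}
```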
What We'd Do Differently
1. Start with per-agent budgets, not just per-tier. We currently limit by API tier (API_CHEAP gets 500K tokens/day total). It would be better to also limit per agent — prevent Drucker from consuming the entire API_CHEAP budget on a single research rabbit hole.
2. Add prompt compression earlier. When escalating from local to API, we now compress long prompts to save tokens. We should have done this from day one. Compression saves 20-40% on token costs for long-context queries.
3. Track cost per TASK, not just per LLM call. A single task (like "research competitors") might trigger 15 LLM calls across 3 agents. We track individual calls but can't easily roll up to "this task cost $0.45." That would help us kill expensive recurring tasks that deliver low value.
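For point 2, a crude version of prompt compression, assuming whitespace collapsing plus tail truncation (real systems might summarize the history instead):

```typescript
// Naive compression: squeeze whitespace, then keep only the most
// recent portion of an over-long context.
function compressPrompt(prompt: string, maxChars = 8000): string {
  const squeezed = prompt
    .replace(/[ \t]+/g, ' ')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
  if (squeezed.length <= maxChars) return squeezed;
  // Keep the tail: recent turns usually matter most.
  return '…' + squeezed.slice(-maxChars);
}
```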
The Numbers
| Metric | Value |
|---|---|
| Claude Max subscription | $200/month |
| Overflow API costs | ~$7/month |
| Total LLM spend | ~$207/month |
| Daily LLM calls | 200+ |
| Agents served | 16 |
| Local (free) routing | ~26% of requests |
| Budget exhaustion events | 2-3 per week |
| Provider outage impact | Zero (auto-fallback) |
| Time to add new provider | ~30 min (implement adapter + add to chain) |
Next Week
Cron Jobs and Autonomous Execution — how we run 57+ scheduled jobs daily, handle the "what if two crons trigger the same agent at once" problem, and why episodic execution (short 8-turn episodes with structured handoffs) was the key to overnight autonomy.
Part 3 of a weekly technical series on building production AI agent systems. Written from the trenches — 16 agents, one Mac Mini, Cebu City, Philippines.
Subscribe: buttondown.com/the200dollarceo Full series: buttondown.com/the200dollarceo/archive