Uber capped its AI budget in 2026 after blowing through the whole year's allocation in four months. A different company spent $500 million on Claude in thirty days, by accident, and found out from the bill. Somewhere on X there's a screenshot of someone's agent mid-task with the caption "my AI agent is dead until Friday. rip 🪦" — it hit a rate limit, stopped, and nobody noticed until the deadline was already gone.
Three different cases, three different scales, the same failure: the agent didn't know its own limit until it hit it.
The gap nobody built for
A REST call is simple: one caller, one request, one clear answer to "did that work." An AI agent is a loop: it calls a model, gets a plan back, calls a tool, calls the model again, maybe spawns a sub-agent that does the same thing. Every one of those hops can burn tokens, and none of them asks "how much do I have left" before it goes. They find out by getting a 429, or they don't find out at all until someone reads the invoice.
Rate limiting middleware has existed for twenty years and none of it was built for this. It counts requests, not the tokens inside them — a single streaming completion can burn 100,000 tokens in one call, and a rate limiter built for REST sees exactly one request either way.
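To make the mismatch concrete, here is a minimal sketch of the two counting models side by side. The requestLimiter and tokenBudget types and their limits are made up for illustration and are not part of RateGuard.

```go
package main

import "fmt"

// requestLimiter counts calls, the way classic REST middleware does.
// Hypothetical type for illustration only.
type requestLimiter struct {
	limit, used int
}

// Allow admits a call while the request count is under the limit.
func (r *requestLimiter) Allow() bool {
	if r.used >= r.limit {
		return false
	}
	r.used++
	return true
}

// tokenBudget counts the tokens each call reports, not the calls themselves.
// Hypothetical type for illustration only.
type tokenBudget struct {
	limit, used int
}

// Record adds one call's token usage and reports whether the budget still holds.
func (b *tokenBudget) Record(tokens int) bool {
	b.used += tokens
	return b.used <= b.limit
}

func main() {
	reqs := &requestLimiter{limit: 100}
	toks := &tokenBudget{limit: 50_000}

	// One streaming completion that burns 100,000 tokens.
	fmt.Println("request limiter allows:", reqs.Allow())     // true: 1 of 100 requests used
	fmt.Println("token budget holds:", toks.Record(100_000)) // false: 100,000 > 50,000
}
```

The request counter is nowhere near its limit after that call, while the token counter is already at twice its cap.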
What I built instead
RateGuard wraps the HTTP client your LLM SDK already uses. It takes one line: rg.WrapClient(&http.Client{}) in Go, rg.wrapFetch() in Node, rg.wrap_httpx_client() in Python. Every call through it gets metered with the token count the provider actually returned, budgeted against an hourly, daily, or monthly cap, and routed through a breaker that isolates a dying provider from a healthy one. No proxy sits in front of your calls. No new service to run.
The part that's actually new is the direction of the question. Instead of the agent finding out it's out of budget by getting rejected, it asks first:
tools := rg.MCPTools()
// get_token_budget, get_rate_limit_state, get_circuit_breaker_state,
// check_loop, list_limits — an agent calls these before it spends, not after.
Five tools, over the Model Context Protocol, callable from Claude Code, Cursor, or anything else that speaks MCP. Querying never consumes the budget it's asking about — an agent can check as often as it wants for free.
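The ask-first pattern can be sketched in a few lines, independent of MCP. Everything here is an assumption made for illustration: the budget and step types, the per-step token estimates, and the runPlan helper are hypothetical and are not RateGuard's API. The point is only that the read-only query and the spend are separate operations.

```go
package main

import (
	"fmt"
	"sync"
)

// budget is a hypothetical in-process token budget, not RateGuard's type.
type budget struct {
	mu    sync.Mutex
	limit int
	spent int
}

// Remaining is a read-only query: it never changes spent.
func (b *budget) Remaining() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.limit - b.spent
}

// Spend records tokens used by a completed call.
func (b *budget) Spend(tokens int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.spent += tokens
}

// step is one planned hop with an estimated token cost (an assumption;
// real costs are only known after the provider responds).
type step struct {
	name     string
	estimate int
}

// runPlan runs steps in order and stops before the first one whose
// estimate exceeds what is left. It returns the names of the steps it ran.
func runPlan(b *budget, plan []step) []string {
	var done []string
	for _, s := range plan {
		if s.estimate > b.Remaining() {
			break // ask first: stop cleanly instead of failing mid-task
		}
		b.Spend(s.estimate) // a real agent would record the reported usage here
		done = append(done, s.name)
	}
	return done
}

func main() {
	b := &budget{limit: 10_000}
	plan := []step{{"plan", 2_000}, {"search", 5_000}, {"summarize", 6_000}}
	fmt.Println(runPlan(b, plan), b.Remaining()) // [plan search] 3000
}
```

The agent stops before the step it cannot afford, with 3,000 tokens still in hand, rather than discovering the limit halfway through it.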
The part I shipped this week: budgets you can actually hand off
Knowing your own limit solves the single-agent case. It doesn't solve the one that's becoming normal: an orchestrator spins up a sub-agent to handle one piece of a task, and that sub-agent has no enforced limit at all — it just inherits whatever trust the orchestrator gave it, which in practice means unlimited, because nobody wrote the code to make it otherwise.
So RateGuard now mints a token for that handoff. The orchestrator signs a budget — say, 10,000 tokens, OpenAI only, expires in ten minutes — and the sub-agent gets a cryptographic object that proves exactly that scope, nothing more. If the sub-agent tries to spawn a further sub-agent with a larger budget, or a longer expiry, or a provider that wasn't in the original grant, the signature check rejects it. The chain can only get narrower going down, never wider — the same shape the IETF's draft Agent Identity Protocol is standardizing around, which tells you this isn't a made-up problem; other people are independently arriving at the same fix.
// Root grant signed with the authority key: 100,000 tokens, MaxDepth 3, one hour.
root, _, _ := rateguard.NewRootBudgetToken(authorityKey, rateguard.AttestOptions{
	Grant: rateguard.BudgetGrant{MaxTokens: 100_000, MaxDepth: 3, ExpiresAt: oneHourFromNow},
})
// Narrower grant for the sub-agent: 10,000 tokens, MaxDepth 0, ten minutes.
delegated, subAgentKey, _ := rateguard.Attest(root, orchestratorKey, rateguard.AttestOptions{
	Grant: rateguard.BudgetGrant{MaxTokens: 10_000, MaxDepth: 0, ExpiresAt: tenMinutesFromNow},
})
What's next
The next post is about why this runs inside your process instead of in front of it as a gateway — and about the March 2026 incident that makes that a security argument, not just a preference. After that: the meta one, about building a rate limiter for AI agents mostly by directing AI agents, including the session where two of them were editing the same file at once and I had to sort out whose change was actually correct.
RateGuard is MIT-licensed, with three SDKs (Go, Node, Python) and 212 tests.
The code is at github.com/varbees/rateguard; the docs, including a walkthrough you can run with go run, no API key required, are at rateguard.antharmaya.com/docs.
