Observations on AI agent token consumption
A new paper from researchers at Stanford, Michigan, DeepMind, All Hands, Microsoft AI and MIT is the most detailed open empirical study I’ve seen of how AI agents actually spend tokens at scale¹. The authors run eight frontier models across 500 SWE-bench Verified tasks with four runs each, capturing full trajectory telemetry decomposed by token type, phase and action. They release the dataset alongside the paper; to my knowledge it is the most granular public corpus of agentic trajectories currently available.
The paper is rigorous, careful about what it claims and puts hard numbers on questions that have until now only been answered with anecdotes. I’d recommend reading it in full.
What follows is a walk through four of the paper’s observations, interleaved with what we are seeing at Flowstate, where the exact same patterns surface in customer environments. We sit in the request path between the user and the AI provider, which means we observe the same trajectories the paper analyses, but in production, and across a much broader set of AI tools than SWE-bench covers.
The two sets of observations are remarkably close. The researchers measured it on a benchmark; we see it on customer devices. The agreement between the two is what makes this paper so useful for anyone trying to actually manage this spend.
Input tokens dominate agentic spend
The paper’s first finding is that agentic coding consumes around 1,000 times more tokens than equivalent code-chat or code-reasoning tasks, with an input-to-output ratio of roughly 153:1 (against 1.33 for chat and 0.16 for reasoning)².
The reason is structural. Agentic workflows accumulate context across rounds, and the same content is fed back into the model on every single turn. Token caching helps at the margins, but the sheer volume of accumulated context dominates the cost.
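To make the mechanics concrete, here is a back-of-the-envelope sketch of that accumulation. Every per-round number below is invented for illustration; the point is that re-sending the history on each round makes input cost grow quadratically with trajectory length, which is how the input-to-output ratio explodes.

```python
# Back-of-the-envelope model of context accumulation across agent rounds.
# All per-round numbers are invented for illustration.

SYSTEM_PROMPT = 2_000   # tokens sent once at the start of the context
TURN_OUTPUT = 1_200     # tokens the model emits per round
TOOL_RESULT = 6_000     # tokens of tool output appended to context per round
ROUNDS = 40             # a moderately long agentic trajectory

input_total = 0
output_total = 0
context = SYSTEM_PROMPT

for _ in range(ROUNDS):
    input_total += context                 # the whole history is re-sent
    output_total += TURN_OUTPUT
    context += TURN_OUTPUT + TOOL_RESULT   # and it grows every round

print(f"input:  {input_total:,} tokens")              # ~5.7M
print(f"output: {output_total:,} tokens")             # 48,000
print(f"ratio:  {input_total / output_total:.0f}:1")  # ~119:1
```

Caching discounts the re-sent prefix, but it does not change the shape of the curve.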
This is the exact pattern we see in non-agentic AI usage as well. Chat-style usage of Claude, ChatGPT and similar tools follows the same shape because users continue conversations across days rather than starting fresh sessions with explicit context. One customer described it to us this way:
“We think they’re creating PowerPoints, and then they’re like, ‘change this word on slide three’, and then they’re just continuing to generate these really large documents.”
That is the paper’s finding in human form. A chat session that should have been a fresh prompt becomes a thread that re-pays for its entire history on every turn. The user thinks they are making one small edit. The model is being asked to re-process the entire document. The vendor charges accordingly.
The implication is that a massive share of controllable AI cost sits upstream of the model. Better prompts. Fresh sessions. Explicit context provided once, rather than constructed iteratively over an afternoon. The agent’s behaviour is largely a consequence of how it was set up.
Model choice produces order-of-magnitude cost differences
On the 230 SWE-bench tasks that every tested model successfully solved, Kimi-K2 and Claude Sonnet 4.5 used on average 1.5 million more tokens than GPT-5³. Same problems, same correct answers, vastly different token appetites.
The paper is careful to rule out the obvious explanation: the cost gap persists on both the shared-success subset and the shared-failure subset. The more expensive models were not tackling harder problems. They were simply spending more tokens on the same problems.
This matches a behaviour we observe consistently. Users default to whichever model is most prominent in the UI, and “most prominent” typically means most expensive. Opus when Sonnet would have done the job. Vendors have no commercial incentive to route users toward cheaper models. From another customer conversation:
“We definitely know that people are using just all Opus. The people that are using up their tokens, they’ll continue to do that unless there’s a way to control it. We did not know there was a way to control that in Claude. I know there isn’t.”
There is a way to control it, but it doesn’t live in the vendor’s product. The natural place for it is the layer that can see the task category and route at the request level: boilerplate to the leaner model, long-form planning to the heavier one. The Stanford finding that token efficiency is a property of the model rather than the task is precisely what makes routing viable. If heavier models only burned more tokens on harder problems, routing would be useless. They don’t, so it isn’t.
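As a sketch of what request-level routing could look like: the categories, the toy keyword classifier and the model identifiers below are all illustrative placeholders, not how any particular product implements it.

```python
# Minimal sketch of request-level model routing. Categories, the toy
# classifier and the model identifiers are all illustrative placeholders.

CHEAP_MODEL = "sonnet-class"
HEAVY_MODEL = "opus-class"

ROUTES = {
    "boilerplate": CHEAP_MODEL,
    "refactor": CHEAP_MODEL,
    "long_form_planning": HEAVY_MODEL,
    "architecture_review": HEAVY_MODEL,
}

def classify(prompt: str) -> str:
    """Toy keyword heuristic standing in for a real classifier."""
    if any(w in prompt.lower() for w in ("plan", "design", "architecture")):
        return "long_form_planning"
    return "boilerplate"

def route(prompt: str) -> str:
    # Unknown categories fall through to the leaner model by default.
    return ROUTES.get(classify(prompt), CHEAP_MODEL)

assert route("add a getter for this field") == CHEAP_MODEL
assert route("plan the migration to the new schema") == HEAVY_MODEL
```

The design choice that matters is defaulting down, not up: the paper’s finding implies the cheap model is the safe default for most categories.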
Token usage is highly variable and difficult to predict
The paper’s third observation is that four runs of the same model on the same task can produce up to 30x variance in total token cost⁴. The most expensive run on a given problem costs roughly twice the cheapest run on average. As cost goes up, predictability goes down.
More pointedly: the authors test whether agents can predict their own token usage before executing a task. They find correlations of at best 0.39, and all eight models systematically underestimate⁵. Even the agent does not know what a task will cost.
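The spread metric itself is simple to state. A minimal sketch, with invented numbers standing in for the paper’s four runs per model per task:

```python
# Sketch of the run-to-run spread metric: for each task, the ratio of the
# most to the least expensive of its runs. All numbers are invented.

tasks = {
    "task-001": [310_000, 415_000, 980_000, 2_900_000],  # total tokens per run
    "task-002": [120_000, 130_000, 145_000, 160_000],
}

for task_id, run_costs in tasks.items():
    spread = max(run_costs) / min(run_costs)
    print(f"{task_id}: max/min spread {spread:.1f}x")
# task-001: max/min spread 9.4x   <- same model, same task, wildly different cost
# task-002: max/min spread 1.3x
```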
What we see on the customer side is leadership trying to manage spend with the only data available to them. Usually a chat count from a vendor admin dashboard:
“What are these five people doing? They’re always saying they don’t have enough tokens.”
A chat count does not answer this. A token count answers how much was spent, but not why. The “why” is structurally invisible from the invoice. You can only see it at the request layer, where the actual work is observable. No amount of upfront forecasting will close the gap, because the work itself is stochastic.
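What “seeing it at the request layer” means in practice is capturing something like the record below for every call. The schema is our own illustration, not any vendor’s API:

```python
# Sketch of a per-request telemetry record. Field names are illustrative,
# not a real vendor schema.

from dataclasses import dataclass

@dataclass
class RequestRecord:
    user: str
    tool: str            # e.g. "claude-code", "chatgpt-web"
    model: str
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    session_id: str      # lets you spot week-long threads
    turn_index: int      # high values flag accumulated context

def context_heavy(rec: RequestRecord, threshold: int = 100_000) -> bool:
    """Flag turns that re-send a large uncached history."""
    return rec.input_tokens - rec.cached_tokens > threshold
```

With records like this, “why” becomes an aggregation query rather than a guess.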
Higher cost does not deliver higher accuracy
The paper segments runs into cost quartiles and finds that accuracy peaks at the second-cheapest quartile and plateaus from there. The most expensive runs do not deliver better outcomes than modestly priced ones⁶.
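The quartile construction is straightforward. A sketch with invented runs shows the shape the paper reports: accuracy jumps once, then flattens.

```python
# Sketch of the cost-quartile analysis: sort runs by total cost, cut into
# four equal buckets, compare solve rates. The runs list is invented data.

runs = sorted([
    (120_000, True), (150_000, False), (200_000, True), (240_000, True),
    (400_000, True), (520_000, True), (900_000, True), (2_500_000, True),
])  # (total_tokens, solved), ordered by cost

q = len(runs) // 4
for i in range(4):
    bucket = runs[i * q:(i + 1) * q]
    acc = sum(solved for _, solved in bucket) / len(bucket)
    print(f"quartile {i + 1}: accuracy {acc:.0%}")
# quartile 1: accuracy 50%
# quartile 2: accuracy 100%
# quartile 3: accuracy 100%
# quartile 4: accuracy 100%
```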
The authors trace this to a specific behavioural pattern: in the highest-cost quartile, repeated file modifications are roughly 4x more frequent than in the cheapest quartile, and repeated file views are 2x as frequent⁷. The expensive runs are not doing more work. They are doing the same work, again, on the same files.
The paper politely describes this as “unproductive exploration rather than deeper reasoning.” We see the same shape in non-coding AI usage. Repeated regeneration of the same artefact with marginal changes. Long-running sessions where the user disengaged hours ago. Identical prompts re-issued after a typo correction. None of these are agent failures; they are user-driven patterns the agent inherits.
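The repetition signal is also easy to compute from a trajectory log. A minimal sketch, with illustrative action names rather than the paper’s exact taxonomy:

```python
# Sketch of a repetition counter over a trajectory log. Action names are
# illustrative, not the paper's exact taxonomy.

from collections import Counter

def repetition_counts(trajectory: list[tuple[str, str]]) -> dict[str, int]:
    """trajectory is a list of (action, file) pairs, e.g. ("edit", "app.py")."""
    seen = Counter()
    repeats = {"edit": 0, "view": 0}
    for action, path in trajectory:
        if seen[(action, path)] > 0 and action in repeats:
            repeats[action] += 1   # count every revisit of the same file
        seen[(action, path)] += 1
    return repeats

run = [("view", "app.py"), ("edit", "app.py"), ("view", "app.py"),
       ("edit", "app.py"), ("edit", "app.py")]
print(repetition_counts(run))  # {'edit': 2, 'view': 1}
```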
The measurement gap
None of the patterns above can be addressed at scale without measurement at the request layer. Vendor dashboards aggregate by tool and tenant. AI gateways (proxies that sit between you and your AI provider) cover server-side production routing. Engineering effectiveness tools cover narrow coding assistants and stop there.
We built Flowstate because the measurement layer required to actually act on these patterns didn’t exist anywhere in the stack.
Flowstate observes every AI call a user makes, whether it’s ChatGPT in the browser, Claude Code in the terminal, Midjourney for images or Suno for audio, and ties each call back to a user, project, model and cost class⁸. Customers keep their own contracts and their own API keys with every provider they use. We don’t sell access to AI and we don’t restrict which tools people can reach for.
That architectural position has consequences beyond cost measurement. The same instrumentation that surfaces token wastage also surfaces patterns that matter for security. Prompts containing customer PII heading out to a consumer AI tool. Source code pasted into ChatGPT. Employees running side projects on the company subscription. We see these in the field solely because the request layer is the only place they are visible.
The Stanford paper makes a clean economic case from a benchmark. Our observations make the exact same case from real corporate environments. The patterns driving AI cost are large, measurable and consistent. You just need the plumbing to see them.
Footnotes
1. Bai, L., Huang, Z., Wang, X., Sun, J., Mihalcea, R., Brynjolfsson, E., Pentland, A., and Pei, J. (2026). How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks. arXiv:2604.22750v2. The authors acknowledge concurrent work on token distribution in multi-agent systems (Salim et al. 2026; Wang et al. 2025) and pricing dynamics in reasoning models (Chen et al. 2026), but the combination of scale, granularity and open data release makes this paper the most useful one I’ve seen for understanding what agentic spend actually looks like. The authors also publish a project website with the trajectory dataset, an analysis code repo for replicating the figures, and a fun interactive Can You Guess the Token Cost? game that drives the paper’s headline finding home in about thirty seconds.
2. Bai et al., Figure 1. Agentic coding averages 4.17M tokens per task and $1.86 in cost, against 3.39k tokens for code-chat tasks and 1.19k tokens for single-turn code reasoning. The 1,000x figure is the ratio against chat (roughly 1,230x on these averages); against single-turn reasoning it is closer to 3,500x.
3. Bai et al., Figure 6 and Section 4. Section 4 specifically addresses the “harder tasks naturally cost more” objection by showing the gap persists on the shared-success subset (n=230 tasks solved by every tested model). The authors describe the difference as “model-specific behaviour rather than intrinsic task difficulty.”
4. Bai et al., Figure 2a and 2b. Up to 30x variance across instances; on the same task across four runs, the most expensive run costs roughly 2x the cheapest on average.
5. Bai et al., Figure 10 and Figure 11. Best correlation across all eight models is 0.39 (Claude Sonnet 4.5, output tokens). Input-token prediction is uniformly worse than output-token prediction. Every model underestimates systematically; Figure 11 shows predictions clustering well below the diagonal across the board.
6. Bai et al., Figure 3b. Accuracy increases significantly from the cheapest to the second-cheapest quartile, then plateaus. The third and fourth quartiles are not statistically distinguishable from the second.
7. Bai et al., Figure 4 and Appendix A. Mixed-effects regression coefficients of roughly 4x for repeated modifications and 2x for repeated views at the highest-cost quartile, both significant at p < 0.001 against the minimum-cost group, controlling for model identity. Output-token analysis in the appendix shows the same pattern.
8. Flowstate. I co-founded it, so the obvious conflict-of-interest disclosure applies.