Why This Matters
We run an AI agent on OpenClaw 24/7. We're on Anthropic's Claude Max plan and we hit 92% of our weekly usage cap with two days left on the clock. One developer documented spending $500 in a single month on API billing [7].
The cause wasn't the model — it was how we were using it. Claude Opus for everything: main conversation, heartbeat checks, background research, social media monitoring. Same model, same cost, regardless of whether the task needed frontier reasoning or a yes/no check.
But when we dug in, we realized model selection was only part of the problem. The bigger issue was that we didn't understand how OpenClaw actually uses tokens. Once we did, the optimization strategy became obvious.
How OpenClaw Uses Your Tokens
Every time OpenClaw calls a model — whether it's your message, a heartbeat, or a sub-agent task — it sends two things: the system prompt and the conversation history. Understanding both is the key to optimization.
The system prompt
OpenClaw builds a custom system prompt for every agent run [5]. It's not a one-liner — it includes:
- Tool list + JSON schemas — every tool the agent can use, with full parameter schemas. The more tools you enable, the bigger this gets.
- Skills list — metadata for each installed skill (name, description, file path). Skill instructions are loaded on-demand when needed, but the list itself is always present.
- Workspace bootstrap files — AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, HEARTBEAT.md, and MEMORY.md are injected verbatim into every main session turn (each truncated at 20,000 chars).
- System instructions — safety guardrails, reply format, heartbeat behavior, runtime metadata.
This is fixed overhead — it doesn't change between turns but it gets sent every single time. You can inspect exactly what yours looks like and how much each piece costs with /context list (summary) or /context detail (full breakdown by file, tool schema, and skill entry).
Conversation history: the part that grows
The system prompt stays constant. What grows is the conversation history: every message you send, every response, and every tool call result accumulates in the context window.
On your first message, the model sees just the system prompt plus your message. By turn 50, it sees the system prompt plus every exchange that came before. By turn 100+, you're sending a massive payload on every single turn — and paying for all of it.
This is the real cost driver. A long conversation on a cheaper model costs more than a fresh conversation on an expensive one. Context grows with every exchange, and you pay for the full history every turn.
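To see why history dominates, here's a back-of-the-envelope sketch. All token counts are illustrative assumptions, not OpenClaw measurements:

```python
# Rough model of per-turn input cost for a chat agent.
# Assumed numbers: a 10k-token system prompt and ~500 tokens
# of new conversation history added per exchange.
SYSTEM_PROMPT = 10_000
TOKENS_PER_EXCHANGE = 500

def input_tokens_at_turn(turn: int) -> int:
    """Tokens sent on a given turn: fixed prompt + all prior history."""
    return SYSTEM_PROMPT + (turn - 1) * TOKENS_PER_EXCHANGE

def cumulative_input_tokens(turns: int) -> int:
    """Total input tokens billed across an entire conversation."""
    return sum(input_tokens_at_turn(t) for t in range(1, turns + 1))

print(input_tokens_at_turn(1))    # 10000: just the system prompt
print(input_tokens_at_turn(100))  # 59500: history now dwarfs the prompt
```

Because every turn resends everything before it, total billed input grows roughly quadratically with conversation length: under these assumptions, a 100-turn session bills about 3.5M input tokens in total, not the ~60k a single turn suggests.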
How OpenClaw manages growing context
OpenClaw has two built-in mechanisms to keep context from spiraling out of control [5]:
Compaction — When a session nears the model's context window limit, OpenClaw auto-compacts: it summarizes older conversation into a compact entry and keeps only recent messages intact. Before compaction runs, a silent memory flush reminds the agent to write any important context to disk (memory/YYYY-MM-DD.md) so durable facts survive even though the full conversation doesn't. You can also trigger this manually with /compact.
Session pruning — Separate from compaction, OpenClaw can trim old tool results (file reads, command output, web fetches) from the context before each model call. Tool results are often the largest items in context. Pruning removes or truncates them from older turns while keeping recent ones, without rewriting your session history.
Both happen automatically. But you can make them work better by understanding the next piece.
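As a rough illustration of what pruning does (a hand-rolled sketch of the described behavior, not OpenClaw's actual implementation), imagine trimming bulky tool results from older turns while leaving recent ones intact:

```python
# Illustrative sketch of session pruning: truncate large tool results
# in older turns, keep the most recent N turns untouched.
# This mimics the behavior described above; it is not OpenClaw's code.
def prune(messages: list[dict], keep_recent: int = 4, max_len: int = 200) -> list[dict]:
    pruned = []
    cutoff = len(messages) - keep_recent
    for i, msg in enumerate(messages):
        if i < cutoff and msg["role"] == "tool" and len(msg["content"]) > max_len:
            # Copy the message so the stored session history is not rewritten.
            msg = {**msg, "content": msg["content"][:max_len] + " …[truncated]"}
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "read the config"},
    {"role": "tool", "content": "x" * 5000},   # old, bulky file read
    {"role": "assistant", "content": "done"},
    {"role": "user", "content": "now deploy"},
    {"role": "tool", "content": "y" * 5000},   # recent, kept intact
    {"role": "assistant", "content": "deployed"},
]
slim = prune(history, keep_recent=4)
```

Note the copy-on-truncate: the pruned list is what goes to the model, while the original session history stays as written, matching the "without rewriting your session history" behavior described above.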
Memory: on-demand vs. always-on
OpenClaw gives the agent information two ways:
- Workspace files — injected into the system prompt, sent on every turn
- Memory files (MEMORY.md, memory/*.md) — stored on disk, retrieved via semantic search only when relevant
The memory system builds a vector index over your memory files. When the agent needs to recall something, memory_search finds relevant snippets semantically — even when the wording differs — and pulls only what's needed into context [5].
This is the single most actionable optimization. Anything that doesn't need to be in every turn should live in memory/, not in workspace files. Keep AGENTS.md to core rules and behavioral essentials, and SOUL.md to personality essentials. Detailed reference material, research notes, project history — all of that belongs in memory/ where it's available on demand but not burning tokens on every heartbeat.
Rule of thumb: If you reference something less than once per session, it belongs in memory/, not in a workspace file.
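The payoff of that rule is easy to quantify. A rough sketch, where every number is an illustrative assumption:

```python
# Cost of keeping a 3k-token reference doc in a workspace file
# (sent every turn) vs. in memory/ (retrieved only when relevant).
# All numbers are illustrative assumptions, not measurements.
DOC_TOKENS = 3_000
TURNS_PER_DAY = 60          # messages + heartbeats + tool round-trips
RETRIEVALS_PER_DAY = 2      # times the doc is actually needed

workspace_cost = DOC_TOKENS * TURNS_PER_DAY       # injected every turn
memory_cost = DOC_TOKENS * RETRIEVALS_PER_DAY     # pulled in on demand

print(workspace_cost)  # 180000 tokens/day
print(memory_cost)     # 6000 tokens/day
```

Under these assumptions the same document is 30x cheaper living in memory/, and the gap widens the busier your agent is.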
Where the Waste Hides
An always-on OpenClaw agent has four token consumption channels. They're not equal — and the biggest one probably isn't what you think.
Main conversation
Your direct chats. Full system prompt + conversation history + tool calls. This is where you need intelligence: complex reasoning, creative work, nuanced decisions. This earns its cost.
Heartbeats (the biggest consumer)
OpenClaw polls the agent every 30-60 minutes in the main session [5]. Each heartbeat sends the full system prompt plus the entire conversation history. Even a simple "HEARTBEAT_OK" response requires the model to process all of it.
If your HEARTBEAT.md triggers real work — social media engagement, job processing, multi-step decisions — each heartbeat generates even more tokens through tool calls and responses. With heartbeats running all day, this is almost always your largest token consumer.
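To put numbers on it (again, illustrative assumptions rather than measurements):

```python
# Daily heartbeat token load: each beat resends the system prompt plus
# the full conversation history, even when the reply is one line.
# Assumed numbers for a mature main session.
SYSTEM_PROMPT = 10_000
HISTORY = 40_000            # accumulated conversation history
BEATS_PER_DAY = 24          # one heartbeat per hour

daily_heartbeat_input = BEATS_PER_DAY * (SYSTEM_PROMPT + HISTORY)
print(daily_heartbeat_input)  # 1200000 input tokens/day, before any real work
```

That's over a million input tokens a day just to ask "anything need attention?", which is why heartbeat model choice and interval are such high-leverage settings.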
The critical detail: the heartbeat model does ALL the work. Whatever your HEARTBEAT.md asks for, that model handles it. If your checklist involves social media engagement, code review, or tool-heavy workflows, you need a capable model. Don't set heartbeat.model to a budget model and expect it to manage complex tasks — you'll get garbage output and wasted tokens. Match the model to the complexity of what you're asking it to do.
Sub-agents
Background tasks in isolated sessions. They get a minimal system prompt (only AGENTS.md + TOOLS.md, no skills, no identity files, no heartbeat instructions) and fresh context with no conversation history. Inherently cheaper per-turn. But without a model override, they inherit your primary model — so a simple research task on Opus burns frontier tokens unnecessarily.
Cron jobs
Scheduled isolated sessions for social media posting, monitoring, periodic checks. Like sub-agents, they run in isolation with fresh context. OpenClaw supports per-job model overrides, making these the easiest place to use a cheaper model.
The Optimization Playbook
Now that you understand how the system works, here's what to do about it. In order of impact:
1. Keep your workspace files slim
Every character in AGENTS.md, SOUL.md, TOOLS.md, IDENTITY.md, USER.md, and HEARTBEAT.md gets sent on every main session turn. Trim aggressively. Move detailed content to memory/ files where it's available via semantic search but not costing you on every heartbeat.
We had a massive model comparison table in our AGENTS.md — detailed pricing, fallback chains, decision frameworks. We replaced it with three lines. The detailed research lives in a memory file where it's searchable but not injected into every turn.
Run /context list to see exactly how much each file contributes. If something is getting truncated (hitting the 20,000 char limit), that file is way too big.
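If you want to audit sizes outside OpenClaw too, a few lines of Python do it. The 20,000-char limit matches the truncation threshold described above; the file list is an assumption about a standard workspace layout:

```python
# Flag workspace bootstrap files approaching the 20,000-char
# truncation limit. The file list assumes the standard layout.
from pathlib import Path

LIMIT = 20_000  # matches the truncation threshold described above

def over_budget(n_chars: int, limit: int = LIMIT) -> bool:
    """True when a file is within 20% of, or over, the limit."""
    return n_chars > 0.8 * limit

def audit(workspace: Path) -> dict[str, int]:
    """Print and return sizes of the bootstrap files that exist."""
    names = ["AGENTS.md", "SOUL.md", "TOOLS.md",
             "IDENTITY.md", "USER.md", "HEARTBEAT.md"]
    sizes = {n: len((workspace / n).read_text())
             for n in names if (workspace / n).exists()}
    for name, n in sizes.items():
        warn = "  <- nearing truncation!" if over_budget(n) else ""
        print(f"{name}: {n:,} chars{warn}")
    return sizes
```

Run `audit(Path("~/.openclaw/workspace").expanduser())` against wherever your workspace actually lives (the path here is a guess).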
2. Set heartbeat and sub-agent model overrides
This is the highest-impact config change. Without overrides, everything runs on your primary model:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["anthropic/claude-sonnet-4-5"]
      },
      "heartbeat": {
        "every": "60m",
        "model": "anthropic/claude-sonnet-4-5"
      },
      "subagents": {
        "model": { "primary": "openai-codex/gpt-5.1-codex-mini" }
      }
    }
  }
}
```
Heartbeat model — Choose based on what your HEARTBEAT.md actually does. Simple triage ("anything need attention?") → Haiku is fine. Real work (social engagement, job processing, multi-step tasks) → Sonnet or better. Don't cheap out here if your heartbeat does real work.
Sub-agent model — GPT Codex Mini is included in a ChatGPT subscription at no extra per-token cost. It handles research, analysis, and code tasks well. For complex sub-agent work, use GPT Codex (also subscription-included, just burns quota faster).
Cron job model — Set per-job with "model" in the payload. Scheduled, predictable tasks usually work fine with Haiku:
```json
{
  "payload": {
    "kind": "agentTurn",
    "message": "Check social notifications and respond",
    "model": "anthropic/claude-haiku-4-5"
  }
}
```
3. Use /model for on-the-fly switching
In chat, type /model sonnet to drop to Sonnet for simple tasks, /model opus when you need depth. No config change needed — it switches for the current session.
4. Enable prompt caching (API key users)
If you're on API key billing (not a subscription), Anthropic offers prompt caching that dramatically reduces the cost of the repeated system prompt [2]. OpenClaw automatically applies 5-minute caching for API key auth. You can extend to 1-hour:
```json
{
  "agents": {
    "defaults": {
      "models": {
        "anthropic/claude-opus-4-6": {
          "params": { "cacheRetention": "long" }
        }
      },
      "heartbeat": { "every": "55m" }
    }
  }
}
```
Setting heartbeats to 55 minutes keeps the 1-hour cache warm: each cycle reads the cached system prompt at the discounted cache-read rate instead of paying full input price to reprocess it from scratch.
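Here's what that buys, using the multiplier structure of Anthropic's published cache pricing at the time of writing (cache reads around 0.1x the base input price, 1-hour cache writes around 2x; verify exact current rates against [2]):

```python
# Per-day comparison for a cached vs. uncached system prompt,
# expressed in token-equivalents at the base input price.
# Multipliers follow Anthropic's published cache pricing structure;
# check current pricing [2] before relying on exact numbers.
PROMPT_TOKENS = 10_000
BASE = 1.0          # normal input token, relative cost
CACHE_WRITE = 2.0   # 1-hour cache write multiplier
CACHE_READ = 0.1    # cache read multiplier

uncached_per_beat = PROMPT_TOKENS * BASE
first_beat = PROMPT_TOKENS * CACHE_WRITE   # writes the cache once
warm_beat = PROMPT_TOKENS * CACHE_READ     # every 55-min beat after

daily_uncached = 24 * uncached_per_beat
daily_cached = first_beat + 23 * warm_beat
print(daily_uncached, daily_cached)
```

Under these assumptions the cached setup spends roughly a fifth of the uncached cost on the system prompt alone, and the gap grows with a bigger prompt.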
Note: Prompt caching is not available on subscription auth (Claude Max, etc.) [2]. Subscription users optimize through model tiering and slim workspace files instead.
5. Understand your billing model
Subscription (Claude Max, ChatGPT): Flat rate, usage cap resets weekly. You're optimizing against that cap. Model tiering is your main lever — every heartbeat and sub-agent that runs on Opus instead of a cheaper model eats into the same cap as your actual work. A ChatGPT subscription also gives you Codex/Codex Mini for sub-agents at no extra per-token cost.
API key (pay-per-token): You pay exactly what you use. Model tiering plus prompt caching give you the most optimization levers.
Hybrid approach: Use a Claude subscription for main conversation (Opus without metering anxiety) and a ChatGPT subscription for sub-agents (Codex Mini, included). OpenClaw supports multiple auth profiles and fallback chains, so sub-agents route to Codex automatically while your main chat stays on Claude.
The Cheap Model Trap
After all this optimization research, we made what seemed like the obvious next move: try a cheaper model. Kimi K2.5 by Moonshot AI looked great on paper — mid-tier pricing, ranks #15 overall on Canary Arena [11], and independent reviewers praised its coding abilities [1][8].
We tried it as our primary model. Within one session, it hallucinated — confidently generating nonsensical output about "warping" a website that didn't exist. Not a subtle degradation. A hard failure.
The lesson: benchmarks don't measure production reliability. A model that's 90% as capable but hallucinates 5% of the time is worse than one that costs more but works every time. For agent work — where outputs chain into actions that chain into more actions — one hallucination cascades into wasted time and broken state.
Our updated strategy: Stick with frontier models from providers with proven track records (Anthropic, OpenAI, Google). Control cost through subscriptions and model tiering, not by swapping to cheaper alternatives. OpenRouter is still valuable for unique capabilities (extreme context windows, specific fine-tunes), but not as a cost-cutting default.
For current model rankings, check Canary Arena [11] or Artificial Analysis. For pricing, check your provider directly [2][3][4].
The Bottom Line
After a week of running a 24/7 AI agent and hitting our usage cap hard, here's what actually matters:
Context size is the real cost driver. Not model selection. Every turn sends the full conversation history plus the system prompt. Keeping your workspace files slim and using memory/ for reference material has more impact than any model swap.
Match the model to the task, not the other way around. Opus for main conversation. Sonnet for active heartbeats that do real work. Codex Mini for sub-agents (subscription-included). Haiku for simple cron jobs. Use /context list to see what you're paying for.
Don't cheap out on reliability. Frontier models earn their cost. Cheaper alternatives may benchmark well but hallucinate in production. For agents, reliability matters more than per-token price.
Understand how OpenClaw works. Compaction, pruning, memory search — these exist to help you. Slim workspace files, move detail to memory/, let compaction and pruning handle context growth, and set model overrides for heartbeats and sub-agents.
The biggest optimization isn't a config change. It's understanding that every token in your workspace files gets sent on every turn, and every turn of conversation history compounds the cost. Once you see that, the rest follows naturally.
Sources
[1] Don't Worry About the Vase — Kimi K2.5 Analysis (Zvi Mowshowitz, February 2026)
[2] Anthropic API Pricing (February 2026)
[3] OpenAI API Pricing (February 2026)
[4] Google Gemini API Pricing (February 2026)
[5] OpenClaw Documentation — System Prompt, Compaction, Session Pruning, Memory (v2026.2.6)
[6] MetaCTO: Anthropic Claude API Pricing 2026
[7] DEV Community: I Tried OpenClaw — My $500 Reality Check
[8] Analytics Vidhya: Is Kimi K2.5 the Best Open-source Model of 2026?
[9] Codecademy: Kimi K2.5 Complete Guide — API Pricing
[10] AI Free API: OpenClaw + Kimi K2.5 Complete Guide
[11] Canary Arena — AI Model Leaderboard (February 2026)
