Save on token costs using OpenClaw

Why does OpenClaw use so many tokens and how to reduce token useage.

Tim Wappat

28 Mar 2026 • 16 min read

What are you using for LLM? Ive been burning $100 a day
Agents are just greedy for AI tokens, Claude Code, AntiGravity, OpenClaw, its the nature of them. So be aware, "setting it and forgeting it", is dangerous when "it" never stops talking to claude API. I experienced this after the first hours of running OpenClaw for the very first first time. Yes I spent as much in a few hours with OpenClaw, as I do normally in a month on my normal AI subscription. Really, it is that awful, my first experiences with agents really made me realise that optimising token consumption is something to think about.
When you understand that OpenClaw is a state engine for AI and further then when you understand how it works, it then becomes clear why it is such a beast for token consumption. The reason lies in how OpenClaw constructs every single request sent to the model, packed with much more context than your desktop AI experience. Forming an basic understanding of the context, heartbeat and cron processes will help you navigate the costs associated with Claw. Ultimately being aware of how it works, let you run OpenClaw in ways to minimise costly habits or actions.

How Token Consumption Actually Works

A chat based LLM interaction is simple, you send a prompt, the model returns a completion after inference. Input tokens in, output tokens out. Simple AI APIs such as chat completion or Gemini GenerateContent APIs work like this and you get a lot of control over context.

Now using a typical desktop AI interface, a lot more context is added to each request, the history of the chat is bundled in with each subsequent request to create a stateful conversation.

Then we have OpenClaw, which is adding a heap more sources of state and volume of history, to create a stateful engine. This extra context gives us the, interacting with a human employee experience. However this employee is like Dave from finance, who volunteers for everything then bills it as overtime. "I'll check your calendar!" "I'll summarize that PDF!" "I'll think about your problem for 20 minutes!" Cool, Dave. That's £12 worth of thinking! Every time you send a message to your OpenClaw agent, be it a simple question or a problem task, then the system packages up all this content and history in the payload before it ever reaches the model. OpenClaw makes extensive use of markdown format files is used to create persistence of information over time, a ruidimenary memory system (not unlinke my use of this blog - lol). The content of these files is submitted each time a something is processed by the AI engine.

The payload that the context engine builds which then accompanies your requests includes:

System prompt - the hardcoded behavior rules baked into OpenClaw itself rules, tools, skills list, time/runtime, and injected workspace files agents.md, soul.md, memory.md, and any others present in your setup (significant content volume)
Tool calls/results + attachments - command output, file reads, images/audio, etc. (can become very significant)
The full conversation history - your messages + the assistant’s messages for this session, appended in full
Attachments/transcripts (images/audio/files)
Compaction summaries and pruning artifacts
Provider “wrappers” or hidden headers (not visible)

Think about that last point, OpenClaw doesn't summerise or prune your session history automatically. It sends the whole thing, every time. So if you've been running a single session for several days and your agent has executed a dozen tool calls along the way, every new message you send carries all of that accumulated context as input tokens.

What is inside every openclaw request

A simple question like "what's the current status of my work item" might look like a few hundred tokens on the surface. In practice, with a week-old session and a populated memory file, you will be looking at tens of thousands of tokens per message.

This is super important to fully grasp, the heartbeat that gives OpenClaw it life, will beat all through the day, every day, processing the full context on each beat, even if there is nothing much to be done. This alone can cost you many £$ per day. Consider turning off your heartbeat if you don't need it.

Context commands to explore

/status - view + session settings.

/context list - rough sizes (per file + totals).

/context detail - per-file, per-tool schema sizes, per-skill entry sizes, and system prompt size.

/usage tokens - append per-reply usage footer to normal replies.
useage tokens.png
/compact - summarize older history into a compact entry to free window space

OpenClaw harnesses Agentic AI, so it is even worse

This gets even worse for autonomous tasks. When you ask OpenClaw to do something that requires multiple steps say updating its configuration and restarting the gateway service then reporting back status as it goes, this is not one single call to the AI model. Each little step is its own cycle, that sends the full context again and again. A task that you think is like one request can easily be five.

This is fundamentally different from those bare API calls to a chat completions endpoint. Those calls are cheap as you control exactly what goes in the context with the call and by its nature it is minimal. To achieve what OpenClaw does, that OpenClaw power, the autonomy, tool use, persistent memory, that all comes with a significant cost in token consumption to process all that extra context far more frequently that traditional AI use would do.

MD Files that drive the system

As introduced earlier, the power of persistence is granted to OpenClaw, largely via the use of markdown "resource" files. Each agent gets its own workspace (folder) with a set of it's own markdown files. This is what gives agents independant personality, tools etc. These markdown files get included in every request. Understanding what each one does will help keep your costs more controlled. You can see these files in the dashboard, under agents.
Agents (1) >> Select workspace Agent (2) >> Files (3) >> Select Markdown File (4)

You can preview the files using the Preview button.

OpenClaw Agents Soul Tool Indentity User HeartBeat Bootstrap Memory md files

OpenClaw injects the contents of the following files directly into the agent context on creation of a sessions:
AGENTS.md — operating instructions + “memory”
SOUL.md — persona, boundaries, tone
TOOLS.md — user-maintained tool notes for how they are used
BOOTSTRAP.md — one-time (first-run) ritual
IDENTITY.md — agent name/vibe/emoji
USER.md — user profile + preferred address

Compacted version of SKILLS.md is included - but only when needed.

Large files are tructed per-file, but you get the idea, there is a lot going through the context.

Hey, lets look at some of these in more detail,

agents.md

This is the operational overview for each agent, loaded at the start of every session, defining how the agent should behave, how to respond, what tools to prefer, when to spin up other agents, and any task specific instructions that have been added. Good place for rules, priorities, and "how to behave" details.

Because this file is "manually" maintained, it tends to be reasonably stable but it can accumulate redundant or duplicated instructions across sessions. A bloated agents.md adds overhead to every single request.

How to keep efficient: Review this file periodically. Use an AI model to compact it and de-duplicate it. Look for anything that can be shortened without losing meaning. The aim is to convey the meaning with minimal bloat to avoid unnecessary token consumption.

There are however instructions worth adding if not already there:

Limit response length - instruct your agent to respond in one or two paragraphs by default and wait to be asked for more. Long responses consume output tokens and get then get reused as context on the next turn. Think about how long the responses are in desk top AI tools, in that context a few paragraphs of response to every request gives a rich experience, but you want concise short responses for OpenClaw.
Skip commentary - tell your agent not to say, ("Let me find that for you…") etc. This narrative creates extra model cycles in front of the actual action, meaning the context gets processed yet more times.
Use sub-agents for heavy tasks - any significant niche tasks such as, research, coding, or multi-step processing should be handed to a sub-agent with its own session context. That keeps your main session clean and smaller. This is good practice anyway, as it allows the appropriate AI model to be targeted for the nature of the task, or for cheaper alternatives to be used where quality is not as important.

soul.md

Loaded every session for persona, tone, and boundaries. Agent personality and communication style. The tone, character, how it interacts with you. Normally it's small and stable, but it still gets included in every request, so keep it slim and concise.

USER.md

Loaded every session. Who the user is and how to address them.

TOOLS.md

Notes about your local tools and conventions. Does not control tool availability it is for instructions on how to apply and use tools.

HEARTBEAT.md

Optional tiny checklist for heartbeat runs. Super important to keep it short to avoid token burn.

BOOTSTRAP.md

One-time first-run ritual. Only created for a brand-new workspace. Delete it after the ritual is complete.

memory/YYYY-MM-DD.md

Daily memory log (one file per day). Read today + yesterday on session start to give the agent an anchor in time.

memory.md

Curated long-term memory. Only loaded in the main, private session.
This is the one to watch most closely as its the culprit for large token consumption. memory.md is a memory file that your agent appends to with information from past conversations, preferences expressed, things learned about your workflows. Unlike the other files, this one grows continuously and automatically.

After several months of active use, memory.md can become super large. And because it's included in every request, a bloated memory file is one of the most direct ways to silently inflate your per message token cost.

What to do:

Audit it regularly. The same pattern applies as with agents.md, ask a AI model to identify redundancy, outdated entries, or anything that can be condensed and compacted.
Consider scoping memory by context. Rather than one monolithic memory file, you can maintain separate memory files for separate domains, only include the relevant one per session.
For session-specific notes, use a temporary file rather than writing to memory.md. At the end of a session, decide deliberately what actually deserves to persist.
I would highly recommend looking at https://github.com/tobi/qmd, get your OpenClaw to install QMD, it introduces a more efficient & better memory system.
Especially relevant if you don't use OpenAI, Gemini, Voyage, or Mistral APIs.

QMD does not replace the memory.md, but it does replace the memory context section that is provided to the AI engine. QMD uses BM25 keyword search, vector similarity search and LLM re-ranking to check the relevance of the result before using it (local Qwen3 model via node-llama-cpp). QMD uses a light weight local SQL database to get much more specific and smaller context to use for the memory part of the request, avoiding bundling as much from history every single request. The built in memory system does something similar but needs a API key that supports vector search, unlike QMD.

How QMD combines keyword search and vector search using Qwen3 local model to merge and select results.
QMD Lookup using BM25 Vector and Qwen3

While this is perhaps an oversimplification, as the default memory handler does some of this too, but here we see how QMD improves the accuracy and size of the context by doing it better.
How QMD improves efficiency in prompt context

About the default memory handling
SQLite database and needs, no extra dependencies
Keyword search via FTS5 full-text indexing (BM25 scoring).
Vector search via embeddings from any supported provider (OpenAI, Gemini, Voyage, or Mistral, the builtin engine auto-detects it and enables vector search).
Hybrid search that combines both for best results.
CJK support via trigram tokenization for Chinese, Japanese, and Korean.
sqlite-vec acceleration for in-database vector queries (optional).

Consider switching to QMD if you need reranking for higher-quality results, query expansion, or want to index directories outside the workspace such as project docs or notes, fully local search with no API keys that support search, to recall past session conversations.

About QMD
QMD is a local-first search sidecar that runs alongside OpenClaw. It combines BM25, vector search, and reranking in a single binary, and can index content beyond your workspace memory files.

Reranking and query expansion for better recall.
Index extra directories — project docs, team notes, anything on disk.
Index session transcripts — recall earlier conversations.
Fully local — runs via Bun + node-llama-cpp, auto-downloads GGUF models.
Automatic fallback — if QMD is unavailable, OpenClaw falls back to the builtin engine seamlessly.

Honcho is another memory options, it facilitates use of a service that allows agent cross session context that goes beyond the workspace Markdown files. This is a more specialist option beyond what I need to cover for this discussion.

Choosing good memory handling will reduce token conumption directly with token submission efficiency and indirectly by being more accurate, thus requiring less re-submissions when results don't match expectations.

Tracking: If memory.md has grown significantly, open it up and trim it. Do a quick scan to remove obvious bloat.

Session Management

Session hygiene is one of the most impactful things you can do for token efficiency, and it's often overlooked because the costs accumulate gradually and invisibly.

An open session accumulates context indefinitely. Every message, every tool call result, every agent response, all of it gets appended and resubitted on the next turn. A session running for a few days, without a reset is carrying a lot of bloat. Auto-compaction is on by default. It runs when the session nears the context limit, or when the model returns a context-overflow error (in which case OpenClaw compacts and retries), however here we are talking about token use reduction, so take charge and compact where you see opportunity to.

Interestingly you may configure a different model than the default model to handle compaction summaries. This may give better results in some cases, or a cheaper model could be used for compacting if quality is not a concern.

Commands to help manage sessions

OpenClaw exposes a few slash commands that are worth using pro-actively from your telegram session:

/status - Shows you the current session state, which model is active, how many tokens are in the current context window, and your consumption so far. Run this periodically to get a concrete sense of where you stand. If you're pushing 800k tokens in a session that has a 1M limit, you know to act.

/compact - Compresses the current session context without starting a new session. If you are in the midst of a complicated task and don't want to lose continuity, this is the tool to reach for. It can reduce a bloated context to a fraction of its size while preserving the important state.
Older conversation turns are summarized into a compact entry.
The summary is saved in the session transcript.
Recent messages are kept intact.
The full conversation history stays on disk. Compaction only changes what the model sees on the next turn.

/new - Starts a fresh session, clearing the context bloat. What you can do, if you need to preserve continuity, before running this, have your agent write a summary of where things stand to a temporary markdown file, current state, blockers, next steps etc. Start the new session and point the fresh agent at that file. That will allow only the essentials to come through to the next session, much like what you would do on desktop AI.

/model - Switches models mid-session. This is helpful if you wish to target specific models for particular things. Or want to keep cost down by switching to less expensive model for some tasks.

OpenClaw Status New Session Command with in Telegram
Above we can see the session context drop from 9% to 2% on creation of a new session. This also shows how there is an overhead from system like prompts in a session, as the session did not drop to 0%.

Session Pruning

Session pruning is a lighter version than /compact. Pruning trims old tool results from the context right before each LLM call, without touching the on-disk history at all. This contracts with compact that is rewriting session history into a summary, hence much more work.
Tool outputs accumulate, increasing token cost and forcing compacting more often. Every exec call, file read, or status check appends its output to the session context. Without pruning it rides along in the prompt into the future. User and assistant messages are never touched only old tool output results.
The most cost-relevant mode is cache-ttl. When a session goes idle past the TTL window, the next request would re-cache the full context from scratch. Pruning before that first post-idle request reduces what gets re-cached, so you are not paying full cacheWrite costs on everything the session has ever accumulated. Pruning is off by default for non-Anthropic providers, see docs to turn it on.
Compaction summarises and persists, pruning is transient and per-request. openclaw They are complementary, not alternatives. Pruning is the passive background hygiene, compaction is the deliberate heavier reset. Run both.

	Pruning	Compaction
What	Trims tool results	Summarizes conversation
Saved?	No (per-request)	Yes (in transcript)
Scope	Tool results only	Entire conversation

A Practical Session Strategy

Reset frequently for general-purpose conversations. Don't let interactions run for days, its easy to not realise on a long Discord or Telegram chat.
Use /compact for long-running work keeps token use down by reducing context.
Keep cron jobs - cron jobs fire without historical context, which makes them significantly cheaper than heartbeats that fire into an existing session.
Heartbeats A heartbeat firing every 10 minutes into an existing session is 144 executions per day, each carrying the full session context. At even a modest cost per execution, that's real money. Use a smaller model for heartbeats, and consider whether a cron job (which opens a new, clean session) would serve the same purpose. Reducing heartbeat to 40 mins frequency, would reduce executions to 36 per day, also something to consider, depending on how you use OpenClaw.

Model and Cost Optimisation

False economy of cheap models?

A smaller or less capable AI models running agentic workflows that have complex tool calls may frequently fail, keep looping, or produce crappy results, which means more retries, thus consuming more context that means more tokens to process it and so more cost than a capable model. So what I mean here, is don't just look at the cost, think about the value that model brings. Honestly give Claw the best model you can afford, or you are funning the Farrari on water.

Getting the right model for the task, a reasonable structure:

Primary assistant sessions: Sonnet class model, good reasoning, solid tool use, manageable cost
Heartbeats and lightweight monitoring: A smaller model is appropriate, as long as it has fair agentic capability
Complex onboarding, setup, or multi-step reasoning: Opus class for the task, then drop to Sonnet or smaller for follow up

Also remember OpenRouter is an option that gives you a single endpoint and API key with access to models from Anthropic, OpenAI, Google, Mistral, and others. Set it as your base URL in your config, then assign different models to different agents or tasks without juggling credentials.

Seriously consider a local models

Open source local models can also help take some of the burden from the paid for API models. I expect to see more an more use of local models as agentic systems mature and local models get more efficient and capable. At the time of writing models like Qwen3-Coder-Next can be used very effectively for your OpenClaw agents responsible for coding. The model excels at complex tool use, and recovery from execution failures. Using local models leaves you paying for hardware investment and power but not the api costs of a cloud model.

Offload Non AI Work

This is such an important thing to remember when working with AI. Not everything needs a AI LLM, especially repeating logical tasks. Scheduled reports, data retrieval, webhook handling, and conditional logic are all things that can be handled by an automation tool such as n8n, or make.com. These platforms are at near-zero cost, with the model only invoked when actual reasoning is required, or maybe not at all if its just a logical problem. A daily summary report routed through an n8n workflow with a single, minimal AI node will cost a fraction of the same task run directly through your main OpenClaw session.
Like wise use open claw to write a Lamda function or other software endpoints to allow work to be outsourced to that module. This is if AI is not really required for that task.

Take advantage of caching

Some models like Kimi K2.5, MiniMax have built in automatic prefix prompt/context caching, something that can reduce costs by three quarters when prompts hit the cache. This is very effective for claw as it repeats tasks with simliar system prompts a lot of the time, so using a caching model will bring big savings to those calls and is automatic for these models.
Prompt cache, Anthropic give you 90% discount for processing cached tokens vs new ones and Google and Kimi gives 75% discount. Antrhropic allows the cache expiry to be controlled, the default time for for cache expiry is 5 mins but we can set that that to 1 hour for, for say Heartbeats, which otherwise are high consumers of tokens. Check your AI model for caching features and implement them appropriately for big saving, and take the effect of cache into account when comparing which models to use, as cache as a feature has big financial impact.

Set Spending Limits to avoid shocks

OpenClaw lets you set credit limits at the API key level, so go set a limit for your budget, this prevents runaway spend from a misconfigured heartbeat or an unexpectedly expensive task. It is also wise to do this on your model providers too, look at what limits can be set in the account settings there.

Summary

OpenClaw is greedy, hey but so am I when there is free lunch at a conference, we understand each other. OpenClaw's token consumption cost is high because it is genuinely doing more than a simple API call, that is the whole point of the product! It's maintaining context, executing multi-step tool workflows, and managing persistent memory across sessions.

Keep your MD files lean. Audit agents.md, soul.md, and especially memory.md regularly. Consider one of the many alternative memory solutions.
Reset sessions. Use /compact when you need continuity, /new when you don't.
Match models to tasks. Use capable models where reasoning matters, then drop back to cheaper ones for routine work.
Offload non-AI work. Automation tools can handle scheduling, data retrieval, and conditional logic for free.
Set limits. Use a hard spending cap at the API key level is the simplest safeguard against unexpected cost shocks.