So you’re mid-task and Claude stops you with some version of a context length exceeded error. It happens in the claude.ai chat window, it happens in API integrations, and it happens in Claude Code during a long session — and each one means something slightly different. I’ve hit all three versions of this, and the fix is almost never “just use a bigger model,” even though that’s the first thing people suggest.
Quick Answer
- In claude.ai chat: start a new conversation, or break a huge first message into smaller chunks
- In the API: check your max_tokens value, since it counts against the limit even before generation starts
- In Claude Code: run /compact manually instead of waiting for auto-compact to kick in
- Use the token counting API to check size before sending, rather than guessing
- If this happens constantly on the same task, the structure of the task is the actual problem, not the model
Why This Error Happens
The core idea is the same everywhere: every model has a context window, a fixed amount of “space” for input and output combined. But how that limit gets enforced, and what error you actually see, depends on where you’re working.
You’re including more than you think you are. In long conversations, every previous turn stays in the context — your messages, Claude’s responses, attached files, everything. People remember the file they uploaded but forget the fifteen back-and-forth messages around it.
max_tokens reserves space whether you use it or not. This is the one that trips up API users constantly, and it’s genuinely easy to miss. Your input tokens plus your requested max_tokens get checked against the context limit together — so if you set max_tokens to something huge “just in case,” you can hit the error even with a fairly short prompt. From what I’ve seen, this is the single most overlooked cause in API integrations. People stare at their input token count, confirm it’s nowhere near the limit, and don’t think to check the output budget they reserved.
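The arithmetic behind that check is easy to sketch. This is only an illustration: the function name and the 200,000-token window are assumptions chosen for the example, and treating an exact fit as acceptable is my guess about the boundary, not something the API documentation is quoted on here.

```python
def fits_in_window(input_tokens: int, max_tokens: int, context_window: int) -> bool:
    """Return True when input plus the reserved output budget fits the window."""
    return input_tokens + max_tokens <= context_window


# Illustrative numbers: a short prompt paired with an oversized output reservation.
CONTEXT_WINDOW = 200_000  # assumed window size for this example

print(fits_in_window(20_000, 190_000, CONTEXT_WINDOW))  # 210,000 total: does not fit
print(fits_in_window(20_000, 8_000, CONTEXT_WINDOW))    # 28,000 total: fits
```

The first call fails even though the prompt itself uses only a tenth of the window, which is exactly the situation that sends people hunting through their input for a problem that lives in the output budget.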
Tool results and file content pile up fast in agentic workflows. A single large file read or a verbose tool output can eat a surprising chunk of the window. Claude Code sessions especially run into this — one big grep result or log dump, and you’re suddenly much closer to the ceiling than the conversation length would suggest.
Extended thinking and multi-turn tool use add their own accounting. Thinking tokens count toward the window during the turn they’re generated, even though previous thinking blocks get stripped out automatically before the next turn. It’s not always obvious from the outside how much of your budget thinking is using up at any given moment.
And one more that’s easy to forget: different models simply have different limits. Most Claude models top out at 200,000 tokens of context, but several of the newer models support a 1-million-token window on the API, Bedrock, and Vertex AI. Mixing models across a pipeline without checking which is which is a quiet way to get inconsistent errors.
Common Scenarios
This shows up differently depending on where you’re working:
- claude.ai chat: you’ll see a message saying your message will exceed the length limit and suggesting fewer or smaller attachments, or a new conversation
- Claude API / SDK integrations: older models return a 400 invalid_request_error with an exact token breakdown; newer models accept the request and the response simply stops early with a model_context_window_exceeded stop reason
- Claude Code sessions: long coding sessions, especially ones touching big files or running verbose commands, trigger auto-compaction. If a single file is so large that context refills right after each summary, Claude Code stops trying and shows an error instead of looping forever
- Mobile and desktop apps: functionally the same as claude.ai chat, since they run the same conversation limits
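Because the API can signal the problem in two different ways, an integration needs to look at both the HTTP status and the stop reason. Here is a rough sketch of that triage over plain dictionaries. The function name and the returned labels are made up for illustration, and the 429 branch is included only to keep rate limits visibly separate. A 400 invalid_request_error can also mean other kinds of bad request, so the error message still needs reading.

```python
def classify_outcome(status_code: int, body: dict) -> str:
    """Roughly sort an API result into the cases described above.

    `body` is the parsed JSON response. The labels returned here are
    illustrative names, not values defined by the API.
    """
    error_type = body.get("error", {}).get("type")
    if status_code == 400 and error_type == "invalid_request_error":
        # Older models: request rejected up front. Read the message to
        # confirm it is really about context length.
        return "invalid_request"
    if status_code == 429:
        # A rate limit, which is a different problem from context length.
        return "rate_limited"
    if status_code == 200 and body.get("stop_reason") == "model_context_window_exceeded":
        # Newer models: request accepted, output cut short.
        return "truncated_by_context"
    return "other"
```

The "truncated_by_context" case is the one that is easy to miss, because nothing raises: the response looks successful unless the stop reason is checked.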
Comparison: Where You’re Hitting It
| Environment | What you see | First thing to check | Typical fix |
|---|---|---|---|
| claude.ai chat | “Your message will exceed the length limit…” | Size and number of attached files | Shorten the message, or start a new conversation |
| Claude API (older models) | 400 invalid_request_error with token math | Your max_tokens setting vs actual need | Lower max_tokens, trim input |
| Claude API (newer models) | Response stops with model_context_window_exceeded | Total input + max_tokens vs model’s window | Token counting API before sending, or enable compaction |
| Claude Code | Auto-compaction, or a thrashing error on repeat | Run /context to see what’s eating space | Manual /compact, or /clear between unrelated tasks |
Step-by-Step Fixes
Step 1: Figure out which environment you’re actually in
Sounds obvious, but the fix is completely different depending on whether you’re in the chat app, calling the API directly, or running Claude Code. Don’t apply an API fix to a chat problem.
Step 2: For claude.ai — shrink the message, not the conversation
If you’re on a paid plan with code execution enabled, Claude already summarizes older messages automatically as you approach the limit, and your full history stays available for reference even after that happens. So this error mostly shows up on a single very large first message. Break large pastes or documents into smaller pieces, or ask Claude to work from a summary first.
Step 3: For the API — check max_tokens before you check input length
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    # Before sending, count the input tokens (your_prompt is the text you plan to send)
    response = client.messages.count_tokens(
        model="claude-sonnet-4-6",
        messages=[{"role": "user", "content": your_prompt}],
    )
    print(response.input_tokens)

If your input tokens plus max_tokens add up past the model’s window, you’ll get an error or an early stop, depending on the model generation. Lower max_tokens to something realistic for the task rather than defaulting to the maximum.
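Once you have an input token count, one way to guard against overflow is to cap the output budget at whatever space is left. The helper below is a sketch under my own assumptions: the name, the optional safety margin, and the choice to raise when nothing is left are all illustrative. It only prevents the overflow; picking a realistic value for the task is still your job.

```python
def clamp_max_tokens(
    input_tokens: int,
    desired_max_tokens: int,
    context_window: int,
    safety_margin: int = 0,
) -> int:
    """Cap the output budget so input + max_tokens stays inside the window."""
    remaining = context_window - input_tokens - safety_margin
    if remaining <= 0:
        # The input alone fills the window; trimming input is the only fix.
        raise ValueError("input leaves no room for output in this context window")
    return min(desired_max_tokens, remaining)


# A 150,000-token input in a 200,000-token window leaves 50,000 for output.
print(clamp_max_tokens(150_000, 64_000, 200_000))
```

Feeding the result of the token count into this kind of helper before each request turns a runtime failure into a decision made up front.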
Step 4: For long-running API conversations, look at compaction and context editing
Server-side compaction automatically condenses older parts of a conversation once you cross a token threshold you set. It’s currently in beta and needs an explicit beta header to enable. For more surgical control, context editing lets you clear old tool results or thinking blocks instead of summarizing the whole history.
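To make the idea of clearing old tool results concrete, here is a client-side approximation. To be clear, this is not the server-side context editing feature and does not use its configuration. It is a hand-rolled stand-in that blanks out older tool_result blocks in a message list before sending. The function name, the placeholder text, and the keep_last parameter are all my own inventions.

```python
import copy

# Hypothetical placeholder text; any short marker would do.
PLACEHOLDER = "[older tool result cleared to save context]"


def clear_old_tool_results(messages: list, keep_last: int = 2) -> list:
    """Return a copy of `messages` with all but the newest tool results blanked."""
    trimmed = copy.deepcopy(messages)  # leave the caller's history untouched
    # Gather every tool_result block in conversation order.
    blocks = [
        block
        for message in trimmed
        if isinstance(message.get("content"), list)
        for block in message["content"]
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    cutoff = max(len(blocks) - keep_last, 0)
    for block in blocks[:cutoff]:
        block["content"] = PLACEHOLDER
    return trimmed
```

The tool_use_id on each block is left in place, so every tool call in the history still has a matching result; only the bulky content is dropped.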
Step 5: For Claude Code — stop waiting for the warning
Run /context to see what’s actually consuming space — files, tool definitions, conversation history, all broken out separately. If one MCP server’s tool definitions are taking a big chunk, disable it with /mcp when you’re not using it. Run /compact proactively at natural breakpoints instead of letting it trigger automatically mid-task.
Step 6: If it keeps happening on the same task, restructure the task
This is the fix people skip. If you’re constantly bumping the ceiling on one ongoing project, the actual fix is breaking the work into separate sessions with files or CLAUDE.md carrying the important state between them, not finding a bigger window.
What Actually Worked For Me
I’ll admit my first instinct was the wrong one. I was deep into a Claude Code session refactoring a fairly large codebase, hit the context warning, and just ran /clear without thinking — which wipes everything, no summary, nothing carried forward. That cost me a chunk of the session’s working context and I had to re-explain half the plan.
The thing that actually worked, and it came from something I’d half-remembered from a teammate mentioning it weeks earlier, was running /compact instead, with a short instruction telling it specifically what to keep — the file list and the architectural decisions we’d made, not the back-and-forth that got us there. That preserved the part that mattered and dropped the noise. I also started putting the genuinely critical rules into CLAUDE.md from then on, so they survive compaction regardless of what gets summarized away.
Not every case is this clean, though. On a separate API integration, the fix was almost embarrassingly simple — someone had copy-pasted an example with max_tokens set to a huge number “for safety” and never adjusted it. Lowering that one value fixed an error that looked, from the outside, like a much bigger problem.
Advanced Fixes and Edge Cases
Auto-compaction thrashing. If a single file or tool output is large enough that context fills back up immediately after each compaction summary, Claude Code stops trying after a few attempts rather than looping forever, and shows an error instead. The fix usually means avoiding loading that specific file or output wholesale — read specific sections instead of the whole thing.
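If you control the tooling that feeds files into context, reading a slice instead of the whole file can look something like this. It is a minimal sketch: the function name and the 1-indexed, inclusive convention are my own choices, not anything Claude Code exposes.

```python
from itertools import islice


def read_line_range(path: str, start: int, end: int) -> list:
    """Return lines start..end (1-indexed, inclusive) from a text file.

    The file is streamed, so lines after `end` are never read into memory.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    with open(path, encoding="utf-8") as handle:
        return list(islice(handle, start - 1, end))
```

Calling read_line_range("app.log", 1200, 1260) on a hypothetical log file would pull in about sixty lines rather than the entire file, which is usually enough to reason about one error.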
Extended thinking with tool use. If you’re combining extended thinking with tool calls, the entire thinking block tied to a specific tool request has to be returned unmodified, including its signature. Modifying it breaks signature verification, and the API returns an error that has nothing to do with raw token count but is easily mistaken for a context problem.
Mixed-model pipelines. If different steps of a pipeline use different Claude models, check each one’s context window individually rather than assuming they match. Some current models support up to 1 million tokens on the API, Bedrock, and Vertex AI, while most others sit at 200,000 — assuming the bigger number across the board is a quiet source of intermittent failures.
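A per-step check along these lines catches the mismatch before anything is sent. The model names below are placeholders rather than real model identifiers, and the window sizes are just the two figures mentioned above. Look up the real values for the models you actually use.

```python
# Placeholder model names mapped to assumed context windows (in tokens).
CONTEXT_WINDOWS = {
    "small-window-model": 200_000,
    "large-window-model": 1_000_000,
}


def steps_over_budget(steps: list, windows: dict) -> list:
    """Return the names of pipeline steps whose input + max_tokens exceed
    the window of the model assigned to that step.

    Each step is a tuple: (step_name, model, input_tokens, max_tokens).
    """
    over = []
    for name, model, input_tokens, max_tokens in steps:
        if input_tokens + max_tokens > windows[model]:
            over.append(name)
    return over


pipeline = [
    ("summarize", "large-window-model", 600_000, 8_000),
    ("classify", "small-window-model", 600_000, 1_000),  # same input, smaller window
]
print(steps_over_budget(pipeline, CONTEXT_WINDOWS))
```

Here the same 600,000-token input passes on one step and fails on the other, which is how these failures end up looking intermittent.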
Diagnosing in Claude Code specifically. /context gives a breakdown of what’s actually using space, and /mcp shows per-server tool definition costs — genuinely useful before assuming the conversation itself is the problem.
Prevention Tips
- Set max_tokens to what the task actually needs, not the model’s maximum
- Use the token counting API before sending in production, rather than discovering the limit at runtime
- In Claude Code, run /compact at logical breakpoints instead of waiting for the automatic trigger
- Keep durable, important instructions in CLAUDE.md rather than relying on them surviving in chat history
- Don’t load entire large files or logs into context when a specific section would do
Frequently Asked Questions
Does switching to a bigger context window model fix this for good? Sometimes. It’s the most commonly recommended fix, but it often solves nothing if the real cause is an oversized max_tokens setting or one giant tool output; you’ll just hit the same wall later with more padding around it.
Will starting a new conversation lose everything? In claude.ai, your full chat history sticks around for reference even after older parts get summarized. In Claude Code, /clear does wipe it, but /compact doesn’t.
Why did this happen on a message that wasn’t even that long? Check what’s attached or what got pulled in by tools — a short message with a large file attached, or a long tool result behind the scenes, adds up fast.
Is this the same as a rate limit error? No. Rate limits are about how many requests or tokens you can send over time. Context length is about how much fits in one conversation at once. They look similar in a panic but they’re unrelated problems.
Editor’s Opinion
The max_tokens thing genuinely got me the first time. It’s not intuitive that a number you set for output space counts against you before anything even gets generated. Claude Code’s /compact is solid once you stop treating it like an emergency button and start using it on purpose. Honestly, the chat app version of this error is the easiest one to deal with; it’s the agentic and API stuff where it actually requires some thought.
