Claude Code, llama.cpp, and the Hidden Prompt Cache Killer

I was running Claude Code through a local llama.cpp server with a large Qwen model for vibe coding. The setup was powerful enough: RTX 4090, long context, MTP enabled, prompt cache enabled, and a model capable of handling very large requests.

But something was wrong.

Every few requests, llama.cpp kept showing this warning:

forcing full prompt re-processing due to lack of cache data

That meant llama.cpp was not reusing the previous prompt/KV cache properly. Instead of continuing from an existing cached context, it was often reprocessing tens of thousands of tokens again.

For small chats, this is annoying.

For vibe coding, it is painful.

Claude Code can send very large prompts: system instructions, repository context, file contents, tool results, previous messages, and the current task. In my case, requests were often tens of thousands of tokens, and sometimes much larger. Reprocessing that much context again and again kills the experience.

On this page

The misleading part
The real cause
The fix
My current llama.cpp settings
Final thought

The misleading part

At first, I treated it like a llama.cpp tuning problem.

I tried adjusting settings like:

--cache-ram 15000
--ctx-checkpoints 128
--checkpoint-min-step 128

These helped a bit. llama.cpp started creating more checkpoints and had more RAM available for the prompt cache (--cache-ram is given in MiB, so 15000 is roughly 15 GB). But the real problem was still there.

The clue was in the logs.

When the prompt was almost identical, llama.cpp could restore a context checkpoint:

selected slot by LCP similarity, sim_best = 0.996
restored context checkpoint
prompt eval time = 511.40 ms / 212 tokens

That was great. It meant llama.cpp reused almost the whole previous prompt and only processed the new tail.
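As a rough sanity check on those numbers, the sketch below back-calculates how large the prompt might have been. It rests on two assumptions that the log itself does not confirm: that sim_best is the reused prefix length divided by the total prompt length, and that the 212 evaluated tokens are exactly the part that was not reused. The function name is made up for this illustration.

```python
def estimate_prompt_tokens(similarity: float, evaluated_tokens: int) -> float:
    """Estimate total prompt length from the reused fraction and the new tail.

    Assumption (not confirmed by the log): similarity = reused / total
    and evaluated_tokens = total - reused.
    """
    if not 0.0 <= similarity < 1.0:
        raise ValueError("similarity must be in [0, 1)")
    return evaluated_tokens / (1.0 - similarity)


# Values taken from the log lines quoted above.
total = estimate_prompt_tokens(0.996, 212)
reused = total - 212
print(f"estimated prompt size: ~{total:,.0f} tokens, reused: ~{reused:,.0f}")
```

Under those assumptions the prompt was roughly 53,000 tokens, which fits the "tens of thousands of tokens" range. Because 0.996 is a rounded value, the real figure could plausibly sit anywhere between about 47,000 and 61,000.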

But before the fix, this often did not happen. llama.cpp treated many requests as too different, even though from a human point of view they looked like a continuation of the same coding session.

The real cause

The issue was not mainly the model. It was not mainly MTP. It was not mainly context size.

The problem was Claude Code adding an attribution block at the beginning of the system prompt.

Claude Code has an environment variable called:

CLAUDE_CODE_ATTRIBUTION_HEADER

According to the Claude Code documentation, setting it to 0 omits the attribution block from the start of the system prompt. That block can include details like the client version and prompt fingerprint.

For hosted Claude API usage, this makes sense.

For a local llama.cpp backend, it can be a cache killer.

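Prompt cache reuse depends heavily on stable token prefixes. The cached state for each token depends on every token before it, so the server can only reuse the part of the cache that comes before the first difference. If the beginning of the prompt changes, even slightly, almost nothing is reusable. With large coding-agent prompts, that means expensive full prompt re-processing.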
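To make the prefix argument concrete, here is a minimal sketch that compares two consecutive prompts and counts how many leading tokens they share. It is an illustration, not llama.cpp's actual code: the "tokens" are whitespace-split words rather than real tokenizer output, and the fingerprint strings are made-up placeholders for whatever the attribution block contains.

```python
from typing import Sequence


def common_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading items that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in "tokens": whitespace-split words, not a real tokenizer.
body = "You are a coding agent . Repository context follows . <many thousands of tokens>".split()

# Case 1: a per-request value (hypothetical) sits at the very start of the prompt.
prev_head = ["fingerprint=aaa111"] + body
next_head = ["fingerprint=bbb222"] + body + ["new", "user", "message"]

# Case 2: the start is stable and only the tail grows.
prev_stable = body
next_stable = body + ["new", "user", "message"]

print(common_prefix_len(prev_head, next_head))      # no shared prefix at all
print(common_prefix_len(prev_stable, next_stable))  # the whole previous prompt is shared
```

In the first case the two prompts differ at position zero, so a prefix-based cache has nothing to reuse even though everything after the first token is identical. In the second case only the three new tokens at the end would need processing.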

The fix

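Set this environment variable to 0. If you keep environment variables in a JSON settings file (for example, inside the env block of Claude Code's settings.json), the entry looks like this: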

{
  "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
}

Or in Windows CMD:

set CLAUDE_CODE_ATTRIBUTION_HEADER=0

Or in PowerShell:

$env:CLAUDE_CODE_ATTRIBUTION_HEADER = "0"
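If you would rather not change the shell or a settings file, the variable can also be set only for the Claude Code process. The following is a minimal sketch of a launcher, assuming the CLI is installed and reachable as `claude` on PATH; the helper names are invented for this example.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional, Sequence


def env_without_attribution(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the attribution header disabled."""
    env = dict(os.environ if base is None else base)
    env["CLAUDE_CODE_ATTRIBUTION_HEADER"] = "0"
    return env


def launch_claude(args: Sequence[str] = ()) -> int:
    """Start Claude Code with the modified environment and return its exit code.

    Assumption: the CLI is installed and reachable as `claude` on PATH.
    """
    # shutil.which also resolves .cmd/.exe shims on Windows.
    exe = shutil.which("claude")
    if exe is None:
        raise FileNotFoundError("could not find 'claude' on PATH")
    return subprocess.run([exe, *args], env=env_without_attribution()).returncode


# Usage (commented out so importing this file does not start the CLI):
# raise SystemExit(launch_claude())
```

The parent shell's environment is left untouched; only the child process sees the override.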

After this change, llama.cpp started restoring checkpoints much more often.

The logs changed from this:

forcing full prompt re-processing due to lack of cache data

to this:

restored context checkpoint
prompt eval time = 511.40 ms / 212 tokens

That is the difference between reprocessing a massive coding prompt and only processing the newly added part.
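To check whether the change actually helps on your own machine, you can count how often each message appears in the server output. This is a small sketch that matches on the two substrings quoted in this article; your llama.cpp build may word them differently, and the file name in the usage note is only a placeholder.

```python
from collections import Counter
from typing import Iterable

# Substrings copied from the log lines quoted in this article.
FULL_REPROCESS = "forcing full prompt re-processing"
CHECKPOINT_RESTORED = "restored context checkpoint"


def count_cache_events(lines: Iterable[str]) -> Counter:
    """Count full re-processing warnings and checkpoint restores in log lines."""
    counts = Counter({"full_reprocess": 0, "checkpoint_restored": 0})
    for line in lines:
        if FULL_REPROCESS in line:
            counts["full_reprocess"] += 1
        elif CHECKPOINT_RESTORED in line:
            counts["checkpoint_restored"] += 1
    return counts


sample = [
    "forcing full prompt re-processing due to lack of cache data",
    "restored context checkpoint",
    "prompt eval time = 511.40 ms / 212 tokens",
    "restored context checkpoint",
]
counts = count_cache_events(sample)
print(counts["full_reprocess"], counts["checkpoint_restored"])
```

To run it on real output, redirect the server log to a file (for example a hypothetical server.log) and pass the open file object to count_cache_events. Comparing the two counts before and after setting the variable gives a quick picture of how often the cache is being reused.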

My current llama.cpp settings

This is the kind of configuration I am using for this workload:

..\bin\llama-server.exe ^
  -m ..\models\Qwen3.6-27B-UD-Q4_K_XL.gguf ^
  --host 0.0.0.0 ^
  --port 8080 ^
  -c 170000 ^
  -ngl 99 ^
  -t 8 ^
  -b 512 ^
  --flash-attn on ^
  -np 1 ^
  -ctk q4_0 ^
  -ctv q4_0 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 2 ^
  --temp 0.6 ^
  --top-p 0.95 ^
  --top-k 20 ^
  --min-p 0 ^
  --jinja ^
  --reasoning-format deepseek ^
  -n 50000 ^
  --reasoning-budget 8192 ^
  --cache-ram 15000 ^
  --ctx-checkpoints 128 ^
  --checkpoint-min-step 128

The exact numbers are not universal. The important part is this:

CLAUDE_CODE_ATTRIBUTION_HEADER=0

Without that, llama.cpp may keep losing the prompt cache advantage because Claude Code changes the start of the prompt.

Final thought

This was a good reminder that local AI performance is not only about GPU, quantization, context size, or tokens per second.

For coding agents, prompt stability matters.

If your agent keeps changing the beginning of the prompt, your local inference server may be forced to do expensive work again and again. In this case, one small environment variable made the difference between constant full prompt re-processing and useful prompt-cache reuse.

References

Claude Code environment variables: https://code.claude.com/docs/en/env-vars
llama.cpp issue discussing full prompt re-processing with hybrid/recurrent-memory models: ggml-org/llama.cpp#21831
