Local AI workflow

How to Run Claude Code with Local AI Models Without Breaking llama.cpp KV Cache (Kept Up-to-Date)

Local backends like llama.cpp, llama-server, and LM Studio are fastest when the beginning of the prompt stays the same. If Claude Code keeps adding changing metadata or git context, the backend may miss the KV cache and re-process the large system prompt again. The setup below keeps the prompt more stable.

Problem origin: Reddit LocalLLaMA PSA.

TL;DR: use this version.

1. Start llama-server

/path/to/llama-server \
  --model /path/to/your-model.gguf \
  --jinja \
  --reasoning auto \
  --threads 12 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --mlock \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-ram 24576 \
  --ctx-checkpoints 128 \
  --checkpoint-every-n-tokens 1024 \
  --slot-prompt-similarity 0.01 \
  --host 127.0.0.1 \
  --port 8080 \
  --parallel 1 \
  --cont-batching \
  --metrics \
  --slots

If you use a TurboQuant build, use:

--cache-type-v turbo4

2. Start Claude Code

ANTHROPIC_BASE_URL=http://127.0.0.1:8080 \
ANTHROPIC_API_KEY=no-key \
ANTHROPIC_MODEL=your-model.gguf \
CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 \
DISABLE_TELEMETRY=1 \
DISABLE_ERROR_REPORTING=1 \
claude --bare \
  --model your-model.gguf \
  --dangerously-skip-permissions \
  --exclude-dynamic-system-prompt-sections \
  --settings '{"includeGitInstructions":false}' \
  --allowedTools "WebFetch,Read,Edit,Write,Bash(curl:*)"

Core idea: fewer dynamic prompt changes means better prefix reuse, less repeated prefill, and faster local Claude Code experiments.

Local Test Result

To demonstrate and verify the setup, we ran the experiment on a MacBook Pro with Apple M3 Max using a Qwen3.6 27B model. We asked Claude Code to generate a single-file HTML Gomoku game across multiple rounds, then checked llama-server timing output, prompt-eval tokens, and generation speed to confirm whether the cache behavior was working.

Local llama-server and Claude Code setup running on a MacBook with a Qwen3.6 27B model — The local llama-server and Claude Code setup used for the Gomoku test.

Claude Code ran normally against the local endpoint, and the timing logs show the cache problem was fixed for the repeated turns.

Claude Code successfully generating the Gomoku HTML game — Claude Code generating the single-file Gomoku game during the local-model test.

First request:
prompt eval time = 271184.95 ms / 34947 tokens

Next request:
selected slot by LCP similarity, sim_best = 0.989
restored context checkpoint
prompt eval time = 5136.44 ms / 441 tokens

Cache proof: repeated prompt-eval work dropped from 34,947 tokens on the first request to 441 tokens on the next request after llama-server restored the context checkpoint.

Perfect! We also checked the recording side. Even when Claude Code was pointed at a local third-party provider instead of Anthropic, DataMoat captured the work trail, token breakdown, and thinking tokens, including details that were not visible in the Claude Code UI.

DataMoat capture details showing the Claude Code run and token information — Captured run details and token breakdown from the local-provider test.

Next, we tested tool calling by asking Claude Code to open Chrome and take a screenshot.

Claude Code tool-calling test opening Chrome and taking a screenshot — Tool calling worked during the cached local-model session.

DataMoat showing the captured and encrypted screenshot artifact — The screenshot artifact was captured and encrypted by DataMoat, even though the Claude Code UI did not surface the image file itself.

Amazing! This confirms the cache path works for ordinary coding turns and tool-calling turns in this test setup.

Thanks for reading. We will keep this setup updated as llama-server, local models, and Claude Code change. Bookmark or share it if the notes are useful.