Local LLMs

Local models (20B—80B parameters) are not suited for complex coding tasks where frontier models excel. They are useful for:

  • Summarization and Q&A over private notes
  • Working with sensitive documents that cannot be sent to external APIs
  • High-volume tasks where API costs would add up
  • Fully offline or air-gapped environments

The basic workflow is two steps:

  1. Start llama-server with a model, which exposes it at a local endpoint (e.g. port 8123)
  2. Run Claude Code (or Codex CLI) pointing to that endpoint

Claude Code uses the Anthropic-compatible /v1/messages endpoint. Codex CLI uses the OpenAI-compatible /v1/chat/completions endpoint. Both are served by llama-server simultaneously.
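
To check that both endpoints are responding, you can hit each one with curl (a quick sanity check, assuming the server is already running on port 8123; the "model" value is a placeholder, since llama-server answers with whichever model it has loaded, and the exact required fields may vary by llama.cpp version):

Terminal window
# OpenAI-compatible endpoint (used by Codex CLI)
curl -s http://127.0.0.1:8123/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "local", "messages": [{"role": "user", "content": "ping"}]}'

# Anthropic-compatible endpoint (used by Claude Code)
curl -s http://127.0.0.1:8123/v1/messages \
-H 'Content-Type: application/json' \
-d '{"model": "local", "max_tokens": 32, "messages": [{"role": "user", "content": "ping"}]}'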

You will need:

  • llama.cpp built with llama-server in your PATH
  • Sufficient RAM (64 GB+ recommended for 30B+ models)

Models download automatically from Hugging Face on first run.

At its simplest, connecting Claude Code to a local model is one line:

Terminal window
ANTHROPIC_BASE_URL=http://127.0.0.1:8123 claude

The helper function below is a convenience wrapper. Add it to your ~/.zshrc or ~/.bashrc:

Terminal window
cclocal() {
  # Default llama-server port used throughout this page
  local port=8123
  # If the first argument is a number, treat it as the port
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  # Run claude in a subshell so ANTHROPIC_BASE_URL does not leak into the parent shell
  (
    export ANTHROPIC_BASE_URL="http://127.0.0.1:${port}"
    claude "$@"
  )
}

Usage:

Terminal window
cclocal # connect to localhost:8123
cclocal 8124 # connect to localhost:8124
cclocal 8124 --resume abc123 # with extra args
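
For Codex CLI, the equivalent is a provider entry in ~/.codex/config.toml that points at the same local endpoint, following the same pattern used for the vision models later on this page (a sketch; the provider name llama-8123 and the model name local are arbitrary placeholders):

[model_providers.llama-8123]
name = "Local llama-server"
base_url = "http://localhost:8123/v1"
wire_api = "chat"

Terminal window
codex --model local -c model_provider=llama-8123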

Approximate token generation speeds measured inside Claude Code on an M1 Max (64 GB), with roughly 30—37K input context tokens (a typical Claude Code prompt). All models served via llama-server.

| Model | Active Params | tg (tok/s) |
| --- | --- | --- |
| Gemma-4-26B-A4B | 4B | ~40 |
| Qwen3.6-35B-A3B | 3B | ~35 |
| GPT-OSS-20B | 3.6B | ~17—38 |
| Qwen3-30B-A3B | 3B | ~15—27 |
| GLM-4.7-Flash | 3B | ~12—13 |
| Qwen3.5-35B-A3B | 3B | ~12 |
| Qwen3-Next-80B-A3B | 3B | ~3—5 |

tg = token generation (output speed). Models without benchmarks are omitted. Speeds vary with prompt length, quantization, and system load.
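
For rough standalone numbers on your own hardware, llama.cpp's llama-bench tool reports prompt-processing and generation rates outside of Claude Code (a sketch; the GGUF path is a placeholder, -p sets the prompt length in tokens and -n the number of generated tokens):

Terminal window
llama-bench -m ~/models/your-model.gguf -p 2048 -n 128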

GPT-OSS-20B

Uses the built-in preset with optimized settings:

Terminal window
llama-server --gpt-oss-20b-default --port 8123

Performance: ~17—38 tok/s generation on M1 Max.

Qwen3-30B-A3B

Terminal window
llama-server \
-hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
--port 8124 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/Qwen3-Coder.jinja

Performance: ~15—27 tok/s generation on M1 Max.

Qwen3-Coder-30B-A3B (Recommended)

Uses the built-in preset with Q8_0 quantization (higher quality):

Terminal window
llama-server --fim-qwen-30b-default --port 8127

Downloads ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF automatically on first run.
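
Once the server is running, connect Claude Code to it with the helper defined earlier:

Terminal window
cclocal 8127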

Qwen3-Coder-Next-80B-A3B — Newest SOTA Coder

The latest and most capable coding model from Qwen. 80B MoE with only 3B active parameters. Requires ~46 GB RAM.

Terminal window
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--port 8130 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~46 GB | Recommended for 64 GB systems |

Qwen3.5-35B-A3B — Smart General-Purpose MoE

A 35B MoE model with 3B active parameters. Uses sliding window attention (SWA), which requires the --swa-full flag to enable prompt caching across follow-up requests. Without it, every request reprocesses the full prompt from scratch.

Terminal window
llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--port 8131 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--keep 1024 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--swa-full \
--no-context-shift \
--chat-template-kwargs '{"enable_thinking": false}' \
--mlock \
--no-mmap

Critical settings:

| Setting | Why |
| --- | --- |
| --chat-template-kwargs ... | Disables thinking mode, avoiding wasted tokens on reasoning; recommended for agentic workflows |
| --swa-full | Expands the SWA cache to the full context, enabling prompt caching (uses more RAM) |
| --no-context-shift | Required: context shift is incompatible with SWA |
| --cache-type-k/v q8_0 | "Basically free" quality-wise, boosts throughput |
| --keep 1024 | Keeps the system prompt prefix in cache |
| --mlock --no-mmap | macOS memory optimization |

Performance (M1 Max 64 GB):

  • Cached follow-ups: ~3 seconds
  • Prompt eval: ~374—408 tok/s
  • Generation: ~12 tok/s

| Quant | Size | Notes |
| --- | --- | --- |
| Q4_K_M | ~20 GB | Good balance, recommended |
| UD-Q4_K_XL | ~21 GB | Slightly better quality |
| UD-Q5_K_XL | ~25 GB | Higher quality, slower |

Qwen3-Next-80B-A3B — Better Long Context

Slower generation but performance does not degrade as much with long contexts:

Terminal window
llama-server \
-hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
--port 8126 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja

Performance: ~5x slower generation than Qwen3-30B-A3B, but better on long contexts.

Nemotron-3-Nano-30B-A3B — NVIDIA Reasoning Model

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01

Recommended settings from NVIDIA:

  • Tool calling: temp=0.6, top_p=0.95
  • Reasoning tasks: temp=1.0, top_p=1.0
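
To use the reasoning-oriented settings instead, only the sampling flags change; the rest of the command stays the same as above (a sketch):

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.01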

GLM-4.7-Flash

A capable 30B MoE model from Zhipu AI. Requires a custom chat template.

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

For higher quality, use Q8_0 (~32 GB, 20—40% slower):

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

Critical settings:

| Setting | Why |
| --- | --- |
| --jinja | Required for the correct chat template |
| --chat-template-file | GLM-4 specific template |
| -fa on | Flash attention for faster prompts |
| -b 2048 | Smaller batch works better here |

Performance (M1 Max 64 GB):

  • Cold start: ~14 seconds
  • Cached follow-ups: ~4—5 seconds
  • Prompt eval: ~68—388 tok/s
  • Generation: ~12—13 tok/s

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~18 GB | Good balance, recommended |
| Q8_0 | ~32 GB | Higher quality, 20—40% slower |

Gemma-4-26B-A4B — Google MoE with Vision

A 26B MoE model from Google with only 4B active parameters. Supports up to 256K context. Optionally supports vision via a multimodal projector (mmproj).

Terminal window
llama-server \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--port 8132 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Key settings:

| Setting | Why |
| --- | --- |
| --temp 1.0 | Recommended by Google |
| --top-k 64 | Gemma-specific sampling parameter |
| -c 131072 | 128K context; Claude Code needs 20k+ tokens for the system prompt alone |
| -fa on | Flash attention for faster prompt processing |

Performance (M1 Max 64 GB, ~37K input tokens):

pp = prompt processing, tg = token generation.

  • Cold start: pp 395 tok/s, tg 40 tok/s (~96s total)
  • Cached follow-up: tg 40 tok/s (~6s total, prompt cached in ~0.4s)

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~16 GB | Recommended, fits comfortably on 64 GB systems |

Qwen3.6-35B-A3B

A 35B MoE model with 3B active parameters. Successor to Qwen3.5-35B-A3B, with vision support. Uses sliding window attention (SWA).

Terminal window
llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
--port 8133 \
-ngl 999 \
--threads 8 \
-c 65536 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--keep 1024 \
--swa-full \
--no-context-shift \
--chat-template-kwargs '{"enable_thinking": false}' \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--no-mmap

Critical settings:

| Setting | Why |
| --- | --- |
| No --cache-type-k/v | Do not use q8_0 KV cache; it drops tg from ~35 to ~12 tok/s. Use the default f16 cache. |
| --swa-full | Expands the SWA cache to the full context, enabling prompt caching |
| --no-context-shift | Required: context shift is incompatible with SWA |
| --chat-template-kwargs ... | Disables thinking mode for agentic workflows |
| -c 65536 | 64K context; enough for Claude Code while avoiding the RAM cost of 128K |

Performance (M1 Max 64 GB, ~41K input tokens):

pp = prompt processing, tg = token generation.

  • Cold start: pp 575 tok/s, tg 35 tok/s (~79s total)
  • Cached follow-up: tg 35 tok/s (~8s total)

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~23 GB | Recommended |
| UD-Q4_K_M | ~22 GB | Slightly smaller |
| UD-Q4_K_S | ~21 GB | Smallest Q4, marginal quality loss |

Quick port reference for all models on this page:

| Model | Port | Command |
| --- | --- | --- |
| GPT-OSS-20B | 8123 | llama-server --gpt-oss-20b-default --port 8123 |
| Qwen3-30B-A3B | 8124 | See full command above |
| Nemotron-3-Nano | 8125 | See full command above |
| Qwen3-Next-80B | 8126 | See full command above |
| Qwen3-Coder-30B | 8127 | llama-server --fim-qwen-30b-default --port 8127 |
| Qwen3-VL-30B | 8128 | See Vision Models |
| GLM-4.7-Flash | 8129 | See full command above |
| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
| Qwen3.5-35B-A3B | 8131 | See full command above (needs --swa-full) |
| Gemma-4-26B-A4B | 8132 | See full command above |
| Qwen3.6-35B-A3B | 8133 | See full command above (no q8_0 cache!) |

Vision Models

Codex CLI supports image inputs (-i / --image), and llama-server can serve vision-language models like Qwen3-VL, enabling fully local multimodal inference.

Vision models require two GGUF files: the main model plus a multimodal projector (mmproj).

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF \
    mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --local-dir ~/models
  2. Start the server (port 8128):

    Terminal window
    llama-server \
    -hf Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M \
    --mmproj ~/models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --port 8128 \
    -c 32768 \
    -b 2048 \
    -ub 2048 \
    --parallel 1 \
    --jinja
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8128]
    name = "Qwen3-VL Vision"
    base_url = "http://localhost:8128/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model qwen3-vl \
    -c model_provider=llama-8128 \
    -i screenshot.png "describe this"

Gemma-4 also supports vision via a BF16 multimodal projector.

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    unsloth/gemma-4-26B-A4B-it-GGUF \
    mmproj-BF16.gguf \
    --local-dir ~/models
  2. Start the server (port 8132):

    Terminal window
    llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --mmproj ~/models/mmproj-BF16.gguf \
    --port 8132 \
    -c 32768 \
    -b 2048 \
    -ub 1024 \
    --parallel 1 \
    -fa on \
    --jinja \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8132]
    name = "Gemma-4 Vision"
    base_url = "http://localhost:8132/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model gemma-4 \
    -c model_provider=llama-8132 \
    -i screenshot.png "describe this"

If requests fail because the prompt does not fit, increase the context size (-c) or reduce parallel slots (--parallel 1). Claude Code sends large system prompts (~20k+ tokens).

If prompt processing or generation is slow:

  • Increase the batch size: -b 32768
  • Reduce parallel slots: --parallel 1
  • Check that the model is fully loaded in RAM/VRAM

If responses come back malformed or tool calls fail, ensure you are using the correct chat template for your model. The template formats the Anthropic API messages into the model’s expected prompt format.

A slow first request is normal: the model loads into memory on first use (~10—30 seconds depending on model size). Subsequent requests are fast.
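
To pay that cost up front rather than inside an interactive session, you can send a tiny warm-up request right after starting the server (a sketch, assuming port 8123; the model name is a placeholder):

Terminal window
curl -s http://127.0.0.1:8123/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "local", "max_tokens": 1, "messages": [{"role": "user", "content": "hi"}]}' \
> /dev/null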