Local LLMs

Local models (20B—80B parameters) are not suited for complex coding tasks where frontier models excel. They are useful for:

  • Summarization and Q&A over private notes
  • Working with sensitive documents that cannot be sent to external APIs
  • High-volume tasks where API costs would add up
  • Fully offline or air-gapped environments

The basic workflow is two steps:

  1. Start llama-server with a model, which exposes it at a local endpoint (e.g. port 8123)
  2. Run Claude Code (or Codex CLI) pointing to that endpoint

Claude Code uses the Anthropic-compatible /v1/messages endpoint. Codex CLI uses the OpenAI-compatible /v1/chat/completions endpoint. Both are served by llama-server simultaneously.
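
To check that both endpoints are responding, you can hit each one with curl (a quick sanity check, assuming the server is already running on port 8123; the "model" value is a placeholder, since llama-server answers with whichever model it has loaded, and the exact required fields may vary by llama.cpp version):

Terminal window
# OpenAI-compatible endpoint (used by Codex CLI)
curl -s http://127.0.0.1:8123/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "local", "messages": [{"role": "user", "content": "ping"}]}'

# Anthropic-compatible endpoint (used by Claude Code)
curl -s http://127.0.0.1:8123/v1/messages \
-H 'Content-Type: application/json' \
-d '{"model": "local", "max_tokens": 32, "messages": [{"role": "user", "content": "ping"}]}'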

You will need:

  • llama.cpp built with llama-server in your PATH
  • Sufficient RAM (64 GB+ recommended for 30B+ models)

Models download automatically from Hugging Face on first run.

At its simplest, connecting Claude Code to a local model is one line:

Terminal window
ANTHROPIC_BASE_URL=http://127.0.0.1:8123 claude

The helper function below is a convenience wrapper. Add it to your ~/.zshrc or ~/.bashrc:

Terminal window
cclocal() {
  # Default llama-server port used throughout this page
  local port=8123
  # If the first argument is a number, treat it as the port
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  # Run claude in a subshell so ANTHROPIC_BASE_URL does not leak into the parent shell
  (
    export ANTHROPIC_BASE_URL="http://127.0.0.1:${port}"
    claude "$@"
  )
}

Usage:

Terminal window
cclocal # connect to localhost:8123
cclocal 8124 # connect to localhost:8124
cclocal 8124 --resume abc123 # with extra args
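
For Codex CLI, the equivalent is a provider entry in ~/.codex/config.toml that points at the same local endpoint, following the same pattern used for the vision models later on this page (a sketch; the provider name llama-8123 and the model name local are arbitrary placeholders):

[model_providers.llama-8123]
name = "Local llama-server"
base_url = "http://localhost:8123/v1"
wire_api = "chat"

Terminal window
codex --model local -c model_provider=llama-8123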

Approximate token generation speeds measured inside Claude Code on an M1 Max (64 GB), with roughly 30—37K input context tokens (a typical Claude Code prompt). All models served via llama-server.

| Model | Active Params | tg (tok/s) |
| --- | --- | --- |
| Gemma-4-26B-A4B | 4B | ~40 |
| Qwen3.6-35B-A3B | 3B | ~35 |
| GPT-OSS-20B | 3.6B | ~17—38 |
| Qwen3-30B-A3B | 3B | ~15—27 |
| GLM-4.7-Flash | 3B | ~12—13 |
| Qwen3.5-35B-A3B | 3B | ~12 |
| Qwen3-Next-80B-A3B | 3B | ~3—5 |

tg = token generation (output speed). Models without benchmarks are omitted. Speeds vary with prompt length, quantization, and system load.
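
For rough standalone numbers on your own hardware, llama.cpp's llama-bench tool reports prompt-processing and generation rates outside of Claude Code (a sketch; the GGUF path is a placeholder, -p sets the prompt length in tokens and -n the number of generated tokens):

Terminal window
llama-bench -m ~/models/your-model.gguf -p 2048 -n 128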

GPT-OSS-20B

Uses the built-in preset with optimized settings:

Terminal window
llama-server --gpt-oss-20b-default --port 8123

Performance: ~17—38 tok/s generation on M1 Max.

Qwen3-30B-A3B

Terminal window
llama-server \
-hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
--port 8124 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/Qwen3-Coder.jinja

Performance: ~15—27 tok/s generation on M1 Max.

Qwen3-Coder-30B-A3B (Recommended)

Uses the built-in preset with Q8_0 quantization (higher quality):

Terminal window
llama-server --fim-qwen-30b-default --port 8127

Downloads ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF automatically on first run.
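
Once the server is running, connect Claude Code to it with the helper defined earlier:

Terminal window
cclocal 8127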

Qwen3-Coder-Next-80B-A3B — Newest SOTA Coder

The latest and most capable coding model from Qwen. 80B MoE with only 3B active parameters. Requires ~46 GB RAM.

Terminal window
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--port 8130 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~46 GB | Recommended for 64 GB systems |

Qwen3.5-35B-A3B — Smart General-Purpose MoE

A 35B MoE model with 3B active parameters. Uses sliding window attention (SWA), which requires the --swa-full flag to enable prompt caching across follow-up requests. Without it, every request reprocesses the full prompt from scratch.

Terminal window
llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--port 8131 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--keep 1024 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--swa-full \
--no-context-shift \
--chat-template-kwargs '{"enable_thinking": false}' \
--mlock \
--no-mmap

Critical settings:

| Setting | Why |
| --- | --- |
| --chat-template-kwargs ... | Disables thinking mode, avoiding wasted tokens on reasoning; recommended for agentic workflows |
| --swa-full | Expands the SWA cache to the full context, enabling prompt caching (uses more RAM) |
| --no-context-shift | Required: context shift is incompatible with SWA |
| --cache-type-k/v q8_0 | "Basically free" quality-wise, boosts throughput |
| --keep 1024 | Keeps the system prompt prefix in cache |
| --mlock --no-mmap | macOS memory optimization |

Performance (M1 Max 64 GB):

  • Cached follow-ups: ~3 seconds
  • Prompt eval: ~374—408 tok/s
  • Generation: ~12 tok/s

| Quant | Size | Notes |
| --- | --- | --- |
| Q4_K_M | ~20 GB | Good balance, recommended |
| UD-Q4_K_XL | ~21 GB | Slightly better quality |
| UD-Q5_K_XL | ~25 GB | Higher quality, slower |

Qwen3-Next-80B-A3B — Better Long Context

Slower generation but performance does not degrade as much with long contexts:

Terminal window
llama-server \
-hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
--port 8126 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja

Performance: ~5x slower generation than Qwen3-30B-A3B, but better on long contexts.

Nemotron-3-Nano-30B-A3B — NVIDIA Reasoning Model

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01

Recommended settings from NVIDIA:

  • Tool calling: temp=0.6, top_p=0.95
  • Reasoning tasks: temp=1.0, top_p=1.0
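
To use the reasoning-oriented settings instead, only the sampling flags change; the rest of the command stays the same as above (a sketch):

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.01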

GLM-4.7-Flash

A capable 30B MoE model from Zhipu AI. Requires a custom chat template.

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

For higher quality, use Q8_0 (~32 GB, 20—40% slower):

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

Critical settings:

| Setting | Why |
| --- | --- |
| --jinja | Required for the correct chat template |
| --chat-template-file | GLM-4 specific template |
| -fa on | Flash attention for faster prompts |
| -b 2048 | Smaller batch works better here |

Performance (M1 Max 64 GB):

  • Cold start: ~14 seconds
  • Cached follow-ups: ~4—5 seconds
  • Prompt eval: ~68—388 tok/s
  • Generation: ~12—13 tok/s

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~18 GB | Good balance, recommended |
| Q8_0 | ~32 GB | Higher quality, 20—40% slower |

Gemma-4-26B-A4B — Google MoE with Vision

A 26B MoE model from Google with only 4B active parameters. Supports up to 256K context. Optionally supports vision via a multimodal projector (mmproj).

Terminal window
llama-server \
-hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
--port 8132 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 64

Key settings:

| Setting | Why |
| --- | --- |
| --temp 1.0 | Recommended by Google |
| --top-k 64 | Gemma-specific sampling parameter |
| -c 131072 | 128K context; Claude Code needs 20k+ tokens for the system prompt alone |
| -fa on | Flash attention for faster prompt processing |

Performance (M1 Max 64 GB, ~37K input tokens):

pp = prompt processing, tg = token generation.

  • Cold start: pp 395 tok/s, tg 40 tok/s (~96s total)
  • Cached follow-up: tg 40 tok/s (~6s total, prompt cached in ~0.4s)

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~16 GB | Recommended, fits comfortably on 64 GB systems |

Qwen3.6-35B-A3B

A 35B MoE model with 3B active parameters. Successor to Qwen3.5-35B-A3B, with vision support. Uses sliding window attention (SWA).

Terminal window
llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
--port 8133 \
-ngl 999 \
--threads 8 \
-c 65536 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--keep 1024 \
--swa-full \
--no-context-shift \
--chat-template-kwargs '{"enable_thinking": false}' \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--no-mmap

Critical settings:

| Setting | Why |
| --- | --- |
| No --cache-type-k/v | Do not use q8_0 KV cache; it drops tg from ~35 to ~12 tok/s. Use the default f16 cache. |
| --swa-full | Expands the SWA cache to the full context, enabling prompt caching |
| --no-context-shift | Required: context shift is incompatible with SWA |
| --chat-template-kwargs ... | Disables thinking mode for agentic workflows |
| -c 65536 | 64K context; enough for Claude Code while avoiding the RAM cost of 128K |

Performance (M1 Max 64 GB, ~41K input tokens):

pp = prompt processing, tg = token generation.

  • Cold start: pp 575 tok/s, tg 35 tok/s (~79s total)
  • Cached follow-up: tg 35 tok/s (~8s total)

| Quant | Size | Notes |
| --- | --- | --- |
| UD-Q4_K_XL | ~23 GB | Recommended |
| UD-Q4_K_M | ~22 GB | Slightly smaller |
| UD-Q4_K_S | ~21 GB | Smallest Q4, marginal quality loss |

Quick port reference for all models on this page:

| Model | Port | Command |
| --- | --- | --- |
| GPT-OSS-20B | 8123 | llama-server --gpt-oss-20b-default --port 8123 |
| Qwen3-30B-A3B | 8124 | See full command above |
| Nemotron-3-Nano | 8125 | See full command above |
| Qwen3-Next-80B | 8126 | See full command above |
| Qwen3-Coder-30B | 8127 | llama-server --fim-qwen-30b-default --port 8127 |
| Qwen3-VL-30B | 8128 | See Vision Models |
| GLM-4.7-Flash | 8129 | See full command above |
| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
| Qwen3.5-35B-A3B | 8131 | See full command above (needs --swa-full) |
| Gemma-4-26B-A4B | 8132 | See full command above |
| Qwen3.6-35B-A3B | 8133 | See full command above (no q8_0 cache!) |

Vision Models

Codex CLI supports image inputs (-i / --image), and llama-server can serve vision-language models like Qwen3-VL, enabling fully local multimodal inference.

Vision models require two GGUF files: the main model plus a multimodal projector (mmproj).

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF \
    mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --local-dir ~/models
  2. Start the server (port 8128):

    Terminal window
    llama-server \
    -hf Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M \
    --mmproj ~/models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --port 8128 \
    -c 32768 \
    -b 2048 \
    -ub 2048 \
    --parallel 1 \
    --jinja
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8128]
    name = "Qwen3-VL Vision"
    base_url = "http://localhost:8128/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model qwen3-vl \
    -c model_provider=llama-8128 \
    -i screenshot.png "describe this"

Gemma-4 also supports vision via a BF16 multimodal projector.

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    unsloth/gemma-4-26B-A4B-it-GGUF \
    mmproj-BF16.gguf \
    --local-dir ~/models
  2. Start the server (port 8132):

    Terminal window
    llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --mmproj ~/models/mmproj-BF16.gguf \
    --port 8132 \
    -c 32768 \
    -b 2048 \
    -ub 1024 \
    --parallel 1 \
    -fa on \
    --jinja \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 64
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8132]
    name = "Gemma-4 Vision"
    base_url = "http://localhost:8132/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model gemma-4 \
    -c model_provider=llama-8132 \
    -i screenshot.png "describe this"

If requests fail because the prompt does not fit, increase the context size (-c) or reduce parallel slots (--parallel 1). Claude Code sends large system prompts (~20k+ tokens).

If prompt processing or generation is slow:

  • Increase the batch size: -b 32768
  • Reduce parallel slots: --parallel 1
  • Check that the model is fully loaded in RAM/VRAM

If responses come back malformed or tool calls fail, ensure you are using the correct chat template for your model. The template formats the Anthropic API messages into the model’s expected prompt format.

A slow first request is normal: the model loads into memory on first use (~10—30 seconds depending on model size). Subsequent requests are fast.
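
To pay that cost up front rather than inside an interactive session, you can send a tiny warm-up request right after starting the server (a sketch, assuming port 8123; the model name is a placeholder):

Terminal window
curl -s http://127.0.0.1:8123/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "local", "max_tokens": 1, "messages": [{"role": "user", "content": "hi"}]}' \
> /dev/null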