# Local LLMs
## When to Use Local Models

Local models (20B-80B parameters) are not suited for complex coding tasks where frontier models excel. They are useful for:
- Summarization and Q&A over private notes
- Working with sensitive documents that cannot be sent to external APIs
- High-volume tasks where API costs would add up
- Fully offline or air-gapped environments
## How It Works

- Start `llama-server` with a model; this makes the model available at a local endpoint (e.g. port 8123)
- Run Claude Code (or Codex CLI) pointing to that endpoint

Claude Code uses the Anthropic-compatible `/v1/messages` endpoint. Codex CLI uses the OpenAI-compatible `/v1/chat/completions` endpoint. Both are served by `llama-server` simultaneously.
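
A quick way to confirm both endpoints are live is to hit them directly with `curl`. This is a minimal sketch: the request bodies follow the standard Anthropic and OpenAI wire formats, the `model` value is a placeholder (llama-server serves whichever model it was started with), and no auth headers are sent on the assumption that the server was started without `--api-key`.

```bash
# Anthropic-style endpoint (what Claude Code talks to)
curl -s http://127.0.0.1:8123/v1/messages \
  -H 'Content-Type: application/json' \
  -d '{"model": "local", "max_tokens": 64,
       "messages": [{"role": "user", "content": "Say hello."}]}'

# OpenAI-style endpoint (what Codex CLI talks to)
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "local",
       "messages": [{"role": "user", "content": "Say hello."}]}'
```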
## Prerequisites

- llama.cpp built with `llama-server` in your `PATH`
- Sufficient RAM (64 GB+ recommended for 30B+ models)
- Models download automatically from HuggingFace on first run
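
Before starting a server, a quick sanity check helps; the `sysctl` line below assumes macOS, since the performance figures in this guide come from an M1 Max.

```bash
# Confirm llama-server is on PATH
command -v llama-server

# Report installed RAM in GB (macOS)
sysctl -n hw.memsize | awk '{ printf "%.0f GB RAM\n", $1 / 1073741824 }'
```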
## Shell Function

At its simplest, connecting Claude Code to a local model is one line:

```bash
ANTHROPIC_BASE_URL=http://127.0.0.1:8123 claude
```

The helper function below is a convenience wrapper. Add it to your `~/.zshrc` or `~/.bashrc`:
```bash
cclocal() {
  local port=8123
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  (
    export ANTHROPIC_BASE_URL="http://127.0.0.1:${port}"
    claude "$@"
  )
}
```

Usage:

```bash
cclocal                         # connect to localhost:8123
cclocal 8124                    # connect to localhost:8124
cclocal 8124 --resume abc123    # with extra args
```

## Configuration

Add a local provider to `~/.codex/config.toml`:
```toml
[model_providers.llama-local]
name = "Local LLM via llama.cpp"
base_url = "http://localhost:8123/v1"
wire_api = "chat"
```

For multiple ports (different models), define multiple providers:
```toml
[model_providers.llama-8123]
name = "Local LLM port 8123"
base_url = "http://localhost:8123/v1"
wire_api = "chat"

[model_providers.llama-8124]
name = "Local LLM port 8124"
base_url = "http://localhost:8124/v1"
wire_api = "chat"
```

## Switching Models

Use the `--model` and `-c` flags to switch models without editing the TOML file:
```bash
# GPT-OSS-20B on port 8123
codex --model gpt-oss-20b \
  -c model_provider=llama-8123

# Qwen3-30B on port 8124
codex --model qwen3-30b \
  -c model_provider=llama-8124
```

You can also override nested config values:
```bash
codex --model gpt-oss-20b \
  -c model_provider=llama-local \
  -c model_providers.llama-local.base_url="http://localhost:8124/v1"
```

- Codex uses `/v1/chat/completions` (OpenAI format), not `/v1/messages` (Anthropic format)
- Both endpoints are served by `llama-server` simultaneously
- The same model can serve both Claude Code and Codex at the same time
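
If you switch between local providers often, a small wrapper analogous to `cclocal` can save typing. This is a hypothetical helper, not part of Codex CLI: it only composes the `-c model_provider=...` flag shown above, and it assumes the `llama-8123`/`llama-8124` providers defined in the Configuration section.

```bash
# Hypothetical wrapper: pick a local provider by port, pass everything else to codex
cxlocal() {
  local provider=llama-8123
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    provider="llama-$1"
    shift
  fi
  codex -c model_provider="${provider}" "$@"
}

cxlocal --model gpt-oss-20b          # uses the llama-8123 provider
cxlocal 8124 --model qwen3-30b       # uses the llama-8124 provider
```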
## Model Commands

### GPT-OSS-20B — Fast, Good Baseline

Uses the built-in preset with optimized settings:

```bash
llama-server --gpt-oss-20b-default --port 8123
```

Performance: ~17-38 tok/s generation on M1 Max.
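
Once this server is up, either client can use it with the pieces defined earlier (the `cclocal` helper and the `llama-8123` provider):

```bash
cclocal 8123                                               # Claude Code
codex --model gpt-oss-20b -c model_provider=llama-8123     # Codex CLI
```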
### Qwen3-30B-A3B

```bash
llama-server \
  -hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
  --port 8124 \
  -c 131072 \
  -b 32768 \
  -ub 1024 \
  --parallel 1 \
  --jinja \
  --chat-template-file \
    ~/Git/llama.cpp/models/templates/Qwen3-Coder.jinja
```

Performance: ~15-27 tok/s generation on M1 Max.
### Qwen3-Coder-30B-A3B (Recommended)

Uses the built-in preset with Q8_0 quantization (higher quality):

```bash
llama-server --fim-qwen-30b-default --port 8127
```

Downloads `ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF` automatically on first run.
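
Because the preset resolves and downloads the weights itself, it can be useful to confirm what the server actually loaded. A minimal check, assuming llama-server's OpenAI-compatible `/v1/models` listing endpoint:

```bash
# List the model served on port 8127
curl -s http://127.0.0.1:8127/v1/models
```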
### Qwen3-Coder-Next-80B-A3B — Newest SOTA Coder

The latest and most capable coding model from Qwen: an 80B MoE with only 3B active parameters. Requires ~46 GB RAM.

```bash
llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
  --port 8130 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01
```

| Quant | Size | Notes |
|---|---|---|
| UD-Q4_K_XL | ~46 GB | Recommended for 64 GB systems |
### Qwen3-Next-80B-A3B — Better Long Context

Slower generation, but performance does not degrade as much with long contexts:

```bash
llama-server \
  -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
  --port 8126 \
  -c 131072 \
  -b 32768 \
  -ub 1024 \
  --parallel 1 \
  --jinja
```

Performance: ~5x slower generation than Qwen3-30B-A3B, but better on long contexts.
### Nemotron-3-Nano-30B-A3B — NVIDIA Reasoning Model

```bash
llama-server \
  -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
  --port 8125 \
  -c 131072 \
  -b 32768 \
  -ub 1024 \
  --parallel 1 \
  --jinja \
  --chat-template-file \
    ~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
  --temp 0.6 \
  --top-p 0.95 \
  --min-p 0.01
```

Recommended settings from NVIDIA:

- Tool calling: `temp=0.6`, `top_p=0.95`
- Reasoning tasks: `temp=1.0`, `top_p=1.0`
### GLM-4.7-Flash — Zhipu AI 30B-A3B MoE

A capable 30B MoE model from Zhipu AI. Requires a custom chat template.

```bash
llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
  --port 8129 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --chat-template-file \
    ~/Git/llama.cpp/models/templates/glm-4.jinja
```

For higher quality, use Q8_0 (~32 GB, 20-40% slower):

```bash
llama-server \
  -hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
  --port 8129 \
  -c 131072 \
  -b 2048 \
  -ub 1024 \
  --parallel 1 \
  -fa on \
  --jinja \
  --chat-template-file \
    ~/Git/llama.cpp/models/templates/glm-4.jinja
```

Critical settings:

| Setting | Why |
|---|---|
| `--jinja` | Required for correct chat template |
| `--chat-template-file` | GLM-4 specific template |
| `-fa on` | Flash attention for faster prompts |
| `-b 2048` | Smaller batch works better here |
Performance (M1 Max 64 GB):
- Cold start: ~14 seconds
- Cached follow-ups: ~4-5 seconds
- Prompt eval: ~68-388 tok/s
- Generation: ~12-13 tok/s

| Quant | Size | Notes |
|---|---|---|
| UD-Q4_K_XL | ~18 GB | Good balance, recommended |
| Q8_0 | ~32 GB | Higher quality, 20-40% slower |
## Quick Reference

| Model | Port | Command |
|---|---|---|
| GPT-OSS-20B | 8123 | `llama-server --gpt-oss-20b-default --port 8123` |
| Qwen3-30B-A3B | 8124 | See full command above |
| Nemotron-3-Nano | 8125 | See full command above |
| Qwen3-Next-80B | 8126 | See full command above |
| Qwen3-Coder-30B | 8127 | `llama-server --fim-qwen-30b-default --port 8127` |
| Qwen3-VL-30B | 8128 | See Vision Models |
| GLM-4.7-Flash | 8129 | See full command above |
| Qwen3-Coder-Next | 8130 | See full command above (~46 GB) |
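
With several models mapped to fixed ports, a short loop shows which servers are currently up. This sketch assumes llama-server's `/health` endpoint, which returns OK once the model is ready:

```bash
for port in 8123 8124 8125 8126 8127 8128 8129 8130; do
  if curl -sf --max-time 1 "http://127.0.0.1:${port}/health" > /dev/null; then
    echo "port ${port}: up"
  else
    echo "port ${port}: down"
  fi
done
```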
## Vision Models

Codex CLI supports image inputs (`-i` / `--image`), and llama-server can serve vision-language models like Qwen3-VL. This enables local multimodal inference.
### Qwen3-VL-30B-A3B Setup

Vision models require two GGUF files: the main model plus a multimodal projector (mmproj).

1. Download the mmproj file (one-time):

   ```bash
   mkdir -p ~/models
   hf download \
     Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF \
     mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
     --local-dir ~/models
   ```

2. Start the server (port 8128):

   ```bash
   llama-server \
     -hf Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M \
     --mmproj ~/models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
     --port 8128 \
     -c 32768 \
     -b 2048 \
     -ub 2048 \
     --parallel 1 \
     --jinja
   ```

3. Add the provider to `~/.codex/config.toml`:

   ```toml
   [model_providers.llama-8128]
   name = "Qwen3-VL Vision"
   base_url = "http://localhost:8128/v1"
   wire_api = "chat"
   ```

4. Run Codex with an image:

   ```bash
   codex --model qwen3-vl \
     -c model_provider=llama-8128 \
     -i screenshot.png "describe this"
   ```
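
You can also query the vision server directly, without Codex. This is a minimal sketch that assumes the server accepts the standard OpenAI image format (a base64 data URL inside the message content); the `model` value is a placeholder.

```bash
IMG_B64=$(base64 < screenshot.png | tr -d '\n')

curl -s http://127.0.0.1:8128/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "model": "qwen3-vl",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Describe this screenshot."},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,${IMG_B64}"}}
    ]
  }]
}
EOF
```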
## Troubleshooting

### "failed to find a memory slot" errors

Increase context size (`-c`) or reduce parallel slots (`--parallel 1`). Claude Code sends large system prompts (~20k+ tokens).
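
For example, restarting the GPT-OSS-20B preset with a larger context window and a single slot; this is a sketch and assumes the preset's defaults can be overridden by explicit flags, as with the other commands above:

```bash
llama-server --gpt-oss-20b-default --port 8123 \
  -c 131072 \
  --parallel 1
```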
### Slow generation

- Increase batch size: `-b 32768`
- Reduce parallel slots: `--parallel 1`
- Check if the model is fully loaded in RAM/VRAM
### Model not responding correctly

Ensure you are using the correct chat template for your model. The template handles formatting the Anthropic API messages into the model's expected format.
### First request is slow

This is normal: the model loads into memory on the first request (~10-30 seconds depending on model size). Subsequent requests are fast.
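
To pay that cost up front rather than in your first real session, you can warm the server with a trivial request right after starting it (the `model` value is a placeholder; llama-server serves whichever model it was started with):

```bash
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "local", "max_tokens": 1,
       "messages": [{"role": "user", "content": "hi"}]}' > /dev/null
```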
## See Also

- Alternative LLM Providers — cloud-hosted Anthropic-compatible providers
- Chutes Integration — OpenAI-compatible Chutes provider