
Local LLMs

Local models (20B–80B parameters) are not suited for complex coding tasks where frontier models excel. They are useful for:

  • Summarization and Q&A over private notes
  • Working with sensitive documents that cannot be sent to external APIs
  • High-volume tasks where API costs would add up
  • Fully offline or air-gapped environments
The basic workflow is two steps:

  1. Start llama-server with a model — this makes the model available at a local endpoint (e.g. port 8123)
  2. Run Claude Code (or Codex CLI) pointing to that endpoint

Claude Code uses the Anthropic-compatible /v1/messages endpoint. Codex CLI uses the OpenAI-compatible /v1/chat/completions endpoint. Both are served by llama-server simultaneously.
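A quick way to confirm both endpoints are live is to curl each one directly. This is a minimal sketch: port 8123 matches the example below, the placeholder model name is arbitrary (llama-server serves whatever model it was started with), and the JSON bodies are the smallest requests for each API shape.

Terminal window
# OpenAI-compatible endpoint (what Codex CLI uses)
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "max_tokens": 16, "messages": [{"role": "user", "content": "ping"}]}'

# Anthropic-compatible endpoint (what Claude Code uses)
curl -s http://127.0.0.1:8123/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "max_tokens": 16, "messages": [{"role": "user", "content": "ping"}]}'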

Prerequisites:

  • llama.cpp built with llama-server in your PATH (an install sketch follows this list)
  • Sufficient RAM (64 GB+ recommended for 30B+ models)
  • Models download automatically from HuggingFace on first run
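If llama-server is not installed yet, the two usual routes are the Homebrew formula or a source build. A sketch, assuming the ~/Git/llama.cpp clone location used elsewhere in this guide; adjust CMake options for your platform:

Terminal window
# Option 1: package manager (macOS/Linux)
brew install llama.cpp

# Option 2: build from source
git clone https://github.com/ggml-org/llama.cpp ~/Git/llama.cpp
cmake -S ~/Git/llama.cpp -B ~/Git/llama.cpp/build
cmake --build ~/Git/llama.cpp/build --config Release -j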

At its simplest, connecting Claude Code to a local model is one line:

Terminal window
ANTHROPIC_BASE_URL=http://127.0.0.1:8123 claude
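To verify the server is reachable before launching Claude Code, llama-server exposes a /health endpoint (port 8123 is the example used here):

Terminal window
curl -s http://127.0.0.1:8123/health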

The helper function below wraps this in a single command. Add it to your ~/.zshrc or ~/.bashrc:

Terminal window
cclocal() {
  # Default llama-server port used throughout this guide
  local port=8123
  # If the first argument is a number, treat it as the port
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  # Subshell keeps the override out of the current shell environment
  (
    export ANTHROPIC_BASE_URL="http://127.0.0.1:${port}"
    claude "$@"
  )
}

Usage:

Terminal window
cclocal # connect to localhost:8123
cclocal 8124 # connect to localhost:8124
cclocal 8124 --resume abc123 # with extra args
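For Codex CLI, the equivalent hookup goes through a provider entry in ~/.codex/config.toml (the same mechanism used in the vision setup later in this guide). A sketch for a server on port 8124; the provider id llama-8124 and the model label are placeholder assumptions:

[model_providers.llama-8124]
name = "Qwen3-30B-A3B local"
base_url = "http://localhost:8124/v1"
wire_api = "chat"

Terminal window
codex --model qwen3-30b-a3b -c model_provider=llama-8124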

GPT-OSS-20B

Uses the built-in preset with optimized settings:

Terminal window
llama-server --gpt-oss-20b-default --port 8123

Performance: ~17–38 tok/s generation on M1 Max.

Qwen3-30B-A3B-Instruct-2507

Terminal window
llama-server \
-hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
--port 8124 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/Qwen3-Coder.jinja

Performance: ~15–27 tok/s generation on M1 Max.
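With this server running, pointing Claude Code at it is just the helper from above with the matching port:

Terminal window
cclocal 8124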

Qwen3-Coder-30B-A3B (Recommended)

Uses the built-in preset with Q8_0 quantization (higher quality):

Terminal window
llama-server --fim-qwen-30b-default --port 8127

Downloads ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF automatically on first run.

Qwen3-Coder-Next-80B-A3B — Newest SOTA Coder

The latest and most capable coding model from Qwen. 80B MoE with only 3B active parameters. Requires ~46 GB RAM.

Terminal window
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--port 8130 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01

Quant        Size     Notes
UD-Q4_K_XL   ~46 GB   Recommended for 64 GB systems

Qwen3-Next-80B-A3B — Better Long Context

Slower generation but performance does not degrade as much with long contexts:

Terminal window
llama-server \
-hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
--port 8126 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja

Performance: ~5x slower generation than Qwen3-30B-A3B, but better on long contexts.

Nemotron-3-Nano-30B-A3B — NVIDIA Reasoning Model

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01

Recommended settings from NVIDIA:

  • Tool calling: temp=0.6, top_p=0.95
  • Reasoning tasks: temp=1.0, top_p=1.0
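The command above uses the tool-calling settings. For reasoning-heavy tasks, the same launch with only the sampling flags swapped (keeping --min-p from the command above) would look like this sketch:

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.01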

GLM-4.7-Flash

A capable 30B MoE model from Zhipu AI. Requires a custom chat template.

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

For higher quality, use Q8_0 (~32 GB, 20–40% slower):

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

Critical settings:

Setting                Why
--jinja                Required for correct chat template
--chat-template-file   GLM-4-specific template
-fa on                 Flash attention for faster prompt processing
-b 2048                A smaller batch size works better here

Performance (M1 Max 64 GB):

  • Cold start: ~14 seconds
  • Cached follow-ups: ~4–5 seconds
  • Prompt eval: ~68–388 tok/s
  • Generation: ~12–13 tok/s

Quant        Size     Notes
UD-Q4_K_XL   ~18 GB   Good balance, recommended
Q8_0         ~32 GB   Higher quality, 20–40% slower

Port summary:

Model              Port   Command
GPT-OSS-20B        8123   llama-server --gpt-oss-20b-default --port 8123
Qwen3-30B-A3B      8124   See full command above
Nemotron-3-Nano    8125   See full command above
Qwen3-Next-80B     8126   See full command above
Qwen3-Coder-30B    8127   llama-server --fim-qwen-30b-default --port 8127
Qwen3-VL-30B       8128   See Vision Models
GLM-4.7-Flash      8129   See full command above
Qwen3-Coder-Next   8130   See full command above (~46 GB)

Vision Models

Codex CLI supports image inputs (-i / --image), and llama-server can serve vision-language models like Qwen3-VL. This enables local multimodal inference.

Vision models require two GGUF files: the main model plus a multimodal projector (mmproj).

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF \
    mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --local-dir ~/models
  2. Start the server (port 8128):

    Terminal window
    llama-server \
    -hf Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M \
    --mmproj ~/models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --port 8128 \
    -c 32768 \
    -b 2048 \
    -ub 2048 \
    --parallel 1 \
    --jinja
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8128]
    name = "Qwen3-VL Vision"
    base_url = "http://localhost:8128/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model qwen3-vl \
    -c model_provider=llama-8128 \
    -i screenshot.png "describe this"

Troubleshooting

Context overflow errors: increase the context size (-c) or reduce parallel slots (--parallel 1). Claude Code sends large system prompts (~20k+ tokens).

Slow prompt processing or generation:

  • Increase batch size: -b 32768
  • Reduce parallel slots: --parallel 1
  • Check that the model is fully loaded in RAM/VRAM (see the sketch below)
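For the last point, one way to rule out a partial GPU offload is to request all layers explicitly and watch the startup log. A sketch using the Qwen3-Coder preset from above; -ngl sets the number of layers to offload, and 99 simply means "all" (full offload may already be the default on Apple Silicon Metal builds):

Terminal window
llama-server --fim-qwen-30b-default --port 8127 -ngl 99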

Garbled output or failed tool calls: ensure you are using the correct chat template for your model. The template handles formatting the Anthropic API messages into the model's expected format.

First request is slow: this is normal; the model loads into memory on first request (~10–30 seconds depending on model size). Subsequent requests are fast.
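To pay that cost up front instead of on your first real prompt, send a tiny warm-up request right after the server starts (a sketch against the OpenAI-compatible endpoint, with port 8123 as an example):

Terminal window
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 1, "messages": [{"role": "user", "content": "hi"}]}' > /dev/null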