
Local LLMs

Local models (20B–80B parameters) are not suited for complex coding tasks where frontier models excel. They are useful for:

  • Summarization and Q&A over private notes
  • Working with sensitive documents that cannot be sent to external APIs
  • High-volume tasks where API costs would add up
  • Fully offline or air-gapped environments
The basic workflow is two steps:

  1. Start llama-server with a model — this makes the model available at a local endpoint (e.g. port 8123)
  2. Run Claude Code (or Codex CLI) pointing to that endpoint

Claude Code uses the Anthropic-compatible /v1/messages endpoint. Codex CLI uses the OpenAI-compatible /v1/chat/completions endpoint. Both are served by llama-server simultaneously.
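A quick way to confirm both endpoints are live is to curl each one directly. This is a minimal sketch: port 8123 matches the example below, the placeholder model name is arbitrary (llama-server serves whatever model it was started with), and the JSON bodies are the smallest requests for each API shape.

Terminal window
# OpenAI-compatible endpoint (what Codex CLI uses)
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "max_tokens": 16, "messages": [{"role": "user", "content": "ping"}]}'

# Anthropic-compatible endpoint (what Claude Code uses)
curl -s http://127.0.0.1:8123/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "max_tokens": 16, "messages": [{"role": "user", "content": "ping"}]}'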

Prerequisites:

  • llama.cpp built with llama-server in your PATH (an install sketch follows this list)
  • Sufficient RAM (64 GB+ recommended for 30B+ models)
  • Models download automatically from HuggingFace on first run
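If llama-server is not installed yet, the two usual routes are the Homebrew formula or a source build. A sketch, assuming the ~/Git/llama.cpp clone location used elsewhere in this guide; adjust CMake options for your platform:

Terminal window
# Option 1: package manager (macOS/Linux)
brew install llama.cpp

# Option 2: build from source
git clone https://github.com/ggml-org/llama.cpp ~/Git/llama.cpp
cmake -S ~/Git/llama.cpp -B ~/Git/llama.cpp/build
cmake --build ~/Git/llama.cpp/build --config Release -j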

At its simplest, connecting Claude Code to a local model is one line:

Terminal window
ANTHROPIC_BASE_URL=http://127.0.0.1:8123 claude
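To verify the server is reachable before launching Claude Code, llama-server exposes a /health endpoint (port 8123 is the example used here):

Terminal window
curl -s http://127.0.0.1:8123/health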

The helper function below wraps this in a single command. Add it to your ~/.zshrc or ~/.bashrc:

Terminal window
cclocal() {
  # Default llama-server port used throughout this guide
  local port=8123
  # If the first argument is a number, treat it as the port
  if [[ "$1" =~ ^[0-9]+$ ]]; then
    port="$1"
    shift
  fi
  # Subshell keeps the override out of the current shell environment
  (
    export ANTHROPIC_BASE_URL="http://127.0.0.1:${port}"
    claude "$@"
  )
}

Usage:

Terminal window
cclocal # connect to localhost:8123
cclocal 8124 # connect to localhost:8124
cclocal 8124 --resume abc123 # with extra args
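For Codex CLI, the equivalent hookup goes through a provider entry in ~/.codex/config.toml (the same mechanism used in the vision setup later in this guide). A sketch for a server on port 8124; the provider id llama-8124 and the model label are placeholder assumptions:

[model_providers.llama-8124]
name = "Qwen3-30B-A3B local"
base_url = "http://localhost:8124/v1"
wire_api = "chat"

Terminal window
codex --model qwen3-30b-a3b -c model_provider=llama-8124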

GPT-OSS-20B

Uses the built-in preset with optimized settings:

Terminal window
llama-server --gpt-oss-20b-default --port 8123

Performance: ~17–38 tok/s generation on M1 Max.

Qwen3-30B-A3B-Instruct-2507

Terminal window
llama-server \
-hf unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF \
--port 8124 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/Qwen3-Coder.jinja

Performance: ~15–27 tok/s generation on M1 Max.
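With this server running, pointing Claude Code at it is just the helper from above with the matching port:

Terminal window
cclocal 8124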

Qwen3-Coder-30B-A3B (Recommended)

Uses the built-in preset with Q8_0 quantization (higher quality):

Terminal window
llama-server --fim-qwen-30b-default --port 8127

Downloads ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF automatically on first run.

Qwen3-Coder-Next-80B-A3B — Newest SOTA Coder

The latest and most capable coding model from Qwen. 80B MoE with only 3B active parameters. Requires ~46 GB RAM.

Terminal window
llama-server \
-hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
--port 8130 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--min-p 0.01

Quant        Size     Notes
UD-Q4_K_XL   ~46 GB   Recommended for 64 GB systems

Qwen3-Next-80B-A3B — Better Long Context

Slower generation but performance does not degrade as much with long contexts:

Terminal window
llama-server \
-hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_XL \
--port 8126 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja

Performance: ~5x slower generation than Qwen3-30B-A3B, but better on long contexts.

Nemotron-3-Nano-30B-A3B — NVIDIA Reasoning Model

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.01

Recommended settings from NVIDIA:

  • Tool calling: temp=0.6, top_p=0.95
  • Reasoning tasks: temp=1.0, top_p=1.0
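The command above uses the tool-calling settings. For reasoning-heavy tasks, the same launch with only the sampling flags swapped (keeping --min-p from the command above) would look like this sketch:

Terminal window
llama-server \
-hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:Q4_K_XL \
--port 8125 \
-c 131072 \
-b 32768 \
-ub 1024 \
--parallel 1 \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.jinja \
--temp 1.0 \
--top-p 1.0 \
--min-p 0.01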

GLM-4.7-Flash

A capable 30B MoE model from Zhipu AI. Requires a custom chat template.

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

For higher quality, use Q8_0 (~32 GB, 20–40% slower):

Terminal window
llama-server \
-hf unsloth/GLM-4.7-Flash-GGUF:Q8_0 \
--port 8129 \
-c 131072 \
-b 2048 \
-ub 1024 \
--parallel 1 \
-fa on \
--jinja \
--chat-template-file \
~/Git/llama.cpp/models/templates/glm-4.jinja

Critical settings:

Setting                Why
--jinja                Required for correct chat template
--chat-template-file   GLM-4-specific template
-fa on                 Flash attention for faster prompt processing
-b 2048                A smaller batch size works better here

Performance (M1 Max 64 GB):

  • Cold start: ~14 seconds
  • Cached follow-ups: ~4–5 seconds
  • Prompt eval: ~68–388 tok/s
  • Generation: ~12–13 tok/s

Quant        Size     Notes
UD-Q4_K_XL   ~18 GB   Good balance, recommended
Q8_0         ~32 GB   Higher quality, 20–40% slower

Port summary:

Model              Port   Command
GPT-OSS-20B        8123   llama-server --gpt-oss-20b-default --port 8123
Qwen3-30B-A3B      8124   See full command above
Nemotron-3-Nano    8125   See full command above
Qwen3-Next-80B     8126   See full command above
Qwen3-Coder-30B    8127   llama-server --fim-qwen-30b-default --port 8127
Qwen3-VL-30B       8128   See Vision Models
GLM-4.7-Flash      8129   See full command above
Qwen3-Coder-Next   8130   See full command above (~46 GB)

Vision Models

Codex CLI supports image inputs (-i / --image), and llama-server can serve vision-language models like Qwen3-VL. This enables local multimodal inference.

Vision models require two GGUF files: the main model plus a multimodal projector (mmproj).

  1. Download the mmproj file (one-time):

    Terminal window
    mkdir -p ~/models
    hf download \
    Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF \
    mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --local-dir ~/models
  2. Start the server (port 8128):

    Terminal window
    llama-server \
    -hf Qwen/Qwen3-VL-30B-A3B-Instruct-GGUF:Q4_K_M \
    --mmproj ~/models/mmproj-Qwen3-VL-30B-A3B-Instruct-f16.gguf \
    --port 8128 \
    -c 32768 \
    -b 2048 \
    -ub 2048 \
    --parallel 1 \
    --jinja
  3. Add the provider to ~/.codex/config.toml:

    [model_providers.llama-8128]
    name = "Qwen3-VL Vision"
    base_url = "http://localhost:8128/v1"
    wire_api = "chat"
  4. Run Codex with an image:

    Terminal window
    codex --model qwen3-vl \
    -c model_provider=llama-8128 \
    -i screenshot.png "describe this"

Troubleshooting

Context overflow errors: increase the context size (-c) or reduce parallel slots (--parallel 1). Claude Code sends large system prompts (~20k+ tokens).

Slow prompt processing or generation:

  • Increase batch size: -b 32768
  • Reduce parallel slots: --parallel 1
  • Check that the model is fully loaded in RAM/VRAM (see the sketch below)
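For the last point, one way to rule out a partial GPU offload is to request all layers explicitly and watch the startup log. A sketch using the Qwen3-Coder preset from above; -ngl sets the number of layers to offload, and 99 simply means "all" (full offload may already be the default on Apple Silicon Metal builds):

Terminal window
llama-server --fim-qwen-30b-default --port 8127 -ngl 99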

Garbled output or failed tool calls: ensure you are using the correct chat template for your model. The template handles formatting the Anthropic API messages into the model's expected format.

First request is slow: this is normal; the model loads into memory on first request (~10–30 seconds depending on model size). Subsequent requests are fast.
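To pay that cost up front instead of on your first real prompt, send a tiny warm-up request right after the server starts (a sketch against the OpenAI-compatible endpoint, with port 8123 as an example):

Terminal window
curl -s http://127.0.0.1:8123/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"max_tokens": 1, "messages": [{"role": "user", "content": "hi"}]}' > /dev/null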