Local-first

Local agents,
the easy way.

locca sets up llama.cpp, helps you pick GGUF models that fit your hardware, benchmarks them, and wires the pi coding agent (or any OpenAI-compatible client) to whatever you're running. One CLI, no flag spelunking.

$ npm install -g @zeiq/locca

Why locca

Defaults that respect your hardware.

Most local-LLM tools either bury llama.cpp under abstraction or dump every flag in your lap. locca picks defaults that work. When you want to override them, the flags are right there.

llama.cpp, handled

locca install-llama drops a prebuilt binary into ~/.locca/bin — auto-detects Vulkan, CUDA, HIP, Metal, or CPU. No compiler, no sudo. From there, locca serves models with tuned defaults: flash attention, quantized KV cache, batch sizing.
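
If you're curious what those defaults amount to in raw llama-server terms, it's roughly the sketch below. The flags are real llama.cpp options (spelling varies slightly across versions); the values and model path are illustrative, not necessarily what locca picks for your machine:

$ ~/.locca/bin/llama-server \
    -m ~/models/qwen2.5-7b-instruct-q4_k_m.gguf \
    --flash-attn \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --batch-size 2048 --ubatch-size 512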

Catalog with fit hints

The first-run wizard and locca switch show every curated model with a fit hint ("fits — 5.6 GB dl, 14.3 GB RAM, 256k ctx") based on your detected hardware. No more 30 GB downloads that won't run.
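
The numbers are easy to sanity-check yourself. Here's the back-of-envelope version (our approximation, not locca's exact formula):

# download ≈ params × bits-per-weight ÷ 8
#   e.g. 7B at Q4_K_M (≈4.8 bpw): 7e9 × 4.8 ÷ 8 ≈ 4.2 GB
# RAM needed ≈ download size + KV cache (grows with ctx) + overhead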

Search & download

locca search qwen fuzzy-searches HuggingFace, locca download pulls a GGUF straight into your models dir, locca delete reclaims the disk. Vision adapters (mmproj*.gguf) auto-attach to their parent.
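
A typical session looks like this; the repo id and folder name are hypothetical stand-ins for whatever search returns:

$ locca search qwen                              # fuzzy-search HuggingFace
$ locca download Qwen/Qwen2.5-7B-Instruct-GGUF   # hypothetical repo id
$ locca delete qwen2.5-7b-instruct               # reclaim the disk later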

Bench in one command

locca bench wraps llama-bench with a friendlier summary — live tok/s and ctx during the run, results table at the end. Compare quants and ctx sizes without touching a flag.
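
If you ever want the raw tool, the underlying call is roughly this (real llama-bench flags; path and values illustrative):

$ llama-bench -m ~/models/qwen2.5-7b-instruct-q4_k_m.gguf -p 512 -n 128
# -p: prompt tokens to process (prefill), -n: tokens to generate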

Doctor & optimise

locca doctor sweeps hardware, server state, and the last 64 KiB of log for known issues — outdated chat templates, OOMs, ctx truncation. locca optimise hands the same data to pi and asks for concrete tweaks.
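
Both are one-shot commands with no required flags:

$ locca doctor     # hardware, server state, last 64 KiB of log
$ locca optimise   # hands the same picture to pi for ranked fixes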

OpenAI-compatible

Point Cursor, Claude Code, or any OpenAI client at the local server. locca api prints every reachable LAN and Tailscale URL — probed live. Already running llama-server? locca detects it on /health, marks it attached, and uses it instead of spawning a duplicate.
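
Because it's llama-server's standard OpenAI-compatible HTTP API underneath, a bare curl works too. The port below is llama-server's stock default (8080); the wizard may have tuned yours, so check locca api first:

$ curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"Say hi in five words."}]}'
$ curl -s http://localhost:8080/health    # the same endpoint locca probes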

Get going

Install.

One command, then a wizard. locca picks a models folder with you, fetches llama.cpp if it’s missing, downloads a starter GGUF, and installs the pi coding agent.

option a · cli
install & run
$ npm install -g @zeiq/locca && locca
 
# first-run wizard:
# 1. picks a models folder
# 2. installs llama.cpp if missing
# 3. tunes port, ctx, threads, VRAM
# 4. offers a starter GGUF download
# 5. installs the pi coding agent
 
# re-run anytime: `locca setup`
option b · ai
let an AI walk you through it
» Copy the setup prompt →
 
# paste into Claude / ChatGPT / Gemini /
# Cursor — it'll detect your OS, install
# Node if needed, run the npm install,
# and walk you through the wizard.
 
# works great with `claude` / `codex` /
# `gemini` CLIs — they can run the
# commands for you, end-to-end.
don't have llama.cpp? — the wizard handles it, but you can also run it standalone:
$ locca install-llama
# downloads a prebuilt binary into ~/.locca/bin/
# auto-detects Vulkan / CUDA / HIP / Metal / CPU
# already on PATH? locca uses that. update: --update

Surface

A small set of commands.

Run locca with no args for the menu, or jump straight to what you need.

  • locca pi · Launch the pi coding agent against your local server.
  • locca serve · Start llama-server with a picked model, detached.
  • locca switch · Catalog-aware picker — installed models + curated catalog with fit hints.
  • locca bench · Run llama-bench with a friendlier summary.
  • locca doctor · Health check — hardware, server, log warnings, config sanity.
  • locca optimise · Have pi review the deployment and rank concrete tweaks.
  • locca api · Print OpenAI-compatible connection info + LAN URLs.
  • locca logs · Tail the server log (locca-spawned servers only).
  • locca download · Pull a GGUF from HuggingFace into your models dir.
  • locca search · Fuzzy-search HuggingFace for GGUF models.
  • locca delete · Remove a model directory you no longer need.
  • locca stop · Stop the running server.
  • locca install-llama · Download / update a prebuilt llama.cpp binary into ~/.locca/bin. Auto-detects backend.
  • locca config · View / edit settings — get, set, reset, list, path (example below).
  • locca setup · Re-run the setup wizard.
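
For instance, locca config wraps settings access. The subcommands come from the list above; the key name here is a hypothetical example:

$ locca config list          # dump current settings
$ locca config get port      # "port" is a hypothetical key name
$ locca config set port 8081
$ locca config path          # locate the config file on disk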

Bonus

One command into the pi coding agent.

locca pi qwen fuzzy-matches the first *qwen*.gguf in your models dir, brings up the server if it isn’t already running, and registers itself as a custom OpenAI-compatible provider in ~/.pi/agent/models.json. Switch model, switch brain — locca switch gpt-oss-20b.
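
In practice the whole loop is two commands:

$ locca pi qwen              # match the first *qwen*.gguf, boot the server if needed
$ locca switch gpt-oss-20b   # swap the model under the same agent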

Where pi keeps its stuff.

Once locca pi drops you into the agent, these are the paths worth knowing. Global config lives under ~/.pi/agent/; per-project overrides go in .pi/ at your repo root.

  • ~/.pi/agent/settings.json · Model, theme, thinking level, retries, telemetry. Project overrides at .pi/settings.json.
  • ~/.pi/agent/models.json · Custom OpenAI-compatible providers. locca owns the locca entry and rewrites it on every locca pi — leave the rest alone.
  • ~/.pi/agent/skills/ · Drop in pi-skills packages — each one a folder with a SKILL.md. Project skills also load from .pi/skills/ and ancestors up to the git root.
  • ~/.pi/agent/AGENTS.md · Global instructions loaded at startup. Per-project AGENTS.md files in cwd or any ancestor merge in too.
  • ~/.pi/agent/SYSTEM.md · Replaces the default system prompt entirely. Use APPEND_SYSTEM.md if you only want to tack things on.
  • ~/.pi/agent/prompts/ · Reusable prompt templates. Drop a foo.md in here and run it mid-session with /foo.
  • ~/.pi/agent/extensions/ · TypeScript modules registering custom tools, slash commands, and UI panels.
  • ~/.pi/agent/keybindings.json · Override key bindings if the defaults clash with your terminal.
  • ~/.pi/agent/sessions/ · JSONL session logs grouped by working directory — handy for resuming or grepping a past chat.
tip — skills, AGENTS.md, and prompts hot-reload. Drop a file, run locca pi, no restart needed.
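
A minimal sketch of that loop using the prompts/ folder (the template name and body are just an example):

$ mkdir -p ~/.pi/agent/prompts
$ printf 'Review the staged diff and list risky changes.\n' \
    > ~/.pi/agent/prompts/review.md
$ locca pi            # inside the session, run it with /review
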
$ pi install git:https://github.com/badlogic/pi-skills
# pulls a curated bundle into ~/.pi/agent/git/ and wires it
# into ~/.pi/agent/skills/. Browse the catalog at the repo above.
locca demo · 15 s