Switching From Ollama to llama.cpp For Local Inference

For a while there, my home lab server and I had a bad relationship.

The box is a headless Ubuntu machine with an RTX 5090 in it. It had been serving local models through Ollama for months. Most of the time it was fine. The model loaded, it answered questions, life was good. But the moment I asked an agent to actually do something, things got weird. I’d watch it confidently announce its plan. “I’ll search the database for that record.” And then nothing. No tool call. Just a model narrating its intentions like when I describe the workout I’m definitely going to do tomorrow.

Sometimes the tool call worked. Sometimes it didn’t. There was no pattern I could find, which is the worst kind of bug. You can fix consistent. You cannot fix “depends on the vibes.” I’d rerun the exact same prompt and get two different behaviors, and slowly lose my mind. Same model the whole time, by the way. Gemma 4 31B, before and after. So this was never about the weights. It was about everything wrapped around them.

So I ripped out Ollama and went straight to llama.cpp.

I’d been putting it off because Ollama is so easy and I assumed the alternative meant a weekend of fighting build flags. It did take some setup. But the payoff was immediate and honestly kind of stunning: the tool-calling problem just disappeared. Not “improved.” Disappeared. The model calls the tool when it should call the tool. Every time. I keep waiting for it to flake out and it hasn’t.

My theory, for what it’s worth: Ollama wraps a lot of stuff for you, and somewhere in that abstraction the tool-call formatting was getting mangled often enough to break the agent loop. The model would intend to call a tool, the output wouldn’t get parsed as a tool call, and the agent would just treat it as a chat message and stop. Going closer to the metal meant fewer layers between the model and the thing reading its output. Less to go wrong. That’s the whole reason I wanted this in the first place: more control, my own build flags, explicit quant choices, real context tuning, one fewer thing standing between me and the engine.

Here’s how I actually did it, in case you’re staring down the same migration.

Look before you leap

Before touching anything, I took inventory, and this step paid for itself several times over. Ollama was on v0.30.7, running as a systemd service. Models were eating about 97 GB of blobs under /usr/share/ollama/.ollama. The GPU is a 5090, which matters for the llama.cpp build.

The piece that shaped everything: I run a little Node logging proxy on 0.0.0.0:11435 that every client actually talks to. It forwards each request to the engine, streams the response back untouched, and logs metadata to SQLite. Ollama itself was bound to localhost behind it. So the real constraint wasn’t “replace Ollama,” it was “replace Ollama without the proxy or any of its clients noticing.”

And because that proxy logs everything, I didn’t have to guess what depended on Ollama. I just queried the log database. Three things fell out of that. About 80% of traffic was already OpenAI-style /v1/chat/completions, which llama-server speaks natively, so the proxy would forward it unchanged. A burst of Ollama-native /api/chat calls turned out to be from a one-time setup day, nothing live. And one model was an Ollama cloud model, roughly a quarter of all calls, which simply doesn’t exist in a llama.cpp world. I made my peace with losing it.

That measurement turned a scary “rewrite everything” into a “change one port” job. Don’t trust your memory of what’s running. Read the traffic.
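If you want to run the same kind of inventory, here is a minimal sketch of the counting step. It assumes you can export your request log as tab-separated "endpoint, model" lines. The table and column names in the comment (requests, endpoint, model) and the file name proxy.db are made up for illustration; they are not the actual schema of my proxy.

```shell
#!/usr/bin/env bash
# Count requests per endpoint/model pair, busiest first.
# Input: tab-separated "endpoint<TAB>model" lines on stdin.
# Hypothetical export (table and column names are assumptions):
#   sqlite3 -tabs proxy.db 'SELECT endpoint, model FROM requests' | summarize_traffic
summarize_traffic() {
  awk -F'\t' '
    NF >= 2 { count[$1 "\t" $2]++; total++ }
    END {
      # Output columns: count, share of all requests, endpoint, model
      for (key in count)
        printf "%d\t%.1f%%\t%s\n", count[key], 100 * count[key] / total, key
    }
  ' | sort -t $'\t' -k1,1nr -k3
}
```

The point is only to get a ranked list of who calls what, so a dead endpoint or a model that will not survive the move shows up before you cut anything over.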

Building llama.cpp for a Blackwell GPU

The newer GPU arch is the part most likely to ruin your afternoon. The 5090 is sm_120, which needs a recent CUDA toolkit (12.8 for me) and an explicit build target. I used GCC 13 as the host compiler, which CUDA 12.8 supports.

sudo apt install -y cmake libcurl4-openssl-dev ccache

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
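# CMAKE_CUDA_ARCHITECTURES=120 targets Blackwell / RTX 5090.
# (A comment after a trailing backslash would break the line continuation.)
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DLLAMA_CURL=ON \
  -DGGML_CCACHE=ON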
cmake --build build --config Release -j

Then check the build actually saw the card:

$ ./build/bin/llama-server --list-devices
Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32120 MiB, 31614 MiB free)

Later, in the load log, I got CUDA : ARCHS = 1200 and BLACKWELL_NATIVE_FP4 = 1, which told me it compiled against the native Blackwell paths and not some fallback. Don’t bother with prebuilt release binaries on a just-released GPU. They’re a gamble on whether sm_120 made it in. Building from source is the reliable path.
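If you are building for a different card, the number passed to CMAKE_CUDA_ARCHITECTURES is the GPU's compute capability with the dot removed (12.0 becomes 120). A small sketch of that mapping follows; the function name is mine, and the nvidia-smi query in the comment assumes a driver recent enough to report compute_cap.

```shell
#!/usr/bin/env bash
# Convert a compute capability such as "12.0" into the value CMake expects ("120").
cap_to_arch() {
  local cap=$1
  if [[ $cap =~ ^([0-9]+)\.([0-9]+)$ ]]; then
    printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    echo "unexpected compute capability: $cap" >&2
    return 1
  fi
}
# Usage on a machine with the NVIDIA driver installed:
#   cap_to_arch "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```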

Getting the model

I started clean, no migrating old blobs. Main model is Gemma 4 31B, the dense instruction-tuned one. It’s Apache-2.0 and ungated, so no Hugging Face token dance, and it uses sliding-window attention on most layers, which makes its KV cache unusually cheap. That’s why it ships with a 256K context window.

I grabbed the quantization-aware-trained GGUF from unsloth (UD-Q4_K_XL, 17.3 GB). For Gemma, QAT at Q4 lands close to Q5/Q6 quality but about 5 GB smaller, which is real headroom on a 32 GB card.

One gotcha here. A recent llama.cpp change moved the -hf auto-downloader to a native HTTPS client that wants -DLLAMA_OPENSSL=ON at build time. Mine didn’t have it, so -hf couldn’t fetch. Rather than rebuild, I just pulled the file directly, since the system’s own wget has working TLS:

cd ~/models
wget -c "https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF/resolve/main/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf"
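A direct download can silently leave you with an HTML error page or a truncated file under a .gguf name. One cheap sanity check is the file header, since a GGUF file begins with the four ASCII bytes "GGUF". This is a sketch with a helper name I made up, and it only confirms the header, not that the download is complete:

```shell
#!/usr/bin/env bash
# Succeed if the file starts with the 4-byte GGUF magic.
is_gguf() {
  local magic
  # tr strips NUL bytes so the shell does not warn on other binary files
  magic=$(head -c 4 -- "$1" 2>/dev/null | tr -d '\0')
  [ "$magic" = "GGUF" ]
}
# Usage:
#   is_gguf ~/models/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf && echo "looks like GGUF"
```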

VRAM is mostly KV cache

I went in assuming I’d have to be stingy with context. I was wrong, and measuring proved it. Your VRAM is model weights (fixed) plus KV cache (scales with context) plus compute buffers. The weights are a flat ~17.3 GB. Almost everything above that is context, and because Gemma’s KV is cheap, a 128K window (131,072 tokens) fit fine. Here’s what I actually saw on the hardware:

Config	VRAM used	Free
32K context, 4 slots (cautious default)	23.4 GB	8.7 GB
128K context, 1 slot (f16 KV)	28.8 GB	3.3 GB
128K context + flash-attn + q8_0 KV	25.7 GB	6.5 GB

Two flags bought back the headroom: -fa on for flash attention, and -ctk q8_0 -ctv q8_0 to quantize the KV cache for basically no quality cost. That’s the difference between “3 GB free and nervous” and “6.5 GB free and comfortable” at 128K context.
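For intuition about why the KV cache type matters, here is a back-of-the-envelope estimate. It is a sketch for a model where every layer uses full attention: cache size is 2 (one K and one V tensor) times layers times KV heads times head dimension times context length times bytes per value. f16 is 2 bytes per value, and q8_0 packs 32 values into a 34-byte block. The dimensions in the usage line are placeholders, not Gemma's real numbers, and Gemma's sliding-window layers make its actual cache far smaller than this upper bound.

```shell
#!/usr/bin/env bash
# Estimate KV cache size in MiB, assuming full attention on every layer.
# Args: layers kv_heads head_dim context cache_type(f16|q8_0)
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 type=${5:-f16}
  local elems=$(( 2 * layers * kv_heads * head_dim * ctx ))  # 2 = K plus V
  local bytes
  case "$type" in
    f16)  bytes=$(( elems * 2 )) ;;        # 2 bytes per value
    q8_0) bytes=$(( elems / 32 * 34 )) ;;  # 32 values per 34-byte block
    *) echo "unknown cache type: $type" >&2; return 1 ;;
  esac
  echo $(( bytes / 1024 / 1024 ))
}
# Placeholder dimensions (not Gemma's):
#   kv_cache_mib 32 8 128 131072 f16
#   kv_cache_mib 32 8 128 131072 q8_0
```

The takeaway is the ratio: q8_0 costs a little over half of f16 per cached value, which is where most of the reclaimed headroom comes from.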

Keeping “call any model by name”

Plain llama-server loads one model per process, and the one thing I’d actually miss about Ollama is requesting models by name and letting it handle the swap. llama-swap gives that back. It’s an OpenAI-compatible front-end that launches llama-server on demand, swaps models by name, and exposes /v1/models.

# ~/llama-swap/config.yaml
healthCheckTimeout: 300
logLevel: info

models:
  "gemma-4-31b-it":
    cmd: |
      /home/tduffy/llama.cpp/build/bin/llama-server
      -m /home/tduffy/models/gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
      --host 127.0.0.1 --port ${PORT}
      -ngl 99 -c 131072 -fa on -ctk q8_0 -ctv q8_0 --jinja
    aliases:
      - "gemma4:31b"   # so old clients keep working unchanged
    ttl: 1800          # unload after 30 min idle

That alias line is the quiet hero of the whole cutover. One LAN client was hardcoded to ask for the old model name. Aliasing that name to the new Gemma model meant that client needed zero changes. llama-swap binds to 127.0.0.1:9090, internal only, so the proxy stays the single public face.

The cutover was one port

Because the proxy is transparent and already parses both Ollama and OpenAI response shapes, the entire cutover was repointing its upstream from Ollama on :11434 to llama-swap on :9090. Both proxy and dashboard were already PM2-managed, so:

OLLAMA_PORT=9090 pm2 restart ollama-logging --update-env
pm2 save     # persist across reboots

Then one end-to-end test through the public endpoint:

curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma4:31b","messages":[{"role":"user","content":"Reply: CUTOVER OK"}]}'
# -> "CUTOVER OK"

A request to the old model name flowed client → proxy :11435 → llama-swap :9090 → Gemma, and the proxy logged it correctly. And up to this point everything was reversible. If anything broke, pointing the proxy back at Ollama was one command. I cannot stress enough how much calmer that makes the whole thing. Stand up the replacement, prove it works, then burn the boats.
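To make the "one command" concrete, cutover and rollback are the same restart with a different port. The sketch below wraps it in a function. The PM2 app name, the OLLAMA_PORT variable, and both ports come from my setup above; the function name and the DRY_RUN switch are additions for illustration.

```shell
#!/usr/bin/env bash
# Repoint the logging proxy at a different upstream port.
# With DRY_RUN=1 the command is printed instead of executed.
repoint_proxy() {
  local port=$1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "OLLAMA_PORT=$port pm2 restart ollama-logging --update-env"
  else
    OLLAMA_PORT="$port" pm2 restart ollama-logging --update-env
  fi
}
# repoint_proxy 9090    # cut over to llama-swap
# repoint_proxy 11434   # roll back to Ollama (only works while Ollama is still installed)
```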

Burning the boats

Once the replacement was verified, I did the irreversible part:

sudo systemctl disable --now ollama
sudo rm -f  /usr/local/bin/ollama
sudo rm -f  /etc/systemd/system/ollama.service
sudo rm -rf /etc/systemd/system/ollama.service.d
sudo systemctl daemon-reload
sudo userdel ollama
sudo rm -rf /usr/share/ollama        # ~97 GB
rm -rf ~/.ollama

That gave me back about 97 GB. (One harmless quirk: my shell kept printing the old ollama path afterward. That’s just bash’s command-hash cache. hash -r clears it.)

The part I almost forgot

The migration wasn’t done when Ollama was gone. Other things on the box were calling Ollama directly, bypassing my mental model entirely. Two Hermes agent gateways, one a systemd-user service and one a Docker container, had configs pointing straight at localhost:11434, which was now a dead port.

The fixes were small but necessary. Repoint each base_url from :11434 to the proxy on :11435, so their traffic gets logged too. One of them referenced a model name I’d deleted, so I pointed it at the new one. Restart each, then confirm in the proxy logs that their startup probes hit the new endpoint with a 200. Grep your whole system for the old endpoint before you declare victory. Anything that talked to :11434 directly is quietly broken until you go find it.
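A minimal version of that grep, as a sketch. The function name and the directory list in the comment are assumptions, so adjust them to wherever your configs live. It will not see config baked into a container image or held only in a running process's environment, so check those separately.

```shell
#!/usr/bin/env bash
# List files under the given directories that still mention the old Ollama port.
# -r recurse, -I skip binary files, -l print file names only, -F match fixed strings.
find_old_endpoint() {
  grep -rIlF -e 'localhost:11434' -e '127.0.0.1:11434' -- "$@" 2>/dev/null
}
# Example (directory list is an assumption):
#   find_old_endpoint /etc "$HOME/.config" /opt
```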

Where it stands now

End state: a from-source llama.cpp tuned for the 5090, multi-model serving through llama-swap, the logging layer fully intact, 97 GB back on disk, and agents that actually finish what they start.

That last one is the whole reason I bothered. My Hermes agent went from “promising but flaky” to something I trust to run a multi-step task without babysitting. OpenCode, which I’d basically given up on locally because the agentic loop kept breaking, is genuinely useful now. It picks up a task, calls what it needs, and finishes. That sounds like a low bar. It is a low bar. I just wasn’t clearing it before.

Same model the entire time. It was always this capable. It just finally gets to finish the job.