llama-swap screenshot

llama-swap

Local AI InfrastructureFree

llama-swap - Local AI Model Swapping Proxy Service

Last updated May 18, 2026

Claim Tool

What is llama-swap?

llama-swap is an open-source model swapping server for local AI stacks. It sits in front of OpenAI-compatible and Anthropic-compatible backends, then loads, unloads, and routes requests to the right local model on demand. The GitHub repository describes support for llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp, and other compatible servers, with a focus on running multiple generative AI models from one simple proxy. The project is useful because local model workflows often become messy as soon as users want more than one model. A developer may run one server for coding, another for embeddings, another for image generation, and another for speech. llama-swap gives that environment a single entry point and a configuration file that maps model IDs to upstream commands. It can start a backend when a model is requested, proxy the API call, and unload models when they are no longer needed. The source repository lists broad API compatibility. It covers OpenAI endpoints such as chat completions, responses, embeddings, models, speech, transcriptions, voices, and image generation or editing. It also supports Anthropic messages endpoints, llama.cpp reranking and completion endpoints, and stable-diffusion.cpp image endpoints. On top of proxying, llama-swap exposes its own UI, health endpoint, metrics endpoint, running model list, manual unload operations, and log streaming. For builders, the strongest reason to use llama-swap is operational control. The project supports API key restriction, concurrent loading rules, model TTLs, Docker and Podman lifecycle commands, preload hooks, request filters, aliases, dynamic port assignment, macros, and model-name overrides. That makes it useful for local labs, workstation setups, self-hosted inference demos, and teams that want one OpenAI-compatible base URL while still moving between models and backends. The install story is also friendly for a local infrastructure tool. The repository mentions Docker, Homebrew, WinGet, release binaries, and source builds. The project is written mostly in Go with a Svelte and TypeScript web UI, and the repo shows active releases. The main caveat is that llama-swap does not make weak hardware or overloaded GPUs disappear. Teams still need to size VRAM, tune model loading, protect local endpoints, and test failure behavior. OpenTools classifies it as AI infrastructure for local model routing, not as an LLM itself. The best way to evaluate llama-swap is to put it in front of two or three local backends and run realistic requests through a single client configuration. Test model aliases, cold-start time, unload behavior, concurrent requests, log output, health checks, and GPU memory recovery after a model is stopped. If the proxy lets the team switch between coding, embeddings, image, and speech workloads without rewriting client code, it is doing its job. llama-swap is especially helpful in workshops, local labs, and self-hosted internal tools where the model inventory changes often. It gives users one place to define model IDs and lifecycle commands, while preserving compatibility with popular OpenAI-style clients. The project should still be protected like any inference endpoint. Set API keys where appropriate, restrict network exposure, monitor resource use, and document which local models are allowed for which data classes.

llama-swap's Top Features

Key capabilities that make llama-swap stand out.

On-demand model loading and unloading through one proxy

OpenAI-compatible and Anthropic-compatible API routing

Works with llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp, and similar servers

Web UI, logs, health checks, metrics, and running model inspection

Configurable aliases, TTLs, request filters, macros, and lifecycle commands

Use Cases

Who benefits most from this tool.

Local model users

Run several local models behind one OpenAI-compatible base URL and swap them on demand.

AI infrastructure builders

Route requests across llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp, and other compatible servers.

Self-hosted AI teams

Add health checks, metrics, logs, unload controls, aliases, and lifecycle commands to a local inference workstation.

Tags

local-aillama-cppvllmmodel-routingopenai-compatibleanthropic-compatibleself-hosted-aigoinferencedeveloper-tools

llama-swap's Pricing

Free plan available

User Reviews

Share your thoughts

If you've used this product, share your thoughts with other builders

Recent reviews

Frequently Asked Questions

What is llama-swap?
llama-swap is an open-source proxy that hot-swaps local AI models and routes API requests to compatible backend servers.
Which APIs does it support?
The repository lists OpenAI-compatible endpoints, Anthropic messages endpoints, llama.cpp endpoints, and stable-diffusion.cpp image endpoints.
How do you install llama-swap?
The project mentions Docker, Homebrew, WinGet, release binaries, and source builds.
Is llama-swap an LLM?
No. It is infrastructure for routing and lifecycle management around local models, not a model family itself.