llama-swap is an open-source model swapping server for local AI stacks. It sits in front of OpenAI-compatible and Anthropic-compatible backends, then loads, unloads, and routes requests to the right local model on demand. The GitHub repository describes support for llama.cpp, vLLM, tabbyAPI, stable-diffusion.cpp, and other compatible servers, with a focus on running multiple generative AI models from one simple proxy.
The project is useful because local model workflows often become messy as soon as users want more than one model. A developer may run one server for coding, another for embeddings, another for image generation, and another for speech. llama-swap gives that environment a single entry point and a configuration file that maps model IDs to upstream commands. It can start a backend when a model is requested, proxy the API call, and unload models when they are no longer needed.
The source repository lists broad API compatibility. It covers OpenAI endpoints such as chat completions, responses, embeddings, models, speech, transcriptions, voices, and image generation or editing. It also supports Anthropic messages endpoints, llama.cpp reranking and completion endpoints, and stable-diffusion.cpp image endpoints. On top of proxying, llama-swap exposes its own UI, health endpoint, metrics endpoint, running model list, manual unload operations, and log streaming.
For builders, the strongest reason to use llama-swap is operational control. The project supports API key restriction, concurrent loading rules, model TTLs, Docker and Podman lifecycle commands, preload hooks, request filters, aliases, dynamic port assignment, macros, and model-name overrides. That makes it useful for local labs, workstation setups, self-hosted inference demos, and teams that want one OpenAI-compatible base URL while still moving between models and backends.
The install story is also friendly for a local infrastructure tool. The repository mentions Docker, Homebrew, WinGet, release binaries, and source builds. The project is written mostly in Go with a Svelte and TypeScript web UI, and the repo shows active releases. The main caveat is that llama-swap does not make weak hardware or overloaded GPUs disappear. Teams still need to size VRAM, tune model loading, protect local endpoints, and test failure behavior. OpenTools classifies it as AI infrastructure for local model routing, not as an LLM itself.
The best way to evaluate llama-swap is to put it in front of two or three local backends and run realistic requests through a single client configuration. Test model aliases, cold-start time, unload behavior, concurrent requests, log output, health checks, and GPU memory recovery after a model is stopped. If the proxy lets the team switch between coding, embeddings, image, and speech workloads without rewriting client code, it is doing its job.
llama-swap is especially helpful in workshops, local labs, and self-hosted internal tools where the model inventory changes often. It gives users one place to define model IDs and lifecycle commands, while preserving compatibility with popular OpenAI-style clients. The project should still be protected like any inference endpoint. Set API keys where appropriate, restrict network exposure, monitor resource use, and document which local models are allowed for which data classes.