whichllm is a command-line model picker for people running LLMs locally. Instead of asking only “what is the biggest model I can fit,” it tries to answer the more useful question: which local model should I actually run on this machine? The README says it ranks models by real, recency-aware benchmarks rather than parameter count, while also accounting for VRAM fit and likely speed.
The workflow is simple. Users can run it through uvx or install it with uv, Homebrew, or pip, then ask for recommendations for the detected machine or for a simulated GPU. The README shows examples for an RTX 4090, multi-GPU workstations, GPU-only filtering, speed thresholds, markdown output, JSON output, upgrade comparisons, planning hardware for a target model, starting a chat, and printing copy-paste Python. That makes it useful both as an interactive CLI and as a small automation primitive.
whichllm matters to builders because choosing a local model is messy. A model can technically load yet be too slow, too old, or weaker on the tasks a developer cares about. The project says it combines sources such as LiveBench, Artificial Analysis, Aider, multimodal or vision benchmarks, Chatbot Arena Elo, and Open LLM Leaderboard data. The exact scoring should still be checked against the repository, but the tool’s goal is clear: translate benchmark and hardware data into a practical shortlist.
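To make the idea of recency-aware ranking concrete, here is a minimal Python sketch of one way to blend benchmark results so that newer results count more. It is not whichllm's actual scoring, which should be read from the repository. The `BenchmarkResult` type, the `recency_weighted_score` function, the 0-100 score normalization, and the 180-day half-life are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BenchmarkResult:
    """One benchmark result for a model (hypothetical structure)."""
    source: str     # benchmark name, e.g. "LiveBench"
    score: float    # assumed to be normalized to a 0-100 scale
    age_days: int   # days since the result was published


def recency_weighted_score(
    results: Sequence[BenchmarkResult],
    half_life_days: float = 180.0,  # assumption: a result loses half its weight every 180 days
) -> float:
    """Weighted average of scores, with exponentially decaying weight by age."""
    if not results:
        raise ValueError("at least one benchmark result is required")
    weights = [0.5 ** (r.age_days / half_life_days) for r in results]
    weighted_sum = sum(w * r.score for w, r in zip(weights, results))
    return weighted_sum / sum(weights)
```

With this kind of weighting, a strong but stale result pulls the blended score less than a fresh one, which is the intuition behind preferring a smaller current model over an older larger one.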
whichllm is free to use as an open-source project, but running local LLMs still carries hardware costs. Teams using whichllm should treat it as a decision aid, not an absolute ranking. Validate recommendations with your own prompts, latency targets, memory limits, and deployment stack before buying GPUs or standardizing on a model family.
The tool is strongest during exploration and hardware planning. A developer can quickly compare whether a smaller current model beats an older larger model on a given card, or whether a GPU upgrade changes the set of practical options. Scriptable JSON output also means teams can fold model recommendations into internal docs, benchmark reports, or setup scripts.
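As an illustration of folding recommendations into internal docs, the sketch below turns a JSON list of recommendations into a Markdown table. The field names (`name`, `score`, `vram_gb`) and the sample values are hypothetical placeholders, not whichllm's documented output schema. Check the tool's real JSON output and adjust the keys before relying on it.

```python
import json

# Placeholder data in an ASSUMED shape; the real whichllm JSON may differ.
SAMPLE = """
[
  {"name": "example-model-a", "score": 64.0, "vram_gb": 9.5},
  {"name": "example-model-b", "score": 71.5, "vram_gb": 18.2}
]
"""


def to_markdown_table(raw_json: str) -> str:
    """Render recommendations as a Markdown table, best score first."""
    rows = json.loads(raw_json)
    lines = ["| Model | Score | VRAM (GB) |", "| --- | --- | --- |"]
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        lines.append(
            f"| {row['name']} | {row['score']:.1f} | {row['vram_gb']:.1f} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_markdown_table(SAMPLE))
```

In practice the JSON would come from the tool's saved output rather than an inline string, and the table could be written into a setup guide or benchmark report.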
The limitation is that no benchmark mix perfectly predicts your workload. Code generation, retrieval, tool use, chat quality, multimodal tasks, and throughput each stress models differently. Use whichllm to narrow the field, then run your own eval prompts and measure local latency, memory pressure, and failure modes before committing to a default model.
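One way to run that validation step is a small timing harness around whatever runtime you use. The sketch below is runtime-agnostic: `generate` is a hypothetical callable you would supply (for example, a thin wrapper around your local inference server), and counting empty output as a failure is an assumption, not a standard. It measures wall-clock latency only and does not track memory.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_latency(
    generate: Callable[[str], str],  # hypothetical: your own wrapper around a local model
    prompts: Iterable[str],
) -> Dict[str, float]:
    """Time each prompt and count failures (exceptions or empty output)."""
    durations = []
    failures = 0
    for prompt in prompts:
        start = time.perf_counter()
        try:
            output = generate(prompt)
            if not output.strip():
                failures += 1  # assumption: empty output counts as a failure
        except Exception:
            failures += 1
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("at least one prompt is required")
    return {
        "runs": len(durations),
        "failures": failures,
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }
```

Running the same prompt set against each shortlisted model gives a like-for-like comparison on your own hardware, which is more informative than any published throughput figure.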
whichllm also helps reduce wasted download time. Local model files are large, and trial-and-error selection can burn hours on models that barely fit or run too slowly. A hardware-aware shortlist gives developers a better first pass before they spend time pulling weights and tuning runtimes.
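A quick back-of-the-envelope check shows why some models "barely fit." The sketch below uses a common rule of thumb: weight memory is roughly parameter count times bits per weight divided by 8, plus headroom for the KV cache and runtime buffers. The 20% overhead is an assumption, and this is not how whichllm computes fit; real usage varies with context length, quantization format, and runtime.

```python
def estimate_vram_gb(
    params_billions: float,
    bits_per_weight: float,
    overhead_fraction: float = 0.2,  # assumption: 20% headroom for KV cache and buffers
) -> float:
    """Rough VRAM estimate in decimal GB for a quantized model."""
    # 1e9 parameters * (bits / 8) bytes per parameter = that many decimal gigabytes.
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead_fraction)


def fits(
    params_billions: float,
    bits_per_weight: float,
    vram_gb: float,
    overhead_fraction: float = 0.2,
) -> bool:
    """True if the rough estimate is within the available VRAM."""
    return estimate_vram_gb(params_billions, bits_per_weight, overhead_fraction) <= vram_gb
```

Under these assumptions, a 70B model at 4 bits per weight needs about 35 GB for weights alone, so it would not fit on a single 24 GB card, while an 8B model at the same precision fits comfortably.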