shimmy
Michael-A-Kuykendall
Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.
What is shimmy?
A pure-Rust inference engine with an OpenAI-API-compatible endpoint, shipped as a single binary: no Python, no llama.cpp. It runs on Vulkan, D3D12 and Metal, so CUDA is not required, and auto-discovers models from HuggingFace, Ollama and LM Studio.
Pros & Cons
Pros
- Single binary, no Python or C++ toolchain
- Broad GPU coverage without a CUDA dependency
- Drop-in OpenAI API for local models
Cons
- The Airframe GPU core cannot be built from source by the public - a real caveat for an 'open' tool
- One model per server instance, no multi-model
- MoE not yet implemented; performance claims (startup <100ms vs Ollama) are unverified project claims
License
Apache-2.0 (OSI-open)
Apache-2.0 per the badges (the README text says MIT - a genuine inconsistency worth checking before you rely on it).
When it is interesting
OpenAI-API drop-in on mixed GPU hardware without Python.
When it is too early
If you need to audit or build the GPU core yourself, or want multi-model serving.
This repo featured in the 2026-06 edition of the Open-Source AI Radar.
oMLX
jundot
macOS-native LLM inference server for Apple Silicon with continuous batching and SSD-tiered caching.
apfel
Arthur-Ficial
Expose the on-device Apple Intelligence model on macOS 26 as a zero-setup OpenAI-compatible local API.
whichllm
Andyyyy64
CLI that detects your hardware and ranks local LLMs that will run well on it, scored against real benchmarks.