Skip to main content
AI Tool Radar
OSI-openLocal inference and "what runs on my machine"

shimmy

Michael-A-Kuykendall

Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.

5.3k stars(as of 2026-06-05)View on GitHub

What is shimmy?

A pure-Rust inference engine with an OpenAI-API-compatible endpoint, shipped as a single binary: no Python, no llama.cpp. It runs on Vulkan, D3D12 and Metal, so CUDA is not required, and auto-discovers models from HuggingFace, Ollama and LM Studio.

Pros & Cons

Pros

  • Single binary, no Python or C++ toolchain
  • Broad GPU coverage without a CUDA dependency
  • Drop-in OpenAI API for local models

Cons

  • The Airframe GPU core cannot be built from source by the public - a real caveat for an 'open' tool
  • One model per server instance, no multi-model
  • MoE not yet implemented; performance claims (startup <100ms vs Ollama) are unverified project claims

License

Apache-2.0 (OSI-open)

Apache-2.0 per the badges (the README text says MIT - a genuine inconsistency worth checking before you rely on it).

When it is interesting

OpenAI-API drop-in on mixed GPU hardware without Python.

When it is too early

If you need to audit or build the GPU core yourself, or want multi-model serving.

This repo featured in the 2026-06 edition of the Open-Source AI Radar.