OSI-openLocal inference and "what runs on my machine"

shimmy

Michael-A-Kuykendall

Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.

5.3k stars(as of 2026-06-05)View on GitHub

Overview

What is shimmy?

A pure-Rust inference engine with an OpenAI-API-compatible endpoint, shipped as a single binary: no Python, no llama.cpp. It runs on Vulkan, D3D12 and Metal, so CUDA is not required, and auto-discovers models from HuggingFace, Ollama and LM Studio.

Analysis

Pros & Cons

Pros

Single binary, no Python or C++ toolchain
Broad GPU coverage without a CUDA dependency
Drop-in OpenAI API for local models

Cons

The Airframe GPU core cannot be built from source by the public - a real caveat for an 'open' tool
One model per server instance, no multi-model
MoE not yet implemented; performance claims (startup <100ms vs Ollama) are unverified project claims

License

Apache-2.0 (OSI-open)

Apache-2.0 per the badges (the README text says MIT - a genuine inconsistency worth checking before you rely on it).

When it is interesting

OpenAI-API drop-in on mixed GPU hardware without Python.

When it is too early

If you need to audit or build the GPU core yourself, or want multi-model serving.

This repo featured in the 2026-06 edition of the Open-Source AI Radar.

Similar repositories

oMLX

jundot

16.6k

macOS-native LLM inference server for Apple Silicon with continuous batching and SSD-tiered caching.

OSI-openLocal inference and "what runs on my machine"

apfel

Arthur-Ficial

5.8k

Expose the on-device Apple Intelligence model on macOS 26 as a zero-setup OpenAI-compatible local API.

OSI-openLocal inference and "what runs on my machine"

whichllm

Andyyyy64

2.8k

CLI that detects your hardware and ranks local LLMs that will run well on it, scored against real benchmarks.

OSI-openLocal inference and "what runs on my machine"