vllm-mlx
waybarrios
vLLM-style local server for Apple Silicon that speaks both the OpenAI and Anthropic APIs, with multimodal support.
What is vllm-mlx?
A vLLM-style local inference server for Apple Silicon that exposes OpenAI- and Anthropic-compatible APIs at once, running LLMs and vision-language models on a native MLX/Metal backend. It adds continuous batching, paged and prefix KV caching, MCP tool calling, structured JSON output and multimodal (image, video, audio) support, and works as a Claude Code backend.
Pros & Cons
Pros
- One server speaks both the OpenAI and Anthropic APIs, a drop-in for Claude Code and OpenAI SDK clients
- Production-style serving features (continuous batching, paged/prefix cache, metrics) rare in MLX projects
- True multimodal: LLMs, vision-language models, plus TTS and STT in one server
Cons
- Apple Silicon only, no NVIDIA, CPU or cross-platform path
- Pre-1.0 (v0.3.0), APIs and stability still maturing
- Headline tokens-per-second figures are self-reported and hardware-specific
License
Apache-2.0 (OSI-open)
When it is interesting
One OpenAI- and Anthropic-compatible local endpoint to run LLMs and vision-language models on Apple Silicon, e.g. as a Claude Code backend.
When it is too early
If you need production-grade stability or non-Apple hardware; it is pre-1.0 and Metal-locked.
This repo featured in the 2026-07 edition of the Open-Source AI Radar.
oMLX
jundot
macOS-native LLM inference server for Apple Silicon with continuous batching and SSD-tiered caching.
apfel
Arthur-Ficial
Expose the on-device Apple Intelligence model on macOS 26 as a zero-setup OpenAI-compatible local API.
shimmy
Michael-A-Kuykendall
Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.