Skip to main content
AI Tool Radar
OSI-openLocal inference and "what runs on my machine"

vllm-mlx

waybarrios

vLLM-style local server for Apple Silicon that speaks both the OpenAI and Anthropic APIs, with multimodal support.

1.4k stars(as of 2026-06-26)View on GitHub

What is vllm-mlx?

A vLLM-style local inference server for Apple Silicon that exposes OpenAI- and Anthropic-compatible APIs at once, running LLMs and vision-language models on a native MLX/Metal backend. It adds continuous batching, paged and prefix KV caching, MCP tool calling, structured JSON output and multimodal (image, video, audio) support, and works as a Claude Code backend.

Pros & Cons

Pros

  • One server speaks both the OpenAI and Anthropic APIs, a drop-in for Claude Code and OpenAI SDK clients
  • Production-style serving features (continuous batching, paged/prefix cache, metrics) rare in MLX projects
  • True multimodal: LLMs, vision-language models, plus TTS and STT in one server

Cons

  • Apple Silicon only, no NVIDIA, CPU or cross-platform path
  • Pre-1.0 (v0.3.0), APIs and stability still maturing
  • Headline tokens-per-second figures are self-reported and hardware-specific

License

Apache-2.0 (OSI-open)

When it is interesting

One OpenAI- and Anthropic-compatible local endpoint to run LLMs and vision-language models on Apple Silicon, e.g. as a Claude Code backend.

When it is too early

If you need production-grade stability or non-Apple hardware; it is pre-1.0 and Metal-locked.

This repo featured in the 2026-07 edition of the Open-Source AI Radar.