OSI-openLocal inference and "what runs on my machine"

vllm-mlx

waybarrios

vLLM-style local server for Apple Silicon that speaks both the OpenAI and Anthropic APIs, with multimodal support.

1.4k stars(as of 2026-06-26)View on GitHub

Overview

What is vllm-mlx?

A vLLM-style local inference server for Apple Silicon that exposes OpenAI- and Anthropic-compatible APIs at once, running LLMs and vision-language models on a native MLX/Metal backend. It adds continuous batching, paged and prefix KV caching, MCP tool calling, structured JSON output and multimodal (image, video, audio) support, and works as a Claude Code backend.

Analysis

Pros & Cons

Pros

One server speaks both the OpenAI and Anthropic APIs, a drop-in for Claude Code and OpenAI SDK clients
Production-style serving features (continuous batching, paged/prefix cache, metrics) rare in MLX projects
True multimodal: LLMs, vision-language models, plus TTS and STT in one server

Cons

Apple Silicon only, no NVIDIA, CPU or cross-platform path
Pre-1.0 (v0.3.0), APIs and stability still maturing
Headline tokens-per-second figures are self-reported and hardware-specific

License

Apache-2.0 (OSI-open)

When it is interesting

One OpenAI- and Anthropic-compatible local endpoint to run LLMs and vision-language models on Apple Silicon, e.g. as a Claude Code backend.

When it is too early

If you need production-grade stability or non-Apple hardware; it is pre-1.0 and Metal-locked.

This repo featured in the 2026-07 edition of the Open-Source AI Radar.

Similar repositories

oMLX

jundot

16.6k

macOS-native LLM inference server for Apple Silicon with continuous batching and SSD-tiered caching.

OSI-openLocal inference and "what runs on my machine"

apfel

Arthur-Ficial

5.8k

Expose the on-device Apple Intelligence model on macOS 26 as a zero-setup OpenAI-compatible local API.

OSI-openLocal inference and "what runs on my machine"

shimmy

Michael-A-Kuykendall

5.3k

Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.

OSI-openLocal inference and "what runs on my machine"