tokenspeed
lightseekorg
Inference engine pairing a C++ scheduler with CUDA kernels for high-throughput agentic LLM serving.
What is tokenspeed?
An LLM inference engine aimed at agentic workloads, pairing a C++ control-plane scheduler with a Python execution plane and pluggable CUDA kernels (including a multi-head latent attention implementation). Developed by the non-profit LightSeek Foundation, it positions itself as offering 'TensorRT-LLM-level performance with vLLM-level usability'.
Pros & Cons
Pros
- Real low-level engineering (a custom C++ scheduler plus GPU kernels), not a thin wrapper
- MIT-licensed and backed by a non-profit foundation
- Explicitly designed for agentic, high-throughput serving
Cons
- Targets top-end Blackwell/B200-class GPUs, which puts it out of reach for most teams
- The README has no in-repo install or usage instructions and points to external docs instead, which signals a maturity gap
- Flagship throughput numbers (e.g. 580 tokens/s) are self-reported and have not been independently benchmarked by third parties
License
MIT (OSI-open)
When it is interesting
Teams serving large MoE models on datacenter Blackwell-class hardware who want a hackable, kernel-level alternative to vLLM/TensorRT-LLM.
When it is too early
If you lack datacenter GPUs, need stable releases and in-repo docs, or require independently verified benchmarks.
Commercial alternative & related
- Commercial counterpart: NVIDIA NIM
This repo featured in the 2026-07 edition of the Open-Source AI Radar.
oMLX
jundot
macOS-native LLM inference server for Apple Silicon with continuous batching and SSD-tiered caching.
apfel
Arthur-Ficial
Exposes the on-device Apple Intelligence model on macOS 26 as a zero-setup OpenAI-compatible local API.
shimmy
Michael-A-Kuykendall
Pure-Rust local inference engine with an OpenAI-compatible API, shipped as one binary.