AI Tool Radar
OSI-open | Local inference and "what runs on my machine"

tokenspeed

lightseekorg

Inference engine pairing a C++ scheduler with CUDA kernels for high-throughput agentic LLM serving.

1.5k stars (as of 2026-06-26)

What is tokenspeed?

tokenspeed is an LLM inference engine aimed at agentic workloads. It pairs a C++ control-plane scheduler with a Python execution plane and pluggable CUDA kernels (including a multi-head latent attention implementation). Developed by the non-profit LightSeek Foundation, it positions itself as offering 'TensorRT-LLM-level performance with vLLM-level usability'.

Pros & Cons

Pros

  • Real low-level engineering (a custom C++ scheduler plus GPU kernels), not a thin wrapper
  • MIT-licensed and backed by a non-profit foundation
  • Explicitly designed for agentic, high-throughput serving

Cons

  • Targets top-end Blackwell/B200-class GPUs, so it is out of reach for most users without datacenter hardware
  • The README has no in-repo install or usage instructions and points to external docs instead, which signals a maturity gap
  • Flagship throughput numbers (e.g. 580 tokens/s) are self-reported and have not been independently benchmarked by third parties
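
Because the throughput figures are self-reported, readers with suitable hardware may want to check them independently. The following is a minimal sketch of how tokens/s can be measured, not tokenspeed's own benchmark method. The `generate` callable is a hypothetical wrapper around whatever client you use and is assumed to return the number of tokens produced; `fake_generate` is a stand-in for demonstration only.

```python
import time
from statistics import median
from typing import Callable


def measure_tokens_per_second(
    generate: Callable[[], int], runs: int = 5, warmup: int = 1
) -> float:
    """Return the median throughput (tokens/s) over several timed runs.

    `generate` is a hypothetical callable: it performs one generation
    request and returns how many tokens were produced.
    """
    # Warm-up runs are discarded so one-off startup costs do not skew results.
    for _ in range(warmup):
        generate()

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)

    # The median is less sensitive to a single slow or fast outlier run.
    return median(rates)


def fake_generate() -> int:
    """Stand-in for a real request: pretends to emit 100 tokens in ~50 ms."""
    time.sleep(0.05)
    return 100


if __name__ == "__main__":
    rate = measure_tokens_per_second(fake_generate, runs=3)
    print(f"median throughput: {rate:.1f} tokens/s")
```

A single-request number like this is only comparable to a vendor figure if the model, batch size, prompt length, and hardware match, so those should be recorded alongside the result.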

License

MIT (OSI-open)

When it is interesting

Teams serving large MoE models on datacenter Blackwell-class hardware who want a hackable, kernel-level alternative to vLLM/TensorRT-LLM.

When it is too early

If you lack datacenter GPUs, need stable releases and docs, or require verified benchmarks.

Commercial alternative & related

  • Commercial counterpart: NVIDIA NIM

This repo featured in the 2026-07 edition of the Open-Source AI Radar.