🔥 Hot Repo: TokenSpeed Beats TensorRT-LLM, and vLLM Made It Their Day-0 Partner
LightSeek Foundation's TokenSpeed inference engine outperforms TensorRT-LLM by up to 11% on agentic workloads and shipped with exclusive vLLM day-0 integration on NVIDIA Blackwell.
By OMC Editorial on 2026-05-08
One-liner — TokenSpeed is a new MIT-licensed LLM inference engine that outperforms TensorRT-LLM on agentic workloads while offering vLLM-level usability, with day-0 integration already shipped by vLLM and NVIDIA Dynamo.
- Repo: lightseekorg/tokenspeed (https://github.com/lightseekorg/tokenspeed)
- Stars: ⭐ 730 (+730 in first 48 hours)
- Language: Python
- License: MIT
---
What It Does
TokenSpeed is a new LLM inference engine from LightSeek Foundation, purpose-built for agentic workloads where contexts routinely exceed 50K tokens across dozens of conversation turns. Its architecture combines a C++ control-plane scheduler with a pluggable kernel system, including one of the fastest MLA (Multi-head Latent Attention) implementations available for NVIDIA Blackwell GPUs. The stated goal: TensorRT-LLM performance with vLLM usability.
Why It's Blowing Up
TokenSpeed launched on May 6, 2026 with a concrete benchmark story: on Kimi K2.5 running on NVIDIA B200, it beats TensorRT-LLM by 9% in the min-latency setting (batch size 1) and delivers 11% higher throughput at 100 TPS/User (tokens per second per user), the threshold most coding agents require. More striking is the MLA kernel, which nearly halves decode latency compared to TensorRT-LLM on speculative decoding workloads.
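To put the 100 TPS/User figure in concrete terms, here is a minimal Python sketch of the arithmetic behind it. The helper names and the baseline throughput value are hypothetical, chosen only for illustration; they are not measurements from the TokenSpeed benchmark.

```python
def per_token_budget_ms(tps_per_user: float) -> float:
    """Average time available per decoded token (ms) at a given per-user rate."""
    if tps_per_user <= 0:
        raise ValueError("tps_per_user must be positive")
    return 1000.0 / tps_per_user


def relative_gain(candidate: float, baseline: float) -> float:
    """Fractional improvement of candidate over baseline (0.11 means 11%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


# 100 tokens/s per user leaves 10 ms on average for each decoded token.
budget_ms = per_token_budget_ms(100)

# Hypothetical numbers: a baseline of 100 units of throughput and a
# candidate 11% above it. These are placeholders, not benchmark results.
gain = relative_gain(candidate=111.0, baseline=100.0)

print(f"Per-token budget at 100 TPS/User: {budget_ms:.1f} ms")
print(f"Relative throughput gain: {gain:.0%}")
```

In other words, holding 100 TPS/User means the engine has roughly 10 ms per token for every active user, so an 11% throughput gain at that operating point translates into serving more users without slipping below that budget.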
The launch was amplified by two partnerships announced the same day. vLLM declared itself TokenSpeed's "exclusive day-0 launch partner," integrating the MLA library directly. NVIDIA Dynamo also shipped day-0 support. These aren't symbolic endorsements — they mean production ML teams can access TokenSpeed's kernel improvements through tools they already run.
For a low-level GPU infrastructure project with no demo UI and strict Blackwell hardware requirements, 730 stars in under 48 hours signals the inference community took notice immediately.
Key Features
- MLA Kernel — nearly halves decode latency vs. TensorRT-LLM on Blackwell for speculative decoding workloads
- Local-SPMD Modeling — static compiler generates collective communication from