🔥 Hot Repo: Beats TensorRT-LLM — vLLM Made It Their Day-0 Partner

LightSeek Foundation's TokenSpeed inference engine outperforms TensorRT-LLM by up to 11% on agentic workloads and shipped with exclusive vLLM day-0 integration on NVIDIA Blackwell.

By OMC Editorial on 2026-05-08

One-liner: TokenSpeed is a new MIT-licensed LLM inference engine that outperforms TensorRT-LLM on agentic workloads while offering vLLM-level usability, with day-0 integration already shipped by vLLM and NVIDIA Dynamo.

- Repo: lightseekorg/tokenspeed (https://github.com/lightseekorg/tokenspeed)
- Stars: ⭐ 730 (+730 in first 48 hours)
- Language: Python
- License: MIT

---

What It Does

TokenSpeed is a new LLM inference engine from LightSeek Foundation, purpose-built for agentic workloads where contexts routinely exceed 50K tokens across dozens of conversation turns. Its architecture combines a C++ control-plane scheduler with a pluggable kernel system, including one of the fastest MLA (Multi-head Latent Attention) implementations available for NVIDIA Blackwell GPUs. The stated goal: TensorRT-LLM performance with vLLM usability.

Why It's Blowing Up

TokenSpeed launched on May 6, 2026 with a concrete benchmark story: on Kimi K2.5 running on NVIDIA B200, it beats TensorRT-LLM by 9% in the min-latency setting (batch size 1) and delivers 11% higher throughput at 100 TPS/User, the threshold most coding agents require. More striking is the MLA kernel, which nearly halves decode latency compared to TensorRT-LLM on speculative decoding workloads.

The launch was amplified by two partnerships announced the same day. vLLM declared itself TokenSpeed's "exclusive day-0 launch partner," integrating the MLA library directly. NVIDIA Dynamo also shipped day-0 support. These aren't symbolic endorsements: they mean production ML teams can access TokenSpeed's kernel improvements through tools they already run.

For a low-level GPU infrastructure project with no demo UI and strict Blackwell hardware requirements, 730 stars in under 48 hours signals the inference community took notice immediately.

Key Features

- MLA Kernel: nearly halves decode latency vs. TensorRT-LLM on Blackwell for speculative decoding workloads
- Local-SPMD Modeling: static compiler generates collective communication from