🔥 Hot Repo: SSD Cache Cuts Claude Code Context Load From 90s to 3s on Mac
oMLX is the Apple Silicon LLM server turning your Mac into a serious coding agent host: its SSD-persisted KV cache eliminates the long prefill wait, and a fresh release just added Gemma 4 MTP and Copilot CLI support.
By OMC Editorial on 2026-05-13
One-liner: oMLX is an Apple Silicon-native LLM inference server that persists KV cache blocks to SSD, slashing repeated-context load times from 30–90 s down to 1–3 s and making local Claude Code sessions genuinely practical on Mac.
- Repo: jundot/omlx (https://github.com/jundot/omlx)
- Stars: ⭐ 13,890
- Language: Python
- License: Apache 2.0
---
What It Does
oMLX is a local LLM inference server built on Apple's MLX framework for M1–M4 Macs. It serves any MLX-format model (text, vision, embedding, reranker) behind an OpenAI-compatible API at localhost:8000. Its core innovation is a two-tier KV cache: hot blocks stay in RAM, cold blocks offload to SSD in safetensors format. Crucially, cache blocks survive server restarts: the next session with a matching prefix reloads from disk instead of recomputing from scratch.
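To make the two-tier idea concrete, here is a minimal sketch of a RAM-plus-disk block store. It is not oMLX's code: the class name `TwoTierBlockCache`, the `hot_capacity` parameter, and the `.bin` files are all hypothetical, and plain byte strings stand in for the safetensors-encoded KV blocks the project actually writes.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierBlockCache:
    """Hypothetical RAM + disk store for opaque KV blocks (illustration only)."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # key -> bytes, least recently used first

    def _path(self, key):
        # Keys are assumed to be filesystem-safe strings (e.g. hex digests).
        return self.cold_dir / f"{key}.bin"

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)
        # Spill the least recently used blocks to disk once the RAM tier is full.
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            self._path(old_key).write_bytes(old_block)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if path.exists():
            block = path.read_bytes()
            self.put(key, block)  # promote the cold block back into RAM
            return block
        return None  # miss in both tiers: the caller would recompute

    def flush(self):
        # Persist everything still in RAM, e.g. before a shutdown.
        for key, block in self.hot.items():
            self._path(key).write_bytes(block)


def demo():
    with tempfile.TemporaryDirectory() as cold_dir:
        cache = TwoTierBlockCache(cold_dir, hot_capacity=2)
        for i in range(3):
            cache.put(f"block{i}", bytes([i]) * 4)
        spilled = sorted(p.name for p in Path(cold_dir).iterdir())
        cache.flush()
        # A fresh instance on the same directory simulates a server restart.
        restarted = TwoTierBlockCache(cold_dir, hot_capacity=2)
        return spilled, restarted.get("block2")


if __name__ == "__main__":
    print(demo())
```

In the demo, the third `put` pushes `block0` out of RAM and onto disk, and the "restarted" cache still finds `block2` because `flush` wrote it out first. A real server would need an eviction policy for the disk tier as well, which this sketch leaves out.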
Why It Is Blowing Up
Apple Silicon Macs, with unified memory of up to 512 GB on the M3 Ultra, have become competitive inference machines, but the developer experience was still painful: every new Claude Code session with a long system prompt sat waiting 30–90 seconds for prefill. oMLX eliminates that penalty for repeated prefixes, which is exactly the pattern local coding agents generate: the same big system prompt, every single session.
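Why does a repeated system prompt cache so well? One common approach to prefix caching, sketched below, is to split the token stream into fixed-size blocks and chain-hash them, so each hash identifies the entire prefix up to that block. Whether oMLX uses exactly this scheme is an assumption; `BLOCK_SIZE`, `block_hashes`, and `reusable_tokens` are illustrative names, not part of its API.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; an assumed value, chosen small for the demo


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash pins down the whole prefix before it."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest.hex())
        parent = digest
    return hashes


def reusable_tokens(tokens, cached_hashes, block_size=BLOCK_SIZE):
    """Count the leading tokens whose blocks are already in the cache."""
    count = 0
    for h in block_hashes(tokens, block_size):
        if h not in cached_hashes:
            break  # the first miss ends the shared prefix
        count += block_size
    return count


if __name__ == "__main__":
    system_prompt = list(range(10))             # stand-in for a long system prompt
    session_1 = system_prompt + [100, 101]
    session_2 = system_prompt + [200, 201, 202]

    cached = set(block_hashes(session_1))       # blocks left behind by session 1
    hit = reusable_tokens(session_2, cached)
    print(f"reuse {hit} tokens, prefill {len(session_2) - hit}")
    # prints "reuse 8 tokens, prefill 5"
```

Only the blocks that lie entirely inside the shared prefix are reused; the block that straddles the point where the two sessions diverge has to be recomputed. With a system prompt of tens of thousands of tokens, that leftover is a rounding error compared with a full prefill.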
A fresh dev release, v0.3.9.dev2, landed on May 12, 2026. It adds Gemma 4 multi-token prediction on both vision and text paths, DFlash engine support for Gemma 4, and the omlx launch copilot command: GitHub Copilot CLI now joins Claude, Codex, OpenClaw, and OpenCode as a one-command launch target. The release also ships an in-admin "Restart Server" button, auto-proxy-build for quantizing models too large to fit in RAM, and ParoQuant support via a pluggable quantization dispatcher.
The project includes a dedicated Claude Code optimization layer: it scales reported token counts so Claude's auto-compact fires at the right moment, and sends SSE keep-alive pings to prevent read timeouts during long prefill on heavy models.
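Both tricks are easy to picture in a few lines. The sketch below is a guess at the general shape, not oMLX's implementation: `scale_reported_tokens` assumes the scaling is a simple ratio between the local model's context window and the window the client believes it has, and `stream_with_keepalive` assumes pings are sent as SSE comment lines (lines starting with a colon, which SSE clients ignore). All names and numbers here are hypothetical.

```python
import queue
import threading
import time


def scale_reported_tokens(actual_tokens, model_window, client_window):
    """Map usage in the model's window onto the window the client assumes."""
    return round(actual_tokens * client_window / model_window)


def stream_with_keepalive(job, interval=0.05):
    """Yield SSE comment pings while job() runs, then its result as a data event."""
    results = queue.Queue()

    def run():
        try:
            results.put(("ok", job()))
        except Exception as exc:  # surface failures instead of pinging forever
            results.put(("error", exc))

    threading.Thread(target=run, daemon=True).start()
    while True:
        try:
            status, value = results.get(timeout=interval)
        except queue.Empty:
            yield ": keep-alive\n\n"  # comment line: keeps the socket active
            continue
        if status == "ok":
            yield f"data: {value}\n\n"
        else:
            yield f"event: error\ndata: {value}\n\n"
        return


def slow_prefill():
    time.sleep(0.3)  # stand-in for a long prefill on a heavy model
    return "first token"


if __name__ == "__main__":
    # Half of a 32,768-token window reads as half of an assumed 200,000 one.
    print(scale_reported_tokens(16_384, 32_768, 200_000))
    for event in stream_with_keepalive(slow_prefill):
        print(repr(event))
```

The point of the scaling is that a client tuned for a large hosted context would otherwise compact far too late for a smaller local window; reporting proportionally inflated counts moves that trigger earlier. The pings simply give the client something to read while no tokens are ready yet.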
Key Features