jundot/omlx

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

8,165 Stars · 682 Forks · 49 Watchers · 52 Open Issues

Python·Apache License 2.0·Last commit Apr 3, 2026·by @jundot·Published April 3, 2026·Analyzed 5d ago
Safety Rating A

No hardcoded secrets, malicious code patterns, suspicious dependencies, or prompt injection attempts were found. The repository is a well-structured open source inference server project under Apache 2.0, with clear attribution to upstream dependencies (MLX, mlx-lm, vllm-mlx). The optional API key feature is a user-supplied value at runtime, not embedded in code. No red flags identified.

AI-assisted review, not a professional security audit.

AI Analysis

oMLX is an LLM inference server optimized for Apple Silicon Macs, featuring continuous batching, tiered KV caching (hot RAM + cold SSD), and a native macOS menu bar app. It provides OpenAI- and Anthropic-compatible API endpoints; supports text LLMs, vision-language models, embeddings, and rerankers; and includes a web-based admin dashboard for real-time monitoring, model management, and benchmarking. KV cache blocks persist across requests and server restarts via SSD offloading, making local LLM serving practical for agentic coding workflows.
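Because the server exposes an OpenAI-compatible endpoint, existing OpenAI client code should work by pointing the base URL at the local machine. A minimal sketch of building such a request — the port, path, and model name here are illustrative assumptions, not taken from the project's docs:

```python
import json

# Hypothetical local endpoint -- port and path are assumptions, check the oMLX docs.
BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = chat_request("mlx-community/example-model", "Hello!")
# This body would be POSTed to f"{BASE_URL}/chat/completions",
# e.g. with any OpenAI SDK configured with base_url=BASE_URL.
print(json.dumps(body, indent=2))
```

Any tool that accepts a custom OpenAI base URL (SDKs, coding agents) could then be aimed at the local server without code changes.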

Use Cases

  • Running local LLMs on Apple Silicon Macs with OpenAI-compatible API
  • Serving multiple models concurrently with LRU eviction and model pinning
  • Persisting KV cache to SSD to avoid recomputation across long coding sessions (e.g., with Claude Code)
  • Downloading and managing MLX-format models from HuggingFace via a web dashboard
  • Using MCP (Model Context Protocol) tool calling with locally served models
  • Embedding and reranking documents for RAG pipelines on Apple Silicon
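The multi-model use case above — LRU eviction with model pinning — can be sketched with a small cache in which pinned models are exempt from eviction. This is an illustrative sketch of the general technique, not oMLX's actual implementation:

```python
from collections import OrderedDict

class ModelCache:
    """LRU cache of loaded models; pinned models are never evicted (sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.models = OrderedDict()  # name -> model object, oldest first
        self.pinned = set()

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)  # mark as most recently used
            return self.models[name]
        return None

    def put(self, name, model):
        self.models[name] = model
        self.models.move_to_end(name)
        while len(self.models) > self.capacity:
            # Evict the least recently used *unpinned* model.
            victim = next((n for n in self.models if n not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate being over capacity
            del self.models[victim]

    def pin(self, name):
        self.pinned.add(name)

cache = ModelCache(capacity=2)
cache.put("qwen", "<qwen weights>")
cache.pin("qwen")
cache.put("llama", "<llama weights>")
cache.put("gemma", "<gemma weights>")  # evicts "llama", not the pinned "qwen"
print(sorted(cache.models))  # ['gemma', 'qwen']
```

Pinning keeps a frequently used model resident even when other requests churn through the cache — useful when one model backs a long-running agent session.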

Tags

#llm #server #self-hosted #local-first #desktop-app #api #api-wrapper #mcp #rag #embeddings #context-engineering #streaming #caching #cli-tool

Project Connections

Depends on / used by

mlx-lm

oMLX directly depends on mlx-lm for its BatchGenerator and LLM inference pipeline on Apple Silicon

Depends on / used by

mlx-vlm

oMLX uses mlx-vlm for vision-language model inference support

Inspired by / successor to

vLLM

oMLX's block-based paged KV cache design is explicitly inspired by vLLM, and it evolved from vllm-mlx v0.1.0
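The block-based paged KV cache mentioned here can be illustrated with a block table that maps a sequence's logical token positions onto fixed-size physical blocks — the core idea vLLM introduced with PagedAttention. A simplified sketch under assumed names and a toy allocator, not the project's actual data structures:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative; real block sizes vary)

class BlockTable:
    """Map a sequence's logical token index to (physical_block_id, offset)."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # toy allocator: a list of free block ids
        self.blocks = []                # logical block order -> physical block id

    def append_token(self, token_index: int):
        if token_index % BLOCK_SIZE == 0:  # first token of a new logical block
            self.blocks.append(self.free_blocks.pop())

    def locate(self, token_index: int):
        return self.blocks[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

free_blocks = list(range(100, 0, -1))  # ids 100..1; pop() hands out 1, 2, ...
table = BlockTable(free_blocks)
for i in range(20):                    # a 20-token sequence spans two blocks
    table.append_token(i)
print(table.locate(17))  # token 17 lives in the second block at offset 1
```

Because blocks are fixed-size and location-independent, they can be evicted, swapped to SSD, and restored individually — which is what makes the tiered hot-RAM/cold-SSD caching described above workable.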

Alternative to

LM Studio

Both provide local LLM serving with GUI management on Apple Silicon, targeting a similar developer audience

Alternative to

Ollama

Both are self-hosted local LLM servers with OpenAI-compatible APIs, though Ollama is cross-platform while oMLX is Apple Silicon-specific with deeper macOS integration