jundot/omlx

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

8,165 Stars · 682 Forks · 49 Watchers · 52 Open Issues

Python·Apache License 2.0·Last commit Apr 3, 2026·by @jundot·Published April 3, 2026·Analyzed 5d ago
Safety Rating A

No hardcoded secrets, malicious code patterns, suspicious dependencies, or prompt injection attempts were found. The repository is a well-structured open source inference server project under Apache 2.0, with clear attribution to upstream dependencies (MLX, mlx-lm, vllm-mlx). The optional API key feature is a user-supplied value at runtime, not embedded in code. No red flags identified.

AI-assisted review, not a professional security audit.

AI Analysis

oMLX is an LLM inference server optimized for Apple Silicon Macs, featuring continuous batching, tiered KV caching (hot RAM + cold SSD), and a native macOS menu bar app. It provides OpenAI- and Anthropic-compatible API endpoints; supports text LLMs, vision-language models, embeddings, and rerankers; and includes a web-based admin dashboard for real-time monitoring, model management, and benchmarking. KV cache blocks persist across requests and server restarts via SSD offloading, making local LLM serving practical for agentic coding workflows.
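Because the server exposes an OpenAI-compatible endpoint, existing OpenAI client code should work by pointing the base URL at the local machine. A minimal sketch of building such a request — the port, path, and model name here are illustrative assumptions, not taken from the project's docs:

```python
import json

# Hypothetical local endpoint -- port and path are assumptions, check the oMLX docs.
BASE_URL = "http://localhost:8000/v1"

def chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = chat_request("mlx-community/example-model", "Hello!")
# This body would be POSTed to f"{BASE_URL}/chat/completions",
# e.g. with any OpenAI SDK configured with base_url=BASE_URL.
print(json.dumps(body, indent=2))
```

Any tool that accepts a custom OpenAI base URL (SDKs, coding agents) could then be aimed at the local server without code changes.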

Use Cases

  • Running local LLMs on Apple Silicon Macs with OpenAI-compatible API
  • Serving multiple models concurrently with LRU eviction and model pinning
  • Persisting KV cache to SSD to avoid recomputation across long coding sessions (e.g., with Claude Code)
  • Downloading and managing MLX-format models from HuggingFace via a web dashboard
  • Using MCP (Model Context Protocol) tool calling with locally served models
  • Embedding and reranking documents for RAG pipelines on Apple Silicon
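The multi-model use case above — LRU eviction with model pinning — can be sketched with a small cache in which pinned models are exempt from eviction. This is an illustrative sketch of the general technique, not oMLX's actual implementation:

```python
from collections import OrderedDict

class ModelCache:
    """LRU cache of loaded models; pinned models are never evicted (sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.models = OrderedDict()  # name -> model object, oldest first
        self.pinned = set()

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)  # mark as most recently used
            return self.models[name]
        return None

    def put(self, name, model):
        self.models[name] = model
        self.models.move_to_end(name)
        while len(self.models) > self.capacity:
            # Evict the least recently used *unpinned* model.
            victim = next((n for n in self.models if n not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate being over capacity
            del self.models[victim]

    def pin(self, name):
        self.pinned.add(name)

cache = ModelCache(capacity=2)
cache.put("qwen", "<qwen weights>")
cache.pin("qwen")
cache.put("llama", "<llama weights>")
cache.put("gemma", "<gemma weights>")  # evicts "llama", not the pinned "qwen"
print(sorted(cache.models))  # ['gemma', 'qwen']
```

Pinning keeps a frequently used model resident even when other requests churn through the cache — useful when one model backs a long-running agent session.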

Tags

#llm #server #self-hosted #local-first #desktop-app #api #api-wrapper #mcp #rag #embeddings #context-engineering #streaming #caching #cli-tool

Project Connections

Depends on / used by

mlx-lm

oMLX directly depends on mlx-lm for its BatchGenerator and LLM inference pipeline on Apple Silicon

Depends on / used by

mlx-vlm

oMLX uses mlx-vlm for vision-language model inference support

Inspired by / successor to

vLLM

oMLX's block-based paged KV cache design is explicitly inspired by vLLM, and it evolved from vllm-mlx v0.1.0
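The block-based paged KV cache mentioned here can be illustrated with a block table that maps a sequence's logical token positions onto fixed-size physical blocks — the core idea vLLM introduced with PagedAttention. A simplified sketch under assumed names and a toy allocator, not the project's actual data structures:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative; real block sizes vary)

class BlockTable:
    """Map a sequence's logical token index to (physical_block_id, offset)."""

    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # toy allocator: a list of free block ids
        self.blocks = []                # logical block order -> physical block id

    def append_token(self, token_index: int):
        if token_index % BLOCK_SIZE == 0:  # first token of a new logical block
            self.blocks.append(self.free_blocks.pop())

    def locate(self, token_index: int):
        return self.blocks[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

free_blocks = list(range(100, 0, -1))  # ids 100..1; pop() hands out 1, 2, ...
table = BlockTable(free_blocks)
for i in range(20):                    # a 20-token sequence spans two blocks
    table.append_token(i)
print(table.locate(17))  # token 17 lives in the second block at offset 1
```

Because blocks are fixed-size and location-independent, they can be evicted, swapped to SSD, and restored individually — which is what makes the tiered hot-RAM/cold-SSD caching described above workable.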

Alternative to

LM Studio

Both provide local LLM serving with GUI management on Apple Silicon, targeting a similar developer audience

Alternative to

Ollama

Both are self-hosted local LLM servers with OpenAI-compatible APIs, though Ollama is cross-platform while oMLX is Apple Silicon-specific with deeper macOS integration