Docker Model Runner Brings vLLM to macOS with Apple Silicon

vLLM has quickly become the go-to inference engine for developers who need high-throughput LLM serving. We brought vLLM to Docker Model Runner for NVIDIA GPUs on Linux, then extended it to Windows via WSL2.

That changes today. Docker Model Runner now supports vllm-metal, a new backend that brings vLLM inference to macOS using Apple Silicon’s Metal GPU. If you have a Mac with an M-series chip, you can now run MLX models through vLLM with the same OpenAI-compatible API, same Anthropic-compatible API for tools like Claude Code, and all in one, the same Docker workflow.

What is vllm-metal?

vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed in collaboration between Docker and the vLLM project, it unifies MLX, the Apple’s machine learning framework, and PyTorch under a single compute pathway, plugging directly into vLLM’s existing engine, scheduler, and OpenAI-compatible API server.

The architecture is layered: vLLM’s core (engine, scheduler, tokenizer, API) stays unchanged on top. A plugin layer consisting of MetalPlatform, MetalWorker, and MetalModelRunner handles the Apple Silicon specifics. Underneath, MLX drives the actual inference while PyTorch handles model loading and weight conversion. The whole stack runs on Metal, Apple’s GPU framework.

+-------------------------------------------------------------+
|                          vLLM Core                          |
|        Engine | Scheduler | API | Tokenizers                |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                   vllm_metal Plugin Layer                   |
|   +-----------+  +-----------+  +------------------------+  |
|   | Platform  |  | Worker    |  | ModelRunner            |  |
|   +-----------+  +-----------+  +------------------------+  |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                   Unified Compute Backend                   |
|   +------------------+    +----------------------------+    |
|   | MLX (Primary)    |    | PyTorch (Interop)          |    |
|   | - SDPA           |    | - HF Loading               |    |
|   | - RMSNorm        |    | - Weight Conversion        |    |
|   | - RoPE           |    | - Tensor Bridge            |    |
|   | - Cache Ops      |    |                            |    |
|   +------------------+    +----------------------------+    |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                       Metal GPU Layer                       |
|           Apple Silicon Unified Memory Architecture         |
+-------------------------------------------------------------+

Figure 1: High-level architecture diagram of vllm-metal. Credit: vllm-metal

What makes this particularly effective on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations. Combined with paged attention for efficient KV cache management and Grouped-Query Attention support, this means you can serve longer sequences with less memory waste.

vllm-metal runs MLX models published by the mlx-community on Hugging Face. These models are built specifically for the MLX framework and take full advantage of Metal GPU acceleration. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed, falling back to the built-in MLX backend otherwise.

How vllm-metal works

vllm-metal runs natively on the host. This is necessary because Metal GPU access requires direct hardware access and there is no GPU passthrough for Metal in containers.

When you install the backend, Docker Model Runner:

  1. Pulls a Docker image from Hub that contains a self-contained Python 3.12 environment with vllm-metal and all dependencies pre-packaged.
  2. Extracts it to `~/.docker/model-runner/vllm-metal/`.
  3. Verifies the installation by importing the `vllm_metal` module.

When a request comes in for a compatible model, the Docker Model Runner’s scheduler starts a vllm-metal server process that communicates over TCP, serving the standard OpenAI API. The model is loaded from Docker’s shared model store, which contains all the models you pull with `docker model pull`.

Which models work with vllm-metal?

vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some examples you can try:

vLLM everywhere with Docker Model Runner

With vllm-metal, Docker Model Runner now supports vLLM across the three major platforms:

Platform

Backend

GPU


Linux



vllm



NVIDIA (CUDA)



Windows (WSL2)



vllm



NVIDIA (CUDA)



macOS



vllm-metal



Apple Silicon (Metal)


The same docker model commands work regardless of platform. Pull a model, run it. Docker Model Runner picks the right backend for your platform.

Get started

Update to Docker Desktop 4.62 or later for Mac, and install the backend:

docker model install-runner --backend vllm-metal

Check out the Docker Model Runner documentation to learn more. For contributions, feedback, and bug reports, visit the docker/model-runner repository on GitHub.

Giving Back: vllm-metal is Now Open Source

At Docker, we believe that the best way to accelerate AI development is to build in the open. That is why we are proud to announce that Docker has contributed the vllm-metal project to the vLLM community. Originally developed by Docker engineers to power Model Runner on macOS, this project now lives under the vLLM GitHub organization. This ensures that every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project also has had significant contributions by Lik Xun Yuan, Ricky Chen and Ranran Haoran Zhang.

The $599 AI Development Rig

For a long time, high-throughput vLLM development was gated behind a significant GPU cost. To get started, you typically need a dedicated Linux box with an RTX 4090 ($1,700+) or enterprise-grade A100/H100 cards ($10,000+).

vllm-metal changes the math

Now, a base $599 Mac Mini with an M4 chip becomes a viable vLLM development environment. Because Apple Silicon uses Unified Memory, that 16GB (or upgraded 32GB/64GB) of RAM is directly accessible by the GPU. This allows you to:

  • Develop & Test Locally: Build your vLLM-based applications on the same machine you use for coding.
  • Production-Mirroring: Use the exact same OpenAI-compatible API on your Mac Mini as you would on an H100 cluster in production.
  • Energy Efficiency: Run inference at a fraction of the power consumption (and heat) of a discrete GPU rig.

How does vllm-metal compare to llama.cpp?

We benchmarked both backends using Llama 3.2 1B Instruct with comparable 4-bit quantization, served through Docker Model Runner on Apple Silicon.

llama.cpp

vLLM-Metal


Model



unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_0



mlx-community/llama-3.2-1b-instruct-4bit



Format



GGUF (Q4_0)



Safetensors (MLX 4-bit)


Throughput (tokens/sec, wall-clock)

max_tokens

llama.cpp

vLLM-Metal

speedup


128



333.3



251.5



1.3x



512



345.1



279.0



1.3x



1024



338.5



275.4



1.2x



2048



339.1



279.5



1.2x


Each configuration was run 3 times across 3 different prompts (9 total requests per data point).

Throughput is measured as completion_tokens / wall_clock_time, applied consistently to both backends.

Key observations:

  • llama.cpp is consistently ~1.2x faster than vLLM-Metal across all output lengths.
  • llama.cpp throughput is remarkably stable (~333-345 tok/s regardless of max_tokens), while vLLM-Metal shows more variance between individual runs (134-343 tok/s).
  • Both backends scale well. Neither backend shows significant degradation as output length increases.
  • Quantization methods differ (GGUF Q4_0 vs MLX 4-bit), so this benchmarks the full stack, engine + quantization, rather than the engine alone.

The benchmark script used for these results is available as a GitHub Gist.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. To get involved:

  • Star the repository: Show your support by starring the Docker Model Runner repo.
  • Contribute your ideas: Create an issue or submit a pull request. We’re excited to see what ideas you have!
  • Spread the word: Tell your friends and colleagues who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn More

Scroll to Top