{"id":2866,"date":"2025-11-20T14:14:48","date_gmt":"2025-11-20T14:14:48","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/11\/20\/docker-model-runner-integrates-vllm-for-high-throughput-inference\/"},"modified":"2025-11-20T14:14:48","modified_gmt":"2025-11-20T14:14:48","slug":"docker-model-runner-integrates-vllm-for-high-throughput-inference","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/11\/20\/docker-model-runner-integrates-vllm-for-high-throughput-inference\/","title":{"rendered":"Docker Model Runner Integrates vLLM for High-Throughput Inference"},"content":{"rendered":"<h2 class=\"wp-block-heading\"><strong>Expanding Docker Model Runner\u2019s Capabilities<\/strong><\/h2>\n<p>Today, we\u2019re excited to announce that <a href=\"https:\/\/www.docker.com\/products\/model-runner\/\">Docker Model Runner<\/a> now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you already use.<\/p>\n<p>When we first introduced Docker Model Runner, our goal was to make it simple for developers to run and experiment with large language models (LLMs) using Docker. We designed it to integrate multiple inference engines from day one, starting with llama.cpp, to make it easy to get models running anywhere.<\/p>\n<p>Now, we\u2019re taking the next step in that journey. 
With vLLM integration, you can scale AI workloads from low-end to high-end Nvidia hardware, without ever leaving your Docker workflow.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Why vLLM?<\/strong><\/h2>\n<div class=\"wp-block-ponyo-image\">\n            <img data-opt-id=943667618  fetchpriority=\"high\" decoding=\"async\" width=\"1000\" height=\"287\" src=\"https:\/\/www.docker.com\/app\/uploads\/2025\/11\/vLLM-blog-1.png\" class=\"fade-in attachment-full size-full\" alt=\"vLLM blog 1\" title=\"- vLLM blog 1\" \/>\n    <\/div>\n<p><a href=\"https:\/\/docs.vllm.ai\/en\/latest\/\" rel=\"nofollow noopener\" target=\"_blank\">vLLM<\/a> is a high-throughput, open-source inference engine built to serve large language models efficiently at scale. It\u2019s used across the industry for deploying production-grade LLMs thanks to its focus on throughput, latency, and memory efficiency.<\/p>\n<p>Here\u2019s what makes vLLM stand out:<\/p>\n<ul class=\"wp-block-list\">\n<li>Optimized performance: Uses PagedAttention, an advanced attention algorithm that minimizes memory overhead and maximizes GPU utilization.<\/li>\n<li>Scalable serving: Handles batch requests and streaming outputs natively, perfect for interactive and high-traffic AI services.<\/li>\n<li>Model flexibility: Works seamlessly with popular open-weight models like GPT-OSS, Qwen3, Mistral, Llama 3, and others in the safetensors format.<\/li>\n<\/ul>\n<p>By bringing vLLM to Docker Model Runner, we\u2019re bridging the gap between fast local experimentation and robust production inference.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How vLLM Works<\/strong><\/h2>\n<p>Running vLLM models with Docker Model Runner is as simple as installing the backend and running your model, no special setup required.<\/p>\n<p>Install Docker Model Runner with vLLM backend:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\ndocker model install-runner --backend 
vllm --gpu cuda\n<\/pre>\n<\/div>\n<p>Once the installation finishes, you\u2019re ready to start using it right away:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\ndocker model run ai\/smollm2-vllm \"Can you read me?\"\n\nSure, I am ready to read you.\n<\/pre>\n<\/div>\n<p>Or access it via API:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\ncurl --location 'http:\/\/localhost:12434\/v1\/chat\/completions' \\\n--header 'Content-Type: application\/json' \\\n--data '{\n  \"model\": \"ai\/smollm2-vllm\",\n  \"messages\": [\n    {\n      \"role\": \"user\",\n      \"content\": \"Can you read me?\"\n    }\n  ]\n}'\n\n<\/pre>\n<\/div>\n<p>Note that there\u2019s no reference to vLLM in the HTTP request or CLI command.<\/p>\n<p>That\u2019s because Docker Model Runner automatically routes the request to the correct inference engine based on the model you\u2019re using, ensuring a seamless experience whether you\u2019re using llama.cpp or vLLM.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Why Multiple Inference Engines?<\/strong><\/h2>\n<p>Until now, developers had to choose between simplicity and performance. 
You could either run models easily (using simplified portable tools like Docker Model Runner with llama.cpp) or achieve maximum throughput (with frameworks like vLLM).<\/p>\n<p>Docker Model Runner now gives you both.<\/p>\n<p>You can:<\/p>\n<ul class=\"wp-block-list\">\n<li>Prototype locally with llama.cpp.<\/li>\n<li>Scale to production with vLLM.<\/li>\n<\/ul>\n<p>Use the same consistent Docker commands, CI\/CD workflows, and deployment environments throughout.<\/p>\n<p>This flexibility makes Docker Model Runner a first in the industry: no other tool lets you switch between multiple inference engines within a single, portable, containerized workflow.<\/p>\n<p>By unifying these engines under one interface, Docker is making AI truly portable, from laptops to clusters, and everything in between.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Safetensors (vLLM) vs. GGUF (llama.cpp): Choosing the Right Format<\/strong><\/h2>\n<p>With the addition of vLLM, Docker Model Runner is now compatible with the two dominant open-source model formats: Safetensors and GGUF. While Model Runner abstracts the complexity of setting up the engines, understanding the difference between these formats helps in choosing the right tool for your infrastructure.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>GGUF (GPT-Generated Unified Format):<\/strong> The native format for llama.cpp, GGUF is designed for high portability and quantization. It is excellent for running models on commodity hardware where memory bandwidth is limited. It packages the model architecture and weights into a single file.<\/li>\n<li><strong>Safetensors:<\/strong> The native format for vLLM and the modern standard for high-end inference, safetensors is built for high-throughput performance.<\/li>\n<\/ul>\n<p>Docker Model Runner routes your request based on the model format: if you pull a GGUF model, it uses llama.cpp; if you pull a safetensors model, it uses vLLM. 
With Docker Model Runner, both can be pushed and pulled as OCI images to any OCI registry.<\/p>\n<h2 class=\"wp-block-heading\"><strong>vLLM-compatible models on Docker Hub<\/strong><\/h2>\n<p>vLLM models are in safetensors format. Some early safetensors models are available on <a href=\"https:\/\/hub.docker.com\/u\/ai?page=1&amp;search=vllm\" rel=\"nofollow noopener\" target=\"_blank\">Docker Hub<\/a>:<\/p>\n<div class=\"wp-block-ponyo-image\">\n            <img data-opt-id=51935551  fetchpriority=\"high\" decoding=\"async\" width=\"1000\" height=\"456\" src=\"https:\/\/www.docker.com\/app\/uploads\/2025\/11\/vLLM-blog-2.png\" class=\"fade-in attachment-full size-full\" alt=\"vLLM blog 2\" title=\"- vLLM blog 2\" \/>\n    <\/div>\n<h2 class=\"wp-block-heading\"><strong>Available Now: x86_64 with Nvidia<\/strong><\/h2>\n<p>Our initial release is optimized for and available on systems running the x86_64 architecture with Nvidia GPUs. Our team has dedicated its efforts to creating a rock-solid experience on this platform, and we\u2019re confident you\u2019ll feel the difference.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What\u2019s Next?<\/strong><\/h2>\n<p>This launch is just the beginning. Our vLLM roadmap is focused on two key areas: expanding platform access and continuous performance tuning.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>WSL2\/Docker Desktop compatibility:<\/strong> We know that a seamless \u201cinner loop\u201d is critical for developers. We are actively working to bring the vLLM backend to Windows via WSL2. This will allow you to build, test, and prototype high-throughput AI applications in Docker Desktop with the same workflow you use in Linux environments, starting with Windows machines that have Nvidia GPUs.<\/li>\n<li><strong>DGX Spark compatibility:<\/strong> We are optimizing Model Runner for different kinds of hardware. 
We are working to add compatibility for Nvidia DGX systems.<\/li>\n<li><strong>Performance Optimization:<\/strong> We\u2019re also actively tracking areas for improvement. While vLLM offers incredible throughput, we recognize that its startup time is currently longer than llama.cpp\u2019s. This is a key area we are looking to optimize in future enhancements to improve the \u201ctime-to-first-token\u201d for rapid development cycles.<\/li>\n<\/ul>\n<p>Thank you for your support and patience as we grow.<\/p>\n<h3 class=\"wp-block-heading\"><strong>How You Can Get Involved<\/strong><\/h3>\n<p>The strength of Docker Model Runner lies in its community, and there\u2019s always room to grow. We need your help to make this project the best it can be. To get involved, you can:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Star the repository:<\/strong> Show your support and help us gain visibility by starring the <a href=\"https:\/\/github.com\/docker\/model-runner\" rel=\"nofollow noopener\" target=\"_blank\">Docker Model Runner repo<\/a>.<\/li>\n<li><strong>Contribute your ideas:<\/strong> Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We\u2019re excited to see what ideas you have!<\/li>\n<li><strong>Spread the word:<\/strong> Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.<\/li>\n<\/ul>\n<p>We\u2019re incredibly excited about this new chapter for Docker Model Runner, and we can\u2019t wait to see what we can build together. 
Let\u2019s get to work!<\/p>","protected":false},"excerpt":{"rendered":"<p>Expanding Docker Model Runner\u2019s Capabilities Today, we\u2019re excited to announce that Docker Model Runner now integrates the vLLM inference engine [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2867,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[4],"tags":[],"class_list":["post-2866","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-docker"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2866","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=2866"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2866\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/2867"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=2866"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=2866"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=2866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}