{"id":2088,"date":"2025-06-03T19:23:06","date_gmt":"2025-06-03T19:23:06","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/03\/how-to-make-an-ai-chatbot-from-scratch-using-docker-model-runner\/"},"modified":"2025-06-03T19:23:06","modified_gmt":"2025-06-03T19:23:06","slug":"how-to-make-an-ai-chatbot-from-scratch-using-docker-model-runner","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/03\/how-to-make-an-ai-chatbot-from-scratch-using-docker-model-runner\/","title":{"rendered":"How to Make an AI Chatbot from Scratch using Docker Model Runner"},"content":{"rendered":"<p>Today, we\u2019ll show you how to build a fully functional Generative AI chatbot using <a href=\"https:\/\/www.docker.com\/blog\/introducing-docker-model-runner\/\">Docker Model Runner<\/a> and powerful observability tools, including Prometheus, Grafana, and Jaeger. We\u2019ll walk you through the common challenges developers face when building AI-powered applications, demonstrate how Docker Model Runner solves these pain points, and then guide you step-by-step through building a production-ready chatbot with comprehensive monitoring and metrics.<\/p>\n<p>By the end of this guide, you\u2019ll know how to make an AI chatbot and run it locally. You\u2019ll also learn how to set up real-time monitoring insights, streaming responses, and a modern React interface \u2014 all orchestrated through familiar Docker workflows.<\/p>\n<h2 class=\"wp-block-heading\">The current challenges with GenAI development<\/h2>\n<p>Generative AI (GenAI) is revolutionizing software development, but creating AI-powered applications comes with significant challenges. First, the current AI landscape is fragmented \u2014 developers must piece together various libraries, frameworks, and platforms that weren\u2019t designed to work together. Second, running large language models efficiently requires specialized hardware configurations that vary across platforms, while AI model execution remains disconnected from standard container workflows. This forces teams to maintain separate environments for their application code and AI models.<\/p>\n<p>Third, without standardized methods for storing, versioning, and serving models, development teams struggle with inconsistent deployment practices. Meanwhile, relying on cloud-based AI services creates financial strain through unpredictable costs that scale with usage. Additionally, sending data to external AI services introduces privacy and security risks, especially for applications handling sensitive information.<\/p>\n<p>These challenges combine to create a frustrating developer experience that hinders experimentation and slows innovation precisely when businesses need to accelerate their AI adoption. 
Docker Model Runner addresses these pain points by providing a streamlined solution for <a href=\"https:\/\/www.docker.com\/blog\/run-llms-locally\/\">running AI models locally<\/a>, right within your existing Docker workflow.<\/p>\n<h2 class=\"wp-block-heading\">How Docker is solving these challenges<\/h2>\n<p><a href=\"https:\/\/www.docker.com\/blog\/introducing-docker-model-runner\/\">Docker Model Runner<\/a> offers a revolutionary approach to GenAI development by integrating AI model execution directly into familiar container workflows.\u00a0<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 1: Comparison diagram showing complex multi-step traditional GenAI setup versus simplified Docker Model Runner single-command workflow<\/em><\/p>\n\n<p>Many developers successfully use <a href=\"https:\/\/hub.docker.com\/catalogs\/models\" target=\"_blank\">containerized AI models<\/a>, benefiting from integrated workflows, cost control, and data privacy. Docker Model Runner builds on these strengths by making it even easier and more efficient to work with models. By running models natively on your host machine while maintaining the familiar Docker interface, Model Runner delivers:<\/p>\n<p><strong>Simplified Model Execution<\/strong>: Run AI models locally with a simple Docker CLI command, with no complex setup required<\/p>\n<p><strong>Hardware Acceleration<\/strong>: Direct access to GPU resources without containerization overhead<\/p>\n<p><strong>Integrated Workflow<\/strong>: Seamless integration with existing Docker tools and container development practices<\/p>\n<p><strong>Standardized Packaging<\/strong>: Models are distributed as OCI artifacts through the same registries you already use<\/p>\n<p><strong>Cost Control<\/strong>: Eliminate unpredictable API costs by running models locally<\/p>\n<p><strong>Data Privacy<\/strong>: Keep sensitive data within your infrastructure with no external API calls<\/p>\n<p>This approach fundamentally changes how developers can build and test AI-powered applications, making local development faster, more secure, and dramatically more efficient.<\/p>\n\n<h2 class=\"wp-block-heading\">How to create an AI chatbot with Docker<\/h2>\n<p>In this guide, we\u2019ll build a comprehensive GenAI application that showcases how to create a fully featured chat interface powered by Docker Model Runner, complete with advanced observability tools to monitor and optimize your AI models.<\/p>\n<h3 class=\"wp-block-heading\">Project overview<\/h3>\n<p>The project is a complete Generative AI interface that demonstrates how to:<\/p>\n<p>Create a responsive React\/TypeScript chat UI with streaming responses<\/p>\n<p>Build a Go backend server that integrates with Docker Model Runner<\/p>\n<p>Implement comprehensive observability with metrics, logging, and tracing<\/p>\n<p>Monitor AI model performance with real-time metrics<\/p>\n<h3 class=\"wp-block-heading\">Architecture<\/h3>\n<p>The application consists of several main components, and a chat request flows through them as follows:<\/p>\n<p>The frontend sends chat messages to the backend API<\/p>\n<p>The backend formats the messages and sends them to the Model Runner<\/p>\n<p>The LLM processes the input and generates a response<\/p>\n<p>The backend streams the tokens back to the frontend as they\u2019re generated<\/p>\n<p>The frontend displays the incoming tokens in real-time<\/p>\n<p>Observability components collect metrics, logs, and traces throughout the process<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p 
class=\"has-small-font-size\"><em>Figure 2: Architecture diagram showing data flow between frontend, backend, Model Runner, and observability tools like Prometheus, Grafana, and Jaeger.<\/em><\/p>\n\n<h3 class=\"wp-block-heading\">Project structure<\/h3>\n<p>The <a href=\"https:\/\/github.com\/dockersamples\/genai-model-runner-metrics\" target=\"_blank\">project<\/a> has the following structure:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\ntree -L 2<br \/>\n.<br \/>\n\u251c\u2500\u2500 Dockerfile<br \/>\n\u251c\u2500\u2500 README-model-runner.md<br \/>\n\u251c\u2500\u2500 README.md<br \/>\n\u251c\u2500\u2500 backend.env<br \/>\n\u251c\u2500\u2500 compose.yaml<br \/>\n\u251c\u2500\u2500 frontend<br \/>\n..<br \/>\n\u251c\u2500\u2500 go.mod<br \/>\n\u251c\u2500\u2500 go.sum<br \/>\n\u251c\u2500\u2500 grafana<br \/>\n\u2502   \u2514\u2500\u2500 provisioning<br \/>\n\u251c\u2500\u2500 main.go<br \/>\n\u251c\u2500\u2500 main_branch_update.md<br \/>\n\u251c\u2500\u2500 observability<br \/>\n\u2502   \u2514\u2500\u2500 README.md<br \/>\n\u251c\u2500\u2500 pkg<br \/>\n\u2502   \u251c\u2500\u2500 health<br \/>\n\u2502   \u251c\u2500\u2500 logger<br \/>\n\u2502   \u251c\u2500\u2500 metrics<br \/>\n\u2502   \u251c\u2500\u2500 middleware<br \/>\n\u2502   \u2514\u2500\u2500 tracing<br \/>\n\u251c\u2500\u2500 prometheus<br \/>\n\u2502   \u2514\u2500\u2500 prometheus.yml<br \/>\n\u251c\u2500\u2500 refs<br \/>\n\u2502   \u2514\u2500\u2500 heads<br \/>\n..\n<p>21 directories, 33 files\n<\/p><\/div>\n\n<p>We\u2019ll examine the key files and understand how they work together throughout this guide.<\/p>\n<h3 class=\"wp-block-heading\">Prerequisites<\/h3>\n<p>Before we begin, make sure you have:<\/p>\n<p><a href=\"https:\/\/www.docker.com\/products\/docker-desktop\/\">Docker Desktop<\/a> (version 4.40 or newer)\u00a0<\/p>\n<p><a href=\"https:\/\/docs.docker.com\/model-runner\/#enable-docker-model-runner\" target=\"_blank\">Docker Model Runner<\/a> enabled<\/p>\n<p>At least 16GB of RAM for running AI models efficiently<\/p>\n<p>Familiarity with Go (for backend development)<\/p>\n<p>Familiarity with React and TypeScript (for frontend development)<\/p>\n<h3 class=\"wp-block-heading\">Getting started<\/h3>\n<p>To run the application:<\/p>\n<p>Clone the repository:\u00a0<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\ngit clone<br \/>\nhttps:\/\/github.com\/dockersamples\/genai-model-runner-metrics\n<p>cd genai-model-runner-metrics<\/p>\n<\/div>\n<p>Enable Docker Model Runner in Docker Desktop:<\/p>\n<p>Go to Settings &gt; Features in Development &gt; Beta tab<\/p>\n<p>Enable \u201cDocker Model Runner\u201d<\/p>\n<p>Select \u201cApply and restart\u201d<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 3: Screenshot of Docker Desktop Beta Features settings panel with Docker AI, Docker Model Runner, and TCP support enabled.<\/em><\/p>\n<p>Download the model<\/p>\n<p>For this demo, we\u2019ll use Llama 3.2, but you can substitute any model of your choice:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\ndocker model pull ai\/llama3.2:1B-Q8_0\n<\/div>\n<p>Just like viewing containers, you can manage your downloaded AI models directly in Docker Dashboard under the Models section. 
Here you can see model details and storage usage, and manage your local AI model library.<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 4: View of Docker Dashboard showing locally downloaded AI models with details like size, parameters, and quantization.<\/em><\/p>\n<p>Start the application:\u00a0<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\ndocker compose up -d --build\n<\/div>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 5: List of active running containers in Docker Dashboard, including Jaeger, Prometheus, backend, frontend, and genai-model-runner-metrics.<\/em><\/p>\n<p>Open your browser and navigate to the frontend URL at <a href=\"http:\/\/localhost:3000\/\" target=\"_blank\">http:\/\/localhost:3000<\/a>. You\u2019ll be greeted with a modern chat interface (see screenshot) featuring:\u00a0<\/p>\n<p>Clean, responsive design with dark\/light mode toggle<\/p>\n<p>Message input area ready for your first prompt<\/p>\n<p>Model information displayed in the footer<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 6: GenAI chatbot interface showing live metrics panel with input\/output tokens, response time, and error rate.<\/em><\/p>\n<p>Click Expand to view metrics such as:<\/p>\n<p>Input tokens<\/p>\n<p>Output tokens<\/p>\n<p>Total Requests<\/p>\n<p>Average Response Time<\/p>\n<p>Error Rate<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 7: Expanded metrics view with input and output tokens, detailed chat prompt, and response generated by the Llama 3.2 model.<\/em><\/p>\n<p>Grafana allows you to visualize metrics through customizable dashboards. Click View Detailed Dashboard to open the Grafana dashboard.<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 8: Chat interface showing metrics dashboard with prompt and response plus option to view detailed metrics in Grafana.<\/em><\/p>\n<p>Log in with the default credentials (enter \u201cadmin\u201d as both the username and password) to explore pre-configured AI performance dashboards (see screenshot below) showing real-time metrics like tokens per second, memory usage, and model performance.\u00a0<\/p>\n<p>Select Add your first data source. Choose Prometheus as the data source. Enter \u201c<a href=\"http:\/\/prometheus:9090\/\" target=\"_blank\">http:\/\/prometheus:9090<\/a>\u201d as the Prometheus server URL. Scroll down to the end of the page and click \u201cSave and test\u201d. You should see \u201cSuccessfully queried the Prometheus API\u201d as confirmation. Then select Dashboards and click Re-import for these dashboards.<\/p>\n<p>By now, you should have a Prometheus 2.0 Stats dashboard up and running.<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 9: Grafana dashboard with multiple graph panels monitoring GenAI chatbot performance, displaying time-series charts for memory consumption, processing speeds, and application health<\/em><\/p>\n\n<p>Prometheus allows you to collect and store time-series metrics data. Open the Prometheus query interface <a href=\"http:\/\/localhost:9091\/\" target=\"_blank\">http:\/\/localhost:9091<\/a> and start typing \u201cgenai\u201d in the query box to explore all available AI metrics (as shown in the screenshot below). 
You\u2019ll see dozens of automatically collected metrics, including tokens per second, latency measurements, and llama.cpp-specific performance data.\u00a0<\/p>\n<div class=\"wp-block-ponyo-image\">\n<\/div>\n<p class=\"has-small-font-size\"><em>Figure 10: Prometheus web interface showing dropdown of available GenAI metrics including genai_app_active_requests and genai_app_token_latency<\/em><\/p>\n\n<p>Jaeger provides a visual exploration of request flows and performance bottlenecks. You can access it via <a href=\"http:\/\/localhost:16686\/\" target=\"_blank\">http:\/\/localhost:16686<\/a><\/p>\n<h3 class=\"wp-block-heading\">Implementation details<\/h3>\n<p>Let\u2019s explore how the key components of the project work:<\/p>\n<p>Frontend implementation<\/p>\n<p>The React frontend provides a clean, responsive chat interface built with TypeScript and modern React patterns. The core App.tsx component manages two essential pieces of state: dark mode preferences for user experience and model metadata fetched from the backend\u2019s health endpoint.\u00a0<\/p>\n<p>When the component mounts, the useEffect hook automatically retrieves information about the currently running AI model. It displays details like the model name directly in the footer to give users transparency about which LLM is powering their conversations.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ Essential App.tsx structure<br \/>\nfunction App() {<br \/>\n  const [darkMode, setDarkMode] = useState(false);<br \/>\n  const [modelInfo, setModelInfo] = useState&lt;ModelMetadata | null&gt;(null);\n<p>  \/\/ Fetch model info from backend<br \/>\n  useEffect(() =&gt; {<br \/>\n    fetch(&#8216;http:\/\/localhost:8080\/health&#8217;)<br \/>\n      .then(res =&gt; res.json())<br \/>\n      .then(data =&gt; setModelInfo(data.model_info));<br \/>\n  }, []);<\/p>\n<p>  return (<br \/>\n    &lt;div className=&#8221;min-h-screen bg-white dark:bg-gray-900&#8243;&gt;<br \/>\n      &lt;Header toggleDarkMode={() =&gt; setDarkMode(!darkMode)} \/&gt;<br \/>\n      &lt;ChatBox \/&gt;<br \/>\n      &lt;footer&gt;<br \/>\n        Powered by Docker Model Runner running {modelInfo?.model}<br \/>\n      &lt;\/footer&gt;<br \/>\n    &lt;\/div&gt;<br \/>\n  );<br \/>\n}<\/p>\n<\/div>\n<p>The main App component orchestrates the overall layout while delegating specific functionality to specialized components like Header for navigation controls and ChatBox for the actual conversation interface. This separation of concerns makes the codebase maintainable while the automatic model info fetching demonstrates how the frontend seamlessly integrates with the Docker Model Runner through the Go backend\u2019s API, creating a unified user experience that abstracts away the complexity of local AI model execution.<\/p>\n<p>Backend implementation: Integration with Model Runner<\/p>\n<p>The core of this application is a Go backend that communicates with Docker Model Runner. Let\u2019s examine the key parts of our main.go file:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\nclient := openai.NewClient(<br \/>\n    option.WithBaseURL(baseURL),<br \/>\n    option.WithAPIKey(apiKey),<br \/>\n)\n<\/div>\n<p>This demonstrates how we leverage Docker Model Runner\u2019s OpenAI-compatible API. The Model Runner exposes endpoints that match OpenAI\u2019s API structure, allowing us to use standard clients. 
Depending on your connection method, baseURL is set to either:<\/p>\n<p>http:\/\/model-runner.docker.internal\/engines\/llama.cpp\/v1\/ (for Docker socket)<\/p>\n<p>http:\/\/host.docker.internal:12434\/engines\/llama.cpp\/v1\/ (for TCP)<\/p>\n<h3 class=\"wp-block-heading\">How metrics flow from host to containers<\/h3>\n<p>One key architectural detail worth understanding: llama.cpp runs natively on your host (via Docker Model Runner), while Prometheus and Grafana run in containers. Here\u2019s how they communicate:<\/p>\n<p><strong>The Backend as Metrics Bridge:<\/strong><\/p>\n<p><strong>Connects<\/strong> to llama.cpp via Model Runner API (http:\/\/localhost:12434)<\/p>\n<p><strong>Collects<\/strong> performance data from each API call (response times, token counts)<\/p>\n<p><strong>Calculates<\/strong> metrics like tokens per second and memory usage<\/p>\n<p><strong>Exposes<\/strong> all metrics in Prometheus format at http:\/\/backend:9090\/metrics<\/p>\n<p><strong>Enables<\/strong> containerized Prometheus to scrape metrics without host access<\/p>\n<p>This hybrid architecture gives you the performance benefits of native model execution with the convenience of containerized observability.<\/p>\n<h3 class=\"wp-block-heading\">LLama.cpp metrics integration<\/h3>\n<p>The project provides detailed real-time metrics specifically for llama.cpp models:<\/p>\n<div class=\"wp-block-ponyo-table style__default\">\n<p><strong>Metric<\/strong><\/p>\n<p><strong>Description<\/strong><\/p>\n<p><strong>Implementation in Code<\/strong><\/p>\n<p>Tokens per Second<\/p>\n<p>Measure of model generation speed<\/p>\n<p>LlamaCppTokensPerSecond in metrics.go<\/p>\n<p>Context Window Size<\/p>\n<p>Maximum context length in tokens<\/p>\n<p>LlamaCppContextSize in metrics.go<\/p>\n<p>Prompt Evaluation Time<\/p>\n<p>Time spent processing input prompt<\/p>\n<p>LlamaCppPromptEvalTime in metrics.go<\/p>\n<p>Memory per Token<\/p>\n<p>Memory efficiency measurement<\/p>\n<p>LlamaCppMemoryPerToken in metrics.go<\/p>\n<p>Thread Utilization<\/p>\n<p>Number of CPU threads used<\/p>\n<p>LlamaCppThreadsUsed in metrics.go<\/p>\n<p>Batch Size<\/p>\n<p>Token processing batch size<\/p>\n<p>LlamaCppBatchSize in metrics.go<\/p>\n<\/div>\n<p>One of the most powerful features is our detailed metrics collection for llama.cpp models. These metrics help optimize model performance and identify bottlenecks in your inference pipeline.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ LlamaCpp metrics<br \/>\nllamacppContextSize = promautoFactory.NewGaugeVec(<br \/>\n    prometheus.GaugeOpts{<br \/>\n        Name: &#8220;genai_app_llamacpp_context_size&#8221;,<br \/>\n        Help: &#8220;Context window size in tokens for llama.cpp models&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;model&#8221;},<br \/>\n)\n<p>llamacppTokensPerSecond = promautoFactory.NewGaugeVec(<br \/>\n    prometheus.GaugeOpts{<br \/>\n        Name: &#8220;genai_app_llamacpp_tokens_per_second&#8221;,<br \/>\n        Help: &#8220;Tokens generated per second&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;model&#8221;},<br \/>\n)<\/p>\n<p>\/\/ More metrics definitions&#8230;<\/p>\n<\/div>\n<p>These metrics are collected, processed, and exposed both for Prometheus scraping and for real-time display in the front end. 
This gives us unprecedented visibility into how the llama.cpp inference engine is performing.<\/p>\n\n<h3 class=\"wp-block-heading\">Chat implementation with streaming<\/h3>\n<p>The chat endpoint implements streaming for real-time token generation:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<p>\/\/ Set up streaming with a proper SSE format<br \/>\nw.Header().Set(&#8220;Content-Type&#8221;, &#8220;text\/event-stream&#8221;)<br \/>\nw.Header().Set(&#8220;Cache-Control&#8221;, &#8220;no-cache&#8221;)<br \/>\nw.Header().Set(&#8220;Connection&#8221;, &#8220;keep-alive&#8221;)<\/p>\n<p>\/\/ Stream each chunk as it arrives<br \/>\nif len(chunk.Choices) &gt; 0 &amp;&amp; chunk.Choices[0].Delta.Content != &#8220;&#8221; {<br \/>\n    outputTokens++<br \/>\n    _, err := fmt.Fprintf(w, &#8220;%s&#8221;, chunk.Choices[0].Delta.Content)<br \/>\n    if err != nil {<br \/>\n        log.Printf(&#8220;Error writing to stream: %v&#8221;, err)<br \/>\n        return<br \/>\n    }<br \/>\n    w.(http.Flusher).Flush()<br \/>\n}<\/p>\n<\/div>\n<p>This streaming implementation ensures that tokens appear in real-time in the user interface, providing a smooth and responsive chat experience. You can also measure key performance metrics like time to first token and tokens per second.<\/p>\n<h3 class=\"wp-block-heading\">Performance measurement<\/h3>\n<p>You can measure various performance aspects of the model:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ Record first token time<br \/>\nif firstTokenTime.IsZero() &amp;&amp; len(chunk.Choices) &gt; 0 &amp;&amp;<br \/>\nchunk.Choices[0].Delta.Content != &#8220;&#8221; {<br \/>\n    firstTokenTime = time.Now()\n<p>    \/\/ For llama.cpp, record prompt evaluation time<br \/>\n    if strings.Contains(strings.ToLower(model), &#8220;llama&#8221;) ||<br \/>\n       strings.Contains(apiBaseURL, &#8220;llama.cpp&#8221;) {<br \/>\n        promptEvalTime := firstTokenTime.Sub(promptEvalStartTime)<br \/>\n        llamacppPromptEvalTime.WithLabelValues(model).Observe(promptEvalTime.Seconds())<br \/>\n    }<br \/>\n}<\/p>\n<p>\/\/ Calculate tokens per second for llama.cpp metrics<br \/>\nif strings.Contains(strings.ToLower(model), &#8220;llama&#8221;) ||<br \/>\n   strings.Contains(apiBaseURL, &#8220;llama.cpp&#8221;) {<br \/>\n    totalTime := time.Since(firstTokenTime).Seconds()<br \/>\n    if totalTime &gt; 0 &amp;&amp; outputTokens &gt; 0 {<br \/>\n        tokensPerSecond := float64(outputTokens) \/ totalTime<br \/>\n        llamacppTokensPerSecond.WithLabelValues(model).Set(tokensPerSecond)<br \/>\n    }<br \/>\n}<\/p>\n<\/div>\n<p>These measurements help us understand the model\u2019s performance characteristics and optimize the user experience.<\/p>\n<h3 class=\"wp-block-heading\">Metrics collection<\/h3>\n<p>The metrics.go file is a core component of our observability stack for the Docker Model Runner-based chatbot. 
This file defines a comprehensive set of Prometheus metrics that allow us to monitor both the application performance and the underlying llama.cpp model behavior.<\/p>\n<h3 class=\"wp-block-heading\">Core metrics architecture<\/h3>\n<p>The file establishes a collection of Prometheus metric types:<\/p>\n<p><strong>Counters<\/strong>: For tracking cumulative values (like request counts, token counts)<\/p>\n<p><strong>Gauges<\/strong>: For tracking values that can increase and decrease (like active requests)<\/p>\n<p><strong>Histograms<\/strong>: For measuring distributions of values (like latencies)<\/p>\n<p>Each metric is created using the promauto factory, which automatically registers metrics with Prometheus.<\/p>\n<h2 class=\"wp-block-heading\">Categories of metrics<\/h2>\n<p>The metrics can be divided into three main categories:<\/p>\n<h3 class=\"wp-block-heading\">1. HTTP and application metrics<\/h3>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ RequestCounter counts total HTTP requests<br \/>\nRequestCounter = promauto.NewCounterVec(<br \/>\n    prometheus.CounterOpts{<br \/>\n        Name: &#8220;genai_app_http_requests_total&#8221;,<br \/>\n        Help: &#8220;Total number of HTTP requests&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;method&#8221;, &#8220;endpoint&#8221;, &#8220;status&#8221;},<br \/>\n)\n<p>\/\/ RequestDuration measures HTTP request durations<br \/>\nRequestDuration = promauto.NewHistogramVec(<br \/>\n    prometheus.HistogramOpts{<br \/>\n        Name:    &#8220;genai_app_http_request_duration_seconds&#8221;,<br \/>\n        Help:    &#8220;HTTP request duration in seconds&#8221;,<br \/>\n        Buckets: prometheus.DefBuckets,<br \/>\n    },<br \/>\n    []string{&#8220;method&#8221;, &#8220;endpoint&#8221;},<br \/>\n)<\/p>\n<\/div>\n<p>These metrics monitor HTTP server performance, tracking request counts, durations, and error rates. The metrics are labelled with dimensions like method, endpoint, and status to enable detailed analysis.<\/p>\n<h3 class=\"wp-block-heading\">2. Model performance metrics<\/h3>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ ChatTokensCounter counts tokens in chat requests and responses<br \/>\nChatTokensCounter = promauto.NewCounterVec(<br \/>\n    prometheus.CounterOpts{<br \/>\n        Name: &#8220;genai_app_chat_tokens_total&#8221;,<br \/>\n        Help: &#8220;Total number of tokens processed in chat&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;direction&#8221;, &#8220;model&#8221;},<br \/>\n)\n<p>\/\/ ModelLatency measures model response time<br \/>\nModelLatency = promauto.NewHistogramVec(<br \/>\n    prometheus.HistogramOpts{<br \/>\n        Name:    &#8220;genai_app_model_latency_seconds&#8221;,<br \/>\n        Help:    &#8220;Model response time in seconds&#8221;,<br \/>\n        Buckets: []float64{0.1, 0.5, 1, 2, 5, 10, 20, 30, 60},<br \/>\n    },<br \/>\n    []string{&#8220;model&#8221;, &#8220;operation&#8221;},<br \/>\n)<\/p>\n<\/div>\n<p>These metrics track LLM usage patterns and performance, including token counts (both input and output) and overall latency. The FirstTokenLatency metric is particularly important as it measures the time to get the first token from the model, which is a critical user experience factor.<\/p>\n
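<p>Once Prometheus is scraping the backend, these metrics can be explored in the Prometheus UI at http:\/\/localhost:9091 or charted in Grafana. For example, queries along the following lines show generation speed and tail latency per model (the metric names come from metrics.go; the exact expressions are illustrative and may need adapting):<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n# Generation speed per model, as exported by the backend<br \/>\ngenai_app_llamacpp_tokens_per_second<br \/>\n<br \/>\n# 95th-percentile model latency over the last 5 minutes<br \/>\nhistogram_quantile(0.95, sum(rate(genai_app_model_latency_seconds_bucket[5m])) by (le, model))\n<\/div>\n<h3 class=\"wp-block-heading\">3. 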
llama.cpp specific metrics<\/h3>\n<div class=\"wp-block-syntaxhighlighter-code \">\n\/\/ LlamaCppContextSize measures the context window size<br \/>\nLlamaCppContextSize = promauto.NewGaugeVec(<br \/>\n    prometheus.GaugeOpts{<br \/>\n        Name: &#8220;genai_app_llamacpp_context_size&#8221;,<br \/>\n        Help: &#8220;Context window size in tokens for llama.cpp models&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;model&#8221;},<br \/>\n)\n<p>\/\/ LlamaCppTokensPerSecond measures generation speed<br \/>\nLlamaCppTokensPerSecond = promauto.NewGaugeVec(<br \/>\n    prometheus.GaugeOpts{<br \/>\n        Name: &#8220;genai_app_llamacpp_tokens_per_second&#8221;,<br \/>\n        Help: &#8220;Tokens generated per second&#8221;,<br \/>\n    },<br \/>\n    []string{&#8220;model&#8221;},<br \/>\n)<\/p>\n<\/div>\n<p>These metrics capture detailed performance characteristics specific to the llama.cpp inference engine used by Docker Model Runner. They include:<\/p>\n<h4 class=\"wp-block-heading\">1. Context Size<\/h4>\n<p>It represents the token window size used by the model, typically ranging from 2048 to 8192 tokens. The optimization goal is balancing memory usage against conversation quality. When memory usage becomes problematic, reduce context size to 2048 tokens for faster processing.<\/p>\n<h4 class=\"wp-block-heading\">2. Prompt Evaluation Time<\/h4>\n<p>It measures the time spent processing input before generating tokens, essentially your time-to-first-token latency, with a target of under 2 seconds. The optimization focus is minimizing user wait time for the initial response. If evaluation time exceeds 3 seconds, reduce context size or implement prompt compression techniques.<\/p>\n<h4 class=\"wp-block-heading\">3. Tokens Per Second<\/h4>\n<p>It indicates generation speed, with a target of 8+ TPS for good user experience. This metric requires balancing response speed with model quality. When TPS drops below 5, switch to more aggressive quantization (Q4 instead of Q8) or use a smaller model variant.<\/p>\n<h4 class=\"wp-block-heading\">4. Memory Per Token<\/h4>\n<p>It tracks RAM consumption per generated token, with optimization aimed at preventing out-of-memory crashes and optimizing resource usage. When memory consumption exceeds 100MB per token, implement aggressive conversation pruning to reduce memory pressure. If memory usage grows over time during extended conversations, add automatic conversation resets after a set number of exchanges.<\/p>\n<h4 class=\"wp-block-heading\">5. Threads Used<\/h4>\n<p>It monitors the number of CPU cores actively processing model operations, with the goal of maximizing throughput without overwhelming the system. If thread utilization falls below 50% of available cores, increase the thread count for better performance.<\/p>\n<h4 class=\"wp-block-heading\">6. Batch Size<\/h4>\n<p>It controls how many tokens are processed simultaneously, requiring optimization based on your specific use case, balancing latency versus throughput. 
For real-time chat applications, use smaller batches of 32-64 tokens to minimize latency and provide faster response times.<\/p>\n<p>In a nutshell, these metrics are crucial for understanding and optimizing llama.cpp performance characteristics, which directly affect the user experience of the chatbot.<\/p>\n<h2 class=\"wp-block-heading\">Docker Compose: LLM as a first-class service<\/h2>\n<p>With Docker Model Runner integration, Compose makes AI model deployment as simple as any other service. A single compose.yaml file defines your entire AI application:<\/p>\n<p>Your AI models (via Docker Model Runner)<\/p>\n<p>Application backend and frontend<\/p>\n<p>Observability stack (Prometheus, Grafana, Jaeger)<\/p>\n<p>All networking and dependencies<\/p>\n<p>The most innovative aspect is the llm service using Docker\u2019s model provider, which simplifies model deployment by directly integrating with Docker Model Runner without requiring complex configuration. This composition creates a complete, scalable AI application stack with comprehensive observability.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n  llm:<br \/>\n    provider:<br \/>\n      type: model<br \/>\n      options:<br \/>\n        model: ${LLM_MODEL_NAME:-ai\/llama3.2:1B-Q8_0}\n<\/div>\n<p>This configuration tells <a href=\"https:\/\/docs.docker.com\/compose\/\" target=\"_blank\">Docker Compose<\/a> to treat an AI model as a standard service in your application stack, just like a database or web server.\u00a0<\/p>\n<p>The provider syntax is Docker\u2019s new way of handling AI models natively. Instead of building containers or pulling images, Docker automatically manages the entire model-serving infrastructure for you.\u00a0<\/p>\n<p>The model: ${LLM_MODEL_NAME:-ai\/llama3.2:1B-Q8_0} line uses an environment variable with a fallback, meaning it will use whatever model you specify in LLM_MODEL_NAME, or default to Llama 3.2 1B if nothing is set.<\/p>\n
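<p>For example, to point the stack at a different model from the <a href=\"https:\/\/hub.docker.com\/catalogs\/models\" target=\"_blank\">Docker Hub model catalog<\/a>, you can pull it and override the variable when starting the stack. The commands below are a sketch, and the model tag is only an illustration:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n# Pull an alternative model (any model from the ai\/ catalog works)<br \/>\ndocker model pull ai\/smollm2<br \/>\n<br \/>\n# Start the stack, overriding the default model<br \/>\nLLM_MODEL_NAME=ai\/smollm2 docker compose up -d --build\n<\/div>\n<div class=\"wp-block-ponyo-table style__default\">\n<p>Docker Compose: One command to run your entire stack<\/p>\n<p>Why is this revolutionary? Before this, deploying an LLM required dozens of lines of complex configuration \u2013 custom Dockerfiles, GPU device mappings, volume mounts for model files, health checks, and intricate startup commands.<\/p>\n<p>Now, those four lines replace all of that complexity. Docker handles downloading the model, configuring the inference engine, setting up GPU access, and exposing the API endpoints automatically. Your other services can connect to the LLM using simple service names, making AI models as easy to use as any other infrastructure component. 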
This transforms AI from a specialized deployment challenge into standard infrastructure-as-code.<\/p>\n<\/div>\n<p>Here\u2019s the full compose.yaml file that orchestrates the entire application:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\nservices:<br \/>\n  backend:<br \/>\n    env_file: 'backend.env'<br \/>\n    build:<br \/>\n      context: .<br \/>\n      target: backend<br \/>\n    ports:<br \/>\n      - '8080:8080'<br \/>\n      - '9090:9090'  # Metrics port<br \/>\n    volumes:<br \/>\n      - \/var\/run\/docker.sock:\/var\/run\/docker.sock  # Add Docker socket access<br \/>\n    healthcheck:<br \/>\n      test: ['CMD', 'wget', '-qO-', 'http:\/\/localhost:8080\/health']<br \/>\n      interval: 3s<br \/>\n      timeout: 3s<br \/>\n      retries: 3<br \/>\n    networks:<br \/>\n      - app-network<br \/>\n    depends_on:<br \/>\n      - llm\n<p>  frontend:<br \/>\n    build:<br \/>\n      context: .\/frontend<br \/>\n    ports:<br \/>\n      - '3000:3000'<br \/>\n    depends_on:<br \/>\n      backend:<br \/>\n        condition: service_healthy<br \/>\n    networks:<br \/>\n      - app-network<\/p>\n<p>  prometheus:<br \/>\n    image: prom\/prometheus:v2.45.0<br \/>\n    volumes:<br \/>\n      - .\/prometheus\/prometheus.yml:\/etc\/prometheus\/prometheus.yml<br \/>\n    command:<br \/>\n      - '--config.file=\/etc\/prometheus\/prometheus.yml'<br \/>\n      - '--storage.tsdb.path=\/prometheus'<br \/>\n      - '--web.console.libraries=\/etc\/prometheus\/console_libraries'<br \/>\n      - '--web.console.templates=\/etc\/prometheus\/consoles'<br \/>\n      - '--web.enable-lifecycle'<br \/>\n    ports:<br \/>\n      - '9091:9090'<br \/>\n    networks:<br \/>\n      - app-network<\/p>\n<p>  grafana:<br \/>\n    image: grafana\/grafana:10.1.0<br \/>\n    volumes:<br \/>\n      - .\/grafana\/provisioning:\/etc\/grafana\/provisioning<br \/>\n      - grafana-data:\/var\/lib\/grafana<br \/>\n    environment:<br \/>\n      - GF_SECURITY_ADMIN_PASSWORD=admin<br \/>\n      - GF_USERS_ALLOW_SIGN_UP=false<br \/>\n      - GF_SERVER_DOMAIN=localhost<br \/>\n    ports:<br \/>\n      - '3001:3000'<br \/>\n    depends_on:<br \/>\n      - prometheus<br \/>\n    networks:<br \/>\n      - app-network<\/p>\n<p>  jaeger:<br \/>\n    image: jaegertracing\/all-in-one:1.46<br \/>\n    environment:<br \/>\n      - COLLECTOR_ZIPKIN_HOST_PORT=:9411<br \/>\n    ports:<br \/>\n      - '16686:16686'  # UI<br \/>\n      - '4317:4317'    # OTLP gRPC<br \/>\n      - '4318:4318'    # OTLP HTTP<br \/>\n    networks:<br \/>\n      - app-network<\/p>\n<p>  # New LLM service using Docker Compose's model provider<br \/>\n  llm:<br \/>\n    provider:<br \/>\n      type: model<br \/>\n      options:<br \/>\n        model: ${LLM_MODEL_NAME:-ai\/llama3.2:1B-Q8_0}<\/p>\n<p>volumes:<br \/>\n  grafana-data:<\/p>\n<p>networks:<br \/>\n  app-network:<br \/>\n    driver: bridge<\/p>\n<\/div>\n
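<p>The prometheus service above mounts .\/prometheus\/prometheus.yml. For the metrics described earlier to show up, that file needs a scrape job pointing at the backend\u2019s metrics port. A minimal sketch is shown below; the repository\u2019s actual configuration may differ:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n# prometheus\/prometheus.yml (illustrative)<br \/>\nglobal:<br \/>\n  scrape_interval: 5s\n<p>scrape_configs:<br \/>\n  # The backend exposes Prometheus-format metrics on port 9090<br \/>\n  - job_name: 'genai-backend'<br \/>\n    static_configs:<br \/>\n      - targets: ['backend:9090']<\/p>\n<\/div>\n<p>The compose.yaml file above defines a complete microservices architecture for the application with integrated observability tools and Model Runner support:<\/p>\n<p><strong>backend<\/strong><\/p>\n<p>Go-based API server with Docker 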
socket access for container management<\/p>\n<p>Implements health checks and exposes both API (8080) and metrics (9090) ports<\/p>\n<p><strong>frontend<\/strong><\/p>\n<p>React-based user interface for an interactive chat experience<\/p>\n<p>Waits for backend health before starting to ensure system reliability<\/p>\n<p><strong>prometheus<\/strong><\/p>\n<p>Time-series metrics database for collecting and storing performance data<\/p>\n<p>Configured with custom settings for monitoring application behavior<\/p>\n<p><strong>grafana<\/strong><\/p>\n<p>Data visualization platform for metrics with persistent dashboard storage<\/p>\n<p>Pre-configured with admin access and connected to the Prometheus data source<\/p>\n<p><strong>jaeger<\/strong><\/p>\n<p>Distributed tracing system for visualizing request flows across services<\/p>\n<p>Supports multiple protocols (gRPC\/HTTP) with UI on port 16686<\/p>\n<h3 class=\"wp-block-heading\">How Docker Model Runner integration works<\/h3>\n<p>The project integrates with Docker Model Runner through the following mechanisms:<\/p>\n<p><strong>Connection Configuration<\/strong>:<\/p>\n<p>Using internal DNS: http:\/\/model-runner.docker.internal\/engines\/llama.cpp\/v1\/<\/p>\n<p>Using TCP via host-side support: localhost:12434<\/p>\n<p><strong>Docker\u2019s Host Networking<\/strong>:<\/p>\n<p>When needed, an extra_hosts entry maps host.docker.internal to the host\u2019s gateway IP<\/p>\n<p><strong>Environment Variables<\/strong> (an example backend.env follows this list):<\/p>\n<p>BASE_URL: URL for the model runner<\/p>\n<p>MODEL: Model identifier (e.g., ai\/llama3.2:1B-Q8_0)<\/p>\n<p><strong>API Communication<\/strong>:<\/p>\n<p>The backend formats messages and sends them to Docker Model Runner<\/p>\n<p>It then streams tokens back to the frontend in real-time<\/p>\n
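<p>Putting these settings together, a backend.env for the TCP connection mode might look like the following sketch (the variable names follow the list above; adjust the values for your setup):<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n# backend.env (illustrative values)<br \/>\n# TCP endpoint exposed by Docker Model Runner on the host<br \/>\nBASE_URL=http:\/\/host.docker.internal:12434\/engines\/llama.cpp\/v1\/<br \/>\n# Model identifier to request from the Model Runner<br \/>\nMODEL=ai\/llama3.2:1B-Q8_0\n<\/div>\n<h3 class=\"wp-block-heading\">Why this approach excels<\/h3>\n<p>Building GenAI applications with Docker Model Runner and comprehensive observability offers several advantages:<\/p>\n<p><strong>Privacy and Security<\/strong>: All data stays on your local infrastructure<\/p>\n<p><strong>Cost Control<\/strong>: No per-token or per-request API charges<\/p>\n<p><strong>Performance Insights<\/strong>: Deep visibility into model behavior and efficiency<\/p>\n<p><strong>Developer Experience<\/strong>: Familiar Docker-based workflow with powerful monitoring<\/p>\n<p><strong>Flexibility<\/strong>: Easy to experiment with different models and configurations<\/p>\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n<p>The genai-model-runner-metrics project demonstrates a powerful approach to building AI-powered applications with Docker Model Runner while maintaining visibility into performance characteristics. By combining local model execution with comprehensive metrics, you get the best of both worlds: the privacy and cost benefits of local execution with the observability needed for production applications.<\/p>\n<p>Whether you\u2019re building a customer support bot, a content generation tool, or a specialized AI assistant, this architecture provides the foundation for reliable, observable, and efficient AI applications. The metrics-driven approach ensures you can continuously monitor and optimize your application, leading to better user experiences and more efficient resource utilization.<\/p>\n<p>Ready to get started? 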
Clone the repository, <a href=\"https:\/\/www.docker.com\/products\/docker-desktop\/\">fire up Docker Desktop<\/a>, and experience the future of AI development \u2014 your own local, metrics-driven GenAI application is just a docker compose up away!<\/p>\n\n\n<h3 class=\"wp-block-heading\">Learn more<\/h3>\n<p>Read our quickstart guide to <a href=\"https:\/\/www.docker.com\/blog\/run-llms-locally\/\">Docker Model Runner<\/a>.<\/p>\n<p>Find documentation for <a href=\"https:\/\/docs.docker.com\/model-runner\/\" target=\"_blank\">Model Runner<\/a>.<\/p>\n<p>Subscribe to the <a href=\"https:\/\/www.docker.com\/newsletter-subscription\/\">Docker Navigator Newsletter<\/a>.<\/p>\n<p>New to Docker? <a href=\"https:\/\/hub.docker.com\/signup?_gl=1*1v81gq1*_gcl_au*MTQxNjU3MjYxNS4xNzQyMjI1MTk2*_ga*MTMxODI0ODQ4LjE3NDE4MTI3NTA.*_ga_XJWPQMJYHQ*czE3NDg0NTYyNzIkbzI2JGcxJHQxNzQ4NDU2MzI2JGo2JGwwJGgw\" target=\"_blank\">Create an account<\/a>.\u00a0<\/p>\n<p>Have questions? The <a href=\"https:\/\/www.docker.com\/community\/\">Docker community is here to help<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Today, we\u2019ll show you how to build a fully functional Generative AI chatbot using Docker Model Runner and powerful observability [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[4],"tags":[],"class_list":["post-2088","post","type-post","status-publish","format-standard","hentry","category-docker"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2088","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=2088"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2088\/revisions"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=2088"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=2088"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=2088"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}