{"id":3426,"date":"2026-02-13T14:10:40","date_gmt":"2026-02-13T14:10:40","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/02\/13\/how-to-solve-the-context-size-issues-with-context-packing-with-docker-model-runner-and-agentic-compose\/"},"modified":"2026-02-13T14:10:40","modified_gmt":"2026-02-13T14:10:40","slug":"how-to-solve-the-context-size-issues-with-context-packing-with-docker-model-runner-and-agentic-compose","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/02\/13\/how-to-solve-the-context-size-issues-with-context-packing-with-docker-model-runner-and-agentic-compose\/","title":{"rendered":"How to solve the context size issues with context packing with Docker Model Runner and Agentic Compose"},"content":{"rendered":"<p>If you\u2019ve worked with local language models, you\u2019ve probably run into the context window limit, especially when using smaller models on less powerful machines. While it\u2019s an unavoidable constraint, techniques like context packing make it surprisingly manageable.<\/p>\n<p>Hello, I\u2019m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker.\u00a0 In my previous <a href=\"https:\/\/www.docker.com\/blog\/making-small-llms-smarter\/\">blog post<\/a>, I wrote about how to make a very small model useful by using RAG. I had limited the message history to 2 to keep the context length short.<\/p>\n<p>But in some cases, you\u2019ll need to keep more messages in your history. For example, a long conversation to generate code:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\n- generate an http server server in golang\n- add a human structure and a list of humans\n- add a handler to add a human to the list\n- add a handler to list all humans\n- add a handler to get a human by id\n- etc...\n\n<\/pre>\n<\/div>\n<p>Let\u2019s imagine we have a conversation for which we want to keep 10 messages in the history. Moreover, we\u2019re using a very verbose model (which a lot of tokens), so we\u2019ll quickly encounter this type of error:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\nerror: {\n    code: 400,\n    message: 'request (8860 tokens) exceeds the available context size (8192 tokens), try increasing it',\n    type: 'exceed_context_size_error',\n    n_prompt_tokens: 8860,\n    n_ctx: 8192\n  },\n  code: 400,\n  param: undefined,\n  type: 'exceed_context_size_error'\n}\n\n<\/pre>\n<\/div>\n<p>What happened?<\/p>\n<h2 class=\"wp-block-heading\">Understanding context windows and their limits in local LLMs<\/h2>\n<p>Our LLM has a context window, which has a limited size. 
<p>This window is the total number of <strong>tokens<\/strong> the model can process at once, like a short-term working memory.\u00a0 Read this <a href=\"https:\/\/www.ibm.com\/think\/topics\/context-window\" rel=\"nofollow noopener\" target=\"_blank\">IBM article<\/a> for a deep dive on context windows.<\/p>\n<div class=\"style-plain wp-block-ponyo-houston\">\n<p><em>In our example in the code snippet above, this size was set to 8192 tokens by the LLM engine powering the local LLM; engines like Docker Model Runner, Ollama, Llamacpp, \u2026 all expose this setting.<\/em><\/p>\n<\/div>\n<p>This window includes everything: <strong>system prompt<\/strong>, <strong>user message<\/strong>, <strong>history<\/strong>, <strong>injected documents,<\/strong> and the <strong>generated response<\/strong>. Refer to this <a href=\"https:\/\/redis.io\/blog\/llm-context-windows\/\" rel=\"nofollow noopener\" target=\"_blank\">Redis post<\/a> for more info.\u00a0<\/p>\n<p><strong>Example<\/strong>: if the model has 32k context, the sum (input + history + generated output) must remain \u2264 32k tokens. Learn more <a href=\"https:\/\/www.matterai.so\/blog\/understanding-llm-context-window\" rel=\"nofollow noopener\" target=\"_blank\">here<\/a>.\u00a0\u00a0<\/p>\n<p>It\u2019s possible to change the default context size (up or down) in the compose.yml file:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\">\nmodels:\n  chat-model:\n    model: hf.co\/qwen\/qwen2.5-coder-3b-instruct-gguf:q4_k_m\n    # Increased context size for better handling of larger inputs\n    context_size: 16384\n\n<\/pre>\n<\/div>\n<div class=\"style-plain wp-block-ponyo-houston\">\n<p><em>You can also do this with Docker with the following command: docker model configure --context-size 8192 ai\/qwen2.5-coder<\/em><\/p>\n<\/div>\n<p>And so we solve the problem, <strong>but only part of the problem<\/strong>. Indeed, it\u2019s not guaranteed that your model supports a larger context size (like 16384), and even if it does, it can very quickly degrade the model\u2019s performance.<\/p>\n<p>Thus, with hf.co\/qwen\/qwen2.5-coder-3b-instruct-gguf:q4_k_m, when the number of tokens in the context approaches 16384 tokens, <strong>generation can become (much) slower<\/strong> (at least on my machine). Again, this will depend on the model\u2019s capacity (read its documentation). And remember, the smaller the model, the harder it will be to handle a large context and stay focused.<\/p>\n<div class=\"style-plain wp-block-ponyo-houston\">\n<p><strong><em>Tip:<\/em><\/strong><em> always provide an option (a \/clear command, for example) in your application to empty or shrink the message list, automatically or manually. Keep the initial system instructions, though.<\/em><\/p>\n<\/div>\n
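<p>For example, a minimal \/clear handler (just a sketch; it assumes the same [role, content] message pairs used later in this post) could look like this:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; gutter: false; title: ; notranslate\">\n\/\/ Sketch of a \/clear command: drop the history but keep the initial system instructions\nfunction handleCommand(input, messages) {\n  if (input.trim() === '\/clear') {\n    \/\/ keep only the messages with the 'system' role\n    const kept = messages.filter(([role]) =&gt; role === 'system');\n    console.log(`history cleared, ${kept.length} system message(s) kept`);\n    return kept;\n  }\n  return messages;\n}\n\n<\/pre>\n<\/div>\n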
<p>So we\u2019re at an impasse. How can we go further with our small models?<\/p>\n<p>Well, there is still a solution, which is called <strong>context packing<\/strong>.<\/p>\n<h2 class=\"wp-block-heading\">Using context packing to fit more information into limited context windows<\/h2>\n<p>We can\u2019t indefinitely increase the context size. To still fit more information into the context, we can use a technique called <strong>\u201ccontext packing\u201d<\/strong>: the model itself (or another model, if we entrust it with the task) summarizes the previous messages, and we <strong>replace the history<\/strong> with this summary, freeing up space in the context.<\/p>\n<p><strong>So we decide that beyond a certain token limit, we\u2019ll have the history of previous messages summarized, and replace that history with the generated summary.<\/strong><\/p>\n<p>I\u2019ve therefore modified my example to add a <strong>context packing<\/strong> step. For the exercise, I decided to use another model to do the summarization.<\/p>\n<h3 class=\"wp-block-heading\">Modification of the compose.yml file<\/h3>\n<p>I added a new model in the compose.yml file: ai\/qwen2.5:1.5B-F16<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\">\nmodels:\n  chat-model:\n    model: hf.co\/qwen\/qwen2.5-coder-3b-instruct-gguf:q4_k_m\n\n  embedding-model:\n    model: ai\/embeddinggemma:latest\n\n  context-packing-model:\n    model: ai\/qwen2.5:1.5B-F16\n\n<\/pre>\n<\/div>\n<p>Then:<\/p>\n<ul class=\"wp-block-list\">\n<li>I added the new model in the models section of the service that runs our program.<\/li>\n<li>I increased the number of messages in the history to 10 (instead of 2 previously).<\/li>\n<li>I set a token limit of 5120 before triggering context compression.<\/li>\n<li>And finally, I defined instructions for the \u201ccontext packing\u201d model, asking it to summarize previous messages.<\/li>\n<\/ul>\n<p><em>excerpt from the service:<\/em><\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\">\ngolang-expert-v3:\n  build:\n    context: .\n    dockerfile: Dockerfile\n  environment:\n    HISTORY_MESSAGES: 10\n    TOKEN_LIMIT: 5120\n    # ...\n\n  configs:\n    - source: system.instructions.md\n      target: \/app\/system.instructions.md\n    - source: context-packing.instructions.md\n      target: \/app\/context-packing.instructions.md\n\n  models:\n    chat-model:\n      endpoint_var: MODEL_RUNNER_BASE_URL\n      model_var: MODEL_RUNNER_LLM_CHAT\n\n    context-packing-model:\n      endpoint_var: MODEL_RUNNER_BASE_URL\n      model_var: MODEL_RUNNER_LLM_CONTEXT_PACKING\n\n    embedding-model:\n      endpoint_var: MODEL_RUNNER_BASE_URL\n      model_var: MODEL_RUNNER_LLM_EMBEDDING\n\n<\/pre>\n<\/div>\n<p><em>You\u2019ll find the complete version of the file here: <\/em><a href=\"https:\/\/codeberg.org\/k33g-blog\/docker-posts\/src\/branch\/main\/2026-01-19-smaller-context\/02-context-packing\/compose.yml\" rel=\"nofollow noopener\" target=\"_blank\"><em>compose.yml<\/em><\/a><\/p>\n
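<p>At runtime, this models section is what wires the service to Docker Model Runner: for each model, Compose injects the declared variables into the container\u2019s environment. A quick way to check them from the assistant (just a sketch, using the variable names declared above):<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; gutter: false; title: ; notranslate\">\n\/\/ The Model Runner endpoint and the model identifiers arrive as environment variables,\n\/\/ with the names declared in the compose.yml excerpt above.\nconsole.log('endpoint        :', process.env.MODEL_RUNNER_BASE_URL);\nconsole.log('chat model      :', process.env.MODEL_RUNNER_LLM_CHAT);\nconsole.log('packing model   :', process.env.MODEL_RUNNER_LLM_CONTEXT_PACKING);\nconsole.log('embedding model :', process.env.MODEL_RUNNER_LLM_EMBEDDING);\nconsole.log('history \/ limit :', process.env.HISTORY_MESSAGES, process.env.TOKEN_LIMIT);\n\n<\/pre>\n<\/div>\n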
<h3 class=\"wp-block-heading\">System instructions for the context packing model<\/h3>\n<p>Still in the compose.yml file, I added a new system instruction for the \u201ccontext packing\u201d model, in a context-packing.instructions.md file:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; gutter: false; title: ; notranslate\">\ncontext-packing.instructions.md:\n  content: |\n    You are a context packing assistant.\n    Your task is to condense and summarize provided content to fit within token limits while preserving essential information.\n    Always:\n    - Retain key facts, figures, and concepts\n    - Remove redundant or less important details\n    - Ensure clarity and coherence in the condensed output\n    - Aim to reduce the token count significantly without losing critical information\n\n    The goal is to help fit more relevant information into a limited context window for downstream processing.\n\n<\/pre>\n<\/div>\n<p>All that\u2019s left is to implement the context packing logic in the assistant\u2019s code.<\/p>\n<h2 class=\"wp-block-heading\">Applying context packing to the assistant\u2019s code<\/h2>\n<p>First, I define the connection with the context packing model in the Setup part of my assistant:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; title: ; notranslate\">\nconst contextPackingModel = new ChatOpenAI({\n  model: process.env.MODEL_RUNNER_LLM_CONTEXT_PACKING || `ai\/qwen2.5:1.5B-F16`,\n  apiKey: \"\",\n  configuration: {\n    baseURL: process.env.MODEL_RUNNER_BASE_URL || \"http:\/\/localhost:12434\/engines\/llama.cpp\/v1\/\",\n  },\n  temperature: 0.0,\n  top_p: 0.9,\n  presencePenalty: 2.2,\n});\n\n<\/pre>\n<\/div>\n<p>I also retrieve the system instructions I defined for this model, as well as the token limit:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; gutter: false; title: ; notranslate\">\nlet contextPackingInstructions = fs.readFileSync('\/app\/context-packing.instructions.md', 'utf8');\n\nlet tokenLimit = parseInt(process.env.TOKEN_LIMIT) || 7168\n\n<\/pre>\n<\/div>\n<p>Once in the conversation loop, I\u2019ll estimate the <strong>number of tokens consumed by previous messages<\/strong>, and if this number exceeds the <strong>defined limit<\/strong>, I\u2019ll call the context packing model to summarize the history of previous messages and replace this history with the generated summary (the assistant-type message: [\u201cassistant\u201d, summary]). Then I continue generating the response using the main model.<\/p>\n
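<p>The token estimate in the loop below relies on a rough rule of thumb: about four characters per token of English text. As a standalone helper it would look like this sketch (the real code simply inlines it); for exact counts you would need the model\u2019s tokenizer, but the approximation is good enough to decide when to compress:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; gutter: false; title: ; notranslate\">\n\/\/ Rough token estimate: ~4 characters per token (a heuristic, not a real tokenizer)\nfunction estimateTokens(messages) {\n  \/\/ messages is a list of [role, content] pairs\n  return messages.reduce((total, [role, content]) =&gt; total + Math.ceil(content.length \/ 4), 0);\n}\n\n<\/pre>\n<\/div>\n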
<p><em>excerpt from the conversation loop:<\/em><\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: jscript; title: ; notranslate\">\n  let estimatedTokenCount = messages.reduce((acc, [role, content]) =&gt; acc + Math.ceil(content.length \/ 4), 0);\n  console.log(` Estimated token count for messages: ${estimatedTokenCount} tokens`);\n\n  if (estimatedTokenCount &gt;= tokenLimit) {\n    console.log(` Warning: Estimated token count (${estimatedTokenCount}) exceeds the model's context limit (${tokenLimit}). Compressing conversation history...`);\n\n    \/\/ Calculate original history size\n    const originalHistorySize = history.reduce((acc, [role, content]) =&gt; acc + Math.ceil(content.length \/ 4), 0);\n\n    \/\/ Prepare messages for context packing\n    const contextPackingMessages = [\n      [\"system\", contextPackingInstructions],\n      ...history,\n      [\"user\", \"Please summarize the above conversation history to reduce its size while retaining important information.\"]\n    ];\n\n    \/\/ Generate summary using context packing model\n    console.log(\" Generating summary with context packing model...\");\n    let summary = '';\n    const summaryStream = await contextPackingModel.stream(contextPackingMessages);\n    for await (const chunk of summaryStream) {\n      summary += chunk.content;\n      \/\/ print the streamed summary in green (ANSI escape codes)\n      process.stdout.write('\\x1b[32m' + chunk.content + '\\x1b[0m');\n    }\n    console.log();\n\n    \/\/ Calculate compressed size\n    const compressedSize = Math.ceil(summary.length \/ 4);\n    const reductionPercentage = ((originalHistorySize - compressedSize) \/ originalHistorySize * 100).toFixed(2);\n\n    console.log(` History compressed: ${originalHistorySize} tokens \u2192 ${compressedSize} tokens (${reductionPercentage}% reduction)`);\n\n    \/\/ Replace all history with the summary\n    conversationMemory.set(\"default-session-id\", [[\"assistant\", summary]]);\n\n    estimatedTokenCount = compressedSize;\n\n    \/\/ Rebuild messages with compressed history\n    messages = [\n      [\"assistant\", summary],\n      [\"system\", systemInstructions],\n      [\"system\", knowledgeBase],\n      [\"user\", userMessage]\n    ];\n  }\n\n<\/pre>\n<\/div>\n<p><em>You\u2019ll find the complete version of the code here: <\/em><a href=\"https:\/\/codeberg.org\/k33g-blog\/docker-posts\/src\/branch\/main\/2026-01-19-smaller-context\/02-context-packing\/index.js\" rel=\"nofollow noopener\" target=\"_blank\"><em>index.js<\/em><\/a><\/p>\n<p>All that\u2019s left is to test our assistant and have it hold a long conversation, to see context packing in action.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: bash; gutter: false; title: ; notranslate\">\ndocker compose up --build -d\ndocker compose exec golang-expert-v3 node index.js\n\n<\/pre>\n<\/div>\n<p>And after a while in the conversation, you should see the warning message about the token limit, followed by the summary generated by the context packing model, and finally, the reduction in the number of tokens in the history:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: plain; title: ; notranslate\">\nEstimated token count for messages: 5984 tokens\nWarning: Estimated token count (5984) exceeds the model's context limit (5120). Compressing conversation history...\nGenerating summary with context packing model...\nSure, here's a summary of the conversation:\n\n1. The user asked for an example in Go of creating an HTTP server.\n2. The assistant provided a simple example in Go that creates an HTTP server and handles GET requests to display \"Hello, World!\".\n3. The user requested an equivalent example in Java.\n4. 
The assistant presented a Java implementation that uses the `java.net.http` package to create an HTTP server and handle incoming requests.\n\nThe conversation focused on providing examples of creating HTTP servers in both Go and Java, with the goal of reducing the token count while retaining essential information.\nHistory compressed: 4886 tokens \u2192 153 tokens (96.87% reduction)\n\n<\/pre>\n<\/div>\n<p>This way, we ensure that our assistant can handle a long conversation while maintaining good generation performance.<\/p>\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n<p>The context window is an unavoidable constraint when working with local language models, particularly with small models and on machines with limited resources. However, by using techniques like context packing, you can easily work around this limitation. Using Docker Model Runner and Agentic Compose, you can implement this pattern to support long, verbose conversations without overwhelming your model.<\/p>\n<p>All the source code is available on Codeberg: <a href=\"https:\/\/codeberg.org\/k33g-blog\/docker-posts\/src\/branch\/main\/2026-01-19-smaller-context\/02-context-packing\" rel=\"nofollow noopener\" target=\"_blank\">context-packing<\/a>. Give it a try!\u00a0<\/p>","protected":false},"excerpt":{"rendered":"<p>If you\u2019ve worked with local language models, you\u2019ve probably run into the context window limit, especially when using smaller models [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":94,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[4],"tags":[],"class_list":["post-3426","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-docker"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3426","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=3426"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3426\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/94"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=3426"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=3426"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=3426"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}