{"id":3259,"date":"2026-01-16T14:20:27","date_gmt":"2026-01-16T14:20:27","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/01\/16\/making-very-small-llms-smarter\/"},"modified":"2026-01-16T14:20:27","modified_gmt":"2026-01-16T14:20:27","slug":"making-very-small-llms-smarter","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2026\/01\/16\/making-very-small-llms-smarter\/","title":{"rendered":"Making (Very) Small LLMs Smarter"},"content":{"rendered":"<p>Hello, I\u2019m Philippe, and I am a Principal Solutions Architect helping customers with their usage of Docker. I started getting seriously interested in generative AI about two years ago. What interests me most is the ability to run language models (LLMs) directly on my laptop (For work, I have a MacBook Pro M2 max, but on a more personal level, I run LLMs on my personal MacBook Air M4 and on Raspberry Pis \u2013 yes, it\u2019s possible, but I\u2019ll talk about that another time).<\/p>\n<p>Let\u2019s be clear, reproducing a Claude AI Desktop or Chat GPT on a laptop with small language models is not possible. Especially since I limit myself to models that have between 0.5 and 7 billion parameters. But I find it an interesting challenge to see how far we can go with these small models. So, can we do really useful things with small LLMs? The answer is yes, but you need to be creative and put in a bit of effort.<\/p>\n<p>I\u2019m going to take a concrete use case, related to development (but in the future I\u2019ll propose \u201cless technical\u201d use cases).<\/p>\n<h2 class=\"wp-block-heading\">(Specific) Use Case: Code Writing Assistance<\/h2>\n<h3 class=\"wp-block-heading\">I need help writing code<\/h3>\n<p>Currently, I\u2019m working in my free time on an open-source project, which is a Golang library for quickly developing small generative AI agents. It\u2019s both to get my hands dirty with Golang and prepare tools for other projects. This project is called <strong>Nova<\/strong>; there\u2019s nothing secret about it, you can find it <a href=\"https:\/\/github.com\/SnipWise\/nova\" rel=\"nofollow noopener\" target=\"_blank\">here<\/a>.<\/p>\n<p>If I use <strong>Claude AI<\/strong> and ask it to help me write code with <strong>Nova<\/strong>: <em>\u201cI need a code snippet of a Golang Nova Chat agent using a stream completion.\u201d<\/em><\/p>\n<div class=\"wp-block-ponyo-image\">\n                <img data-opt-id=2136621886  fetchpriority=\"high\" decoding=\"async\" width=\"1000\" height=\"743\" src=\"https:\/\/www.docker.com\/app\/uploads\/2026\/01\/tiny-model-fig-1.png\" class=\"fade-in attachment-full size-full\" alt=\"tiny model fig 1\" title=\"- tiny model fig 1\" \/>\n        <\/div>\n<p>The response will be quite disappointing, because <strong>Claude<\/strong> doesn\u2019t know <strong>Nova<\/strong> (which is normal, it\u2019s a recent project). 
But Claude doesn't want to disappoint me, so it will still propose something that has nothing to do with my project.

And it will be the same with **Gemini**.

![tiny model fig 2](https://www.docker.com/app/uploads/2026/01/tiny-model-fig-2.png)

So, you'll tell me: just feed the source code of your repository to Claude AI or Gemini. OK, but imagine the following situation: I don't have access to these services, for various reasons, such as confidentiality, or because I'm on a project where we aren't allowed to use the internet. That already disqualifies Claude AI and Gemini. So how can I get help writing code? As you guessed: with a local LLM. And moreover, a "very small" LLM.

## Choosing a language model

When you develop a solution based on generative AI, the choice of language model(s) is crucial, and you'll have to do a lot of technology watching, research, and testing to find the model that best fits your use case. Know that this is non-negligible work.

For this article (and also because I use it), I'm going to use `hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m`, which you can find [here](https://huggingface.co/qwen/qwen2.5-coder-3b-instruct-gguf). It's a 3-billion-parameter language model optimized for code generation. You can install it with Docker Model Runner with the following command:

```bash
docker model pull hf.co/Qwen/Qwen2.5-Coder-3B-Instruct-GGUF:Q4_K_M
```

And to start chatting with the model, you can use the following command:

```bash
docker model run hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m
```

Or use Docker Desktop:

![tiny model fig 3](https://www.docker.com/app/uploads/2026/01/tiny-model-fig-3.png)

Of course, as you can see in the illustration above, this little **Qwen Coder** doesn't know my **Nova** library either. But we're going to fix that.

## Feeding the model with specific information

For my project, I have a markdown file in which I save the code snippets I use to develop examples with **Nova**.
You can find it [here](https://codeberg.org/k33g-blog/docker-posts/src/branch/main/2026-01-12-make-tlm-smarter-js/02-streaming-completion-and-rag/data/snippets.md). For now, there's little content, but it will be enough to illustrate my point.

I could add the entire content of this file to a user prompt for the model, but that would be ineffective. Small models have a relatively small context window, and even if my **Qwen Coder** were capable of ingesting the entire markdown file, it would have trouble focusing on my request and on what it should do with all this information. So:

- **1st essential rule: when you use a very small LLM, the larger the content provided to the model, the less effective the model will be.**
- **2nd essential rule: the more conversation history you keep, the more the content provided to the model grows, and the less effective the model will be.**

To work around this problem, I'm going to use a technique called **RAG** (Retrieval-Augmented Generation). The principle is simple: instead of providing all the content to the model, we store this content in a "vector" database, and when the user makes a request, we search this database for the information most relevant to the request. Then we provide only this relevant information to the language model. For this blog post, the data will be kept in memory (which is not optimal, but sufficient for a demonstration).

### RAG?

There are already many articles on the subject, so I won't go into detail. But here's what I'm going to do for this blog post:

1. My snippets file is composed of sections: a markdown title (`## snippet name`), possibly a description in free text, and a `golang` fenced code block.
2. I'm going to split this file by sections into chunks of text (see the sketch after this list).
3. Then, **for each section**, I'm going to create an **"embedding"** (a vector representation of the text, that is, a mathematical representation of its semantic meaning) with the `ai/embeddinggemma:latest` model (a relatively small and efficient embedding model). Then I'm going to store these embeddings (**and the associated text**) in an in-memory vector database (a simple array of JSON objects).
4. If you want to learn more about embeddings, please read this article: [Run Embedding Models and Unlock Semantic Search with Docker Model Runner](https://www.docker.com/blog/run-embedding-models-for-semantic-search/)
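
To make step 2 concrete, here is a minimal sketch of such a section splitter. It is illustrative only (the project's actual `chunks.js` is linked further down) and assumes every snippet starts with a `## ` heading:

```javascript
// Split a markdown document into one chunk per "## ..." section.
// Each chunk keeps its heading, description, and code block together.
function splitMarkdownBySections(markdown) {
  return markdown
    .split(/^(?=## )/m) // zero-width lookahead: split before each heading, keeping it
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}
```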

**Diagram of the vector database creation process:**

![tiny model fig 4](https://www.docker.com/app/uploads/2026/01/tiny-model-fig-4.png)

### Similarity search and user prompt construction

Once I have this in place, when I make a request to the language model (so `hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m`), I'm going to:

1. Create an embedding of the user's request with the embedding model.
2. Compare this embedding with the embeddings stored in the vector database to find the most relevant sections, by calculating the distance between the vector representation of my question and the vector representations of the snippets. This is called a similarity search.
3. From the most relevant (most similar) sections, construct a user prompt that includes only the relevant information and my initial request.

**Diagram of the search and user prompt construction process:**

![tiny model fig 5](https://www.docker.com/app/uploads/2026/01/tiny-model-fig-5.png)
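
The "distance" in step 2 is computed with cosine similarity (more on this in the remarks below). As an illustration, a straightforward implementation looks like this (a sketch; the function I actually used is linked in the remarks that follow):

```javascript
// Cosine similarity between two embedding vectors of equal length.
// 1 = same direction, 0 = orthogonal (no directional similarity).
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```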

So the final user prompt will contain:

- The system instructions. For example: *"You are a helpful coding assistant specialized in Golang and the Nova library. Use the provided code snippets to help the user with their requests."*
- The relevant sections extracted from the vector database.
- The user's request.

**Remarks**:

- I explain the principles and results here, but all the source code (Node.js with LangChainJS) used to arrive at my conclusions is available in this [project](https://codeberg.org/k33g-blog/docker-posts/src/branch/main/2026-01-12-make-tlm-smarter-js/02-streaming-completion-and-rag).
- To calculate distances between vectors, I used [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity), as sketched above: a score of 1 indicates that the vectors point in the same direction, while a score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.
- You can find the JavaScript function I used [here](https://codeberg.org/k33g-blog/docker-posts/src/branch/main/2026-01-12-make-tlm-smarter-js/02-streaming-completion-and-rag/rag.js#L12) (a self-contained retrieval sketch also follows these remarks).
- And the piece of code I use to split the [markdown snippets file](https://codeberg.org/k33g-blog/docker-posts/src/branch/main/2026-01-12-make-tlm-smarter-js/02-streaming-completion-and-rag/chunks.js).
- **Warning**: embedding models are limited in the size of text chunks they can ingest, so be careful not to exceed this size when splitting the source file. In some cases, you'll have to change the splitting strategy (fixed-size chunks, for example, with or without overlap).
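
Putting these pieces together, the retrieval step amounts to embedding the question, scoring every stored chunk, and keeping the best matches above a threshold. Here is a minimal, self-contained sketch (illustrative only; it assumes the `cosineSimilarity` function above and a simple array of `{ embedding, text }` records, not the project's actual `MemoryVectorStore` API):

```javascript
// Return the topN chunks whose embeddings are most similar to the question,
// keeping only scores above the given cosine similarity threshold.
async function findRelevantChunks(embeddingsModel, records, question, topN, threshold) {
  const questionEmbedding = await embeddingsModel.embedQuery(question);
  return records
    .map((record) => ({
      text: record.text,
      score: cosineSimilarity(questionEmbedding, record.embedding),
    }))
    .filter((candidate) => candidate.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}
```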

## Implementation and results, or creating my Golang expert agent

Now that we have the operating principle, let's see how to put this into practice with **LangChainJS**, **Docker Model Runner**, and **Docker Agentic Compose**.

### Docker Agentic Compose configuration

Let's start with the Docker Agentic Compose project structure:

```yaml
services:
  golang-expert:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      TERM: xterm-256color

      HISTORY_MESSAGES: 2
      MAX_SIMILARITIES: 3
      COSINE_LIMIT: 0.45

      OPTION_TEMPERATURE: 0.0
      OPTION_TOP_P: 0.75
      OPTION_PRESENCE_PENALTY: 2.2

      CONTENT_PATH: /app/data

    volumes:
      - ./data:/app/data

    stdin_open: true   # docker run -i
    tty: true          # docker run -t

    configs:
      - source: system.instructions.md
        target: /app/system.instructions.md

    models:
      chat-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_CHAT

      embedding-model:
        endpoint_var: MODEL_RUNNER_BASE_URL
        model_var: MODEL_RUNNER_LLM_EMBEDDING

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.
```

What's important here:

I only keep the last 2 messages in my conversation history, and I select at most the 3 best similarities found, to limit the size of the user prompt (`COSINE_LIMIT` is the minimum similarity score a snippet must reach to be kept):

```yaml
HISTORY_MESSAGES: 2
MAX_SIMILARITIES: 3
COSINE_LIMIT: 0.45
```

You can adjust these values according to your use case and your language model's capabilities.

Next, the `models` section, where I define the language models I'm going to use:

```yaml
models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest
```

One of the advantages of this section is that it allows Docker Compose to download the models if they're not already present on your machine.

There is also the `models` section of the `golang-expert` service, where I map environment variables to the models defined above:

```yaml
models:
  chat-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_CHAT

  embedding-model:
    endpoint_var: MODEL_RUNNER_BASE_URL
    model_var: MODEL_RUNNER_LLM_EMBEDDING
```

And finally, the system instructions configuration file:

```yaml
configs:
  - source: system.instructions.md
    target: /app/system.instructions.md
```

Which I define a bit further down in the `configs` section:

```yaml
configs:
  system.instructions.md:
    content: |
      Your name is Bob (the original replicant).
      You are an expert programming assistant in Golang.
      You write clean, efficient, and well-documented code.
      Always:
      - Provide complete, working code
      - Include error handling
      - Add helpful comments
      - Follow best practices for the language
      - Explain your approach briefly

      Use only the information available in the provided data and your KNOWLEDGE BASE.
```

You can, of course, adapt these system instructions to your use case, and also persist them in a separate file if you prefer.

#### Dockerfile

It's rather simple:

```dockerfile
FROM node:22.19.0-trixie

WORKDIR /app
COPY package*.json ./
RUN npm install
COPY *.js .

# Create non-root user
RUN groupadd --gid 1001 nodejs && \
    useradd --uid 1001 --gid nodejs --shell /bin/bash --create-home bob-loves-js

# Change ownership of the app directory
RUN chown -R bob-loves-js:nodejs /app

# Switch to non-root user
USER bob-loves-js
```

Now that the configuration is in place, let's move on to the agent's source code.

### Golang expert agent source code, a bit of LangChainJS with RAG

The JavaScript code is rather simple (probably improvable, but functional) and follows these main steps:

**1. Initial configuration**

- Connection to both models (chat and embeddings) via LangChainJS
- Loading parameters from environment variables

**2. Vector database creation (at startup)**

- Reading the `snippets.md` file
- Splitting it into sections (chunks)
- Generating an embedding for each section
- Storing everything in an in-memory vector database
**3. Interactive conversation loop**

- The user asks a question
- An embedding of the question is created
- A similarity search in the vector database finds the most relevant snippets
- The final prompt is constructed from: history + system instructions + relevant snippets + question
- The prompt is sent to the LLM and the response is displayed as a stream
- The history is updated (limited to the last N messages)

```javascript
import { ChatOpenAI } from "@langchain/openai";
import { OpenAIEmbeddings } from "@langchain/openai";

import { splitMarkdownBySections } from "./chunks.js";
import { VectorRecord, MemoryVectorStore } from "./rag.js";

import prompts from "prompts";
import fs from "fs";

// Define [CHAT MODEL] Connection
const chatModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CHAT || `ai/qwen2.5:latest`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: parseFloat(process.env.OPTION_TEMPERATURE) || 0.0,
  top_p: parseFloat(process.env.OPTION_TOP_P) || 0.5,
  presencePenalty: parseFloat(process.env.OPTION_PRESENCE_PENALTY) || 2.2,
});

// Define [EMBEDDINGS MODEL] Connection
const embeddingsModel = new OpenAIEmbeddings({
  model: process.env.MODEL_RUNNER_LLM_EMBEDDING || "ai/embeddinggemma:latest",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
    apiKey: "",
  },
});

const maxSimilarities = parseInt(process.env.MAX_SIMILARITIES) || 3;
const cosineLimit = parseFloat(process.env.COSINE_LIMIT) || 0.45;

// ----------------------------------------------------------------
//  Create the embeddings and the vector store from the content file
// ----------------------------------------------------------------

console.log("========================================================");
console.log(" Embeddings model:", embeddingsModel.model);
console.log(" Creating embeddings...");
let contentPath = process.env.CONTENT_PATH || "./data";

const store = new MemoryVectorStore();

let contentFromFile = fs.readFileSync(contentPath + "/snippets.md", "utf8");
let chunks = splitMarkdownBySections(contentFromFile);
console.log(" Number of documents read from file:", chunks.length);

// -------------------------------------------------
// Create and save the embeddings in the memory vector store
// -------------------------------------------------
console.log(" Creating the embeddings...");

for (const chunk of chunks) {
  try {
    // EMBEDDING COMPLETION:
    const chunkEmbedding = await embeddingsModel.embedQuery(chunk);
    const vectorRecord = new VectorRecord("", chunk, chunkEmbedding);
    store.save(vectorRecord);
  } catch (error) {
    console.error(`Error processing chunk:`, error);
  }
}

console.log(" Embeddings created, total of records", store.records.size);
console.log();

console.log("========================================================");

// Load the system instructions from a file
let systemInstructions = fs.readFileSync("/app/system.instructions.md", "utf8");

// ----------------------------------------------------------------
// HISTORY: Initialize a Map to store
// conversations by session
// ----------------------------------------------------------------
const conversationMemory = new Map();

let exit = false;

// CHAT LOOP:
while (!exit) {
  const { userMessage } = await prompts({
    type: "text",
    name: "userMessage",
    message: `Your question (${chatModel.model}): `,
    validate: (value) => (value ? true : "Question cannot be empty"),
  });

  if (userMessage == "/bye") {
    console.log(" See you later!");
    exit = true;
    continue;
  }

  // HISTORY: Get the conversation history for this session
  const history = getConversationHistory("default-session-id");

  // ----------------------------------------------------------------
  // SIMILARITY SEARCH:
  // ----------------------------------------------------------------
  // -------------------------------------------------
  // Create embedding from the user question
  // -------------------------------------------------
  const userQuestionEmbedding = await embeddingsModel.embedQuery(userMessage);

  // -------------------------------------------------
  // Use the vector store to find similar chunks
  // -------------------------------------------------
  // Create a vector record from the user embedding
  const embeddingFromUserQuestion = new VectorRecord("", "", userQuestionEmbedding);

  const similarities = store.searchTopNSimilarities(embeddingFromUserQuestion, cosineLimit, maxSimilarities);

  let knowledgeBase = "KNOWLEDGE BASE:\n";

  for (const similarity of similarities) {
    console.log(" CosineSimilarity:", similarity.cosineSimilarity, "Chunk:", similarity.prompt);
    knowledgeBase += `${similarity.prompt}\n`;
  }

  console.log("\n Similarities found, total of records", similarities.length);
  console.log();
  console.log("========================================================");
  console.log();

  // -------------------------------------------------
  // Generate CHAT COMPLETION:
  // -------------------------------------------------

  // MESSAGES == PROMPT CONSTRUCTION:
  let messages = [
    ...history,
    ["system", systemInstructions],
    ["system", knowledgeBase],
    ["user", userMessage],
  ];

  let assistantResponse = "";
  // STREAMING COMPLETION:
  const stream = await chatModel.stream(messages);
  for await (const chunk of stream) {
    assistantResponse += chunk.content;
    process.stdout.write(chunk.content);
  }
  console.log("\n");

  // HISTORY: Add both user message and assistant response to history
  addToHistory("default-session-id", "user", userMessage);
  addToHistory("default-session-id", "assistant", assistantResponse);
}

// Helper function to get or create a conversation history
function getConversationHistory(sessionId, maxTurns = parseInt(process.env.HISTORY_MESSAGES)) {
  if (!conversationMemory.has(sessionId)) {
    conversationMemory.set(sessionId, []);
  }
  return conversationMemory.get(sessionId);
}

// Helper function to add a message to the conversation history
function addToHistory(sessionId, role, content) {
  const history = getConversationHistory(sessionId);
  history.push([role, content]);

  // Keep only the last maxTurns conversations
  const maxTurns = parseInt(process.env.HISTORY_MESSAGES); // Adjust this value based on your needs
  if (history.length > maxTurns * 2) { // *2 because each turn has a user & an assistant message
    history.splice(0, 2); // Remove oldest turn (user + assistant messages)
  }
}
```

You can find it [here](https://codeberg.org/k33g-blog/docker-posts/src/branch/main/2026-01-12-make-tlm-smarter-js/02-streaming-completion-and-rag).

## All that's left is to launch to verify my hypotheses

In the project folder, run the following command:

```bash
docker compose up --build --no-log-prefix -d
```

Then connect to the container and launch the application:

```bash
docker compose exec golang-expert /bin/bash
node index.js
```

### Nova Chat Agent with streaming completion

Once the application has started, it's time to ask our Golang expert agent our question about developing with the **Nova** library:

```text
========================================================
 Embeddings model: ai/embeddinggemma:latest
 Creating embeddings...
 Number of documents read from file: 4
 Creating the embeddings...
 Embeddings created, total of records 4

========================================================
?  Your question (hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m):  › I need a code snippet of a Golang Nova Chat agent using a stream completion
```

After a quick similarity search, our agent found two relevant snippets in the vector database, and it was able to provide me with complete, functional, and commented code to answer my request:

````text
Certainly! Below is a complete, working Go code snippet for a Nova Chat agent that uses streaming completion.
This example demonstrates how to set up an agent with the necessary configurations and generate a stream completion response.

```go
package main

import (
    "context"
    "fmt"

    "github.com/snipwise/nova/nova-sdk/agents"
    "github.com/snipwise/nova/nova-sdk/agents/chat"
    "github.com/snipwise/nova/nova-sdk/messages"
    "github.com/snipwise/nova/nova-sdk/messages/roles"
    "github.com/snipwise/nova/nova-sdk/models"
)

func main() {

    ctx := context.Background()

    // Create a new chat agent with streaming capabilities
    agent, err := chat.NewAgent(
        ctx,
        agents.Config{
            EngineURL:               "http://localhost:12434/engines/llama.cpp/v1",
            SystemInstructions:      "You are Bob, a helpful AI assistant.",
            KeepConversationHistory: true,
        },
        models.Config{
            Name:        "ai/qwen2.5:1.5B-F16",
            Temperature: models.Float64(0.8),
        },
    )
    if err != nil {
        panic(err)
    }

    // Generate a stream completion response
    result, err := agent.GenerateStreamCompletion(
        []messages.Message{
            {Role: roles.User, Content: "Who is James T Kirk?"},
        },
        func(chunk string, finishReason string) error {
            // Simple callback that receives strings only
            if chunk != "" {
                fmt.Print(chunk)
            }
            if finishReason == "stop" {
                fmt.Println()
            }
            return nil
        },
    )
    if err != nil {
        panic(err)
    }

    // Print the final response and finish reason
    fmt.Println("Final Response:", result.Response)
    fmt.Println("Finish Reason:", result.FinishReason)
}
```
````

```text
### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with chat agents.

2. **Context**: A context is created using `context.Background()` to manage the lifecycle of requests.

3. **Agent Creation**:
   - An agent is created using `chat.NewAgent` with specified configurations:
     - `EngineURL`: The URL of the engine to use for generating text.
     - `SystemInstructions`: Instructions for the AI assistant, such as who Bob is.
     - `KeepConversationHistory`: Whether to keep a history of previous messages.

4. **Model Configuration**:
   - A model configuration is provided with:
     - `Name`: The name of the model to use.
     - `Temperature`: Controls the randomness of the output (higher values make responses more random).
     - `MaxTokens`: Limits the maximum number of tokens in the response.

5. **Stream Completion Generation**:
   - The `GenerateStreamCompletion` method is called with a list of messages and a callback function that processes each chunk of text as it arrives.
   - The callback prints chunks to the console until the finish reason is "stop", indicating the completion has been reached.
6. **Output**:
   - The final response and finish reason are printed after the stream completes.

This code snippet demonstrates how to set up a streaming chat agent using Nova, allowing you to receive text responses incrementally as they become available.
```

I then asked other questions.

### Nova Structured Agent

This time, I asked for a structured agent: **"I need a code snippet of a Nova Structured Agent."**

And the result was conclusive.

**Extract from the response:**

````text
Certainly! Below is a complete, working example of a Nova Structured Agent in Go. This agent will generate structured data about a specified country based on the user's input.

```go
package main

import (
    "context"
    "fmt"
    "strings"

    "github.com/snipwise/nova/nova-sdk/agents"
    "github.com/snipwise/nova/nova-sdk/agents/structured"
    "github.com/snipwise/nova/nova-sdk/messages"
    "github.com/snipwise/nova/nova-sdk/messages/roles"
    "github.com/snipwise/nova/nova-sdk/models"
)

type Country struct {
    Name       string   `json:"name"`
    Capital    string   `json:"capital"`
    Population int      `json:"population"`
    Languages  []string `json:"languages"`
}

func main() {
    ctx := context.Background()

    // Create a new Structured Agent
    agent, err := structured.NewAgent[Country](
        ctx,
        agents.Config{
            EngineURL: "http://localhost:12434/engines/llama.cpp/v1",
            SystemInstructions: `
                Your name is Bob.
                You are an assistant that answers questions about countries around the world.
            `,
        },
        models.Config{
            Name:        "hf.co/menlo/jan-nano-gguf:q4_k_m",
            Temperature: models.Float64(0.0),
        },
    )
    if err != nil {
        panic(err)
    }

    // Generate structured data about Canada (finish reason ignored here)
    response, _, err := agent.GenerateStructuredData([]messages.Message{
        {Role: roles.User, Content: "Tell me about Canada."},
    })

    if err != nil {
        panic(err)
    }

    // Print the generated country information
    fmt.Println("Country Name:", response.Name)
    fmt.Println("Capital:", response.Capital)
    fmt.Println("Population:", response.Population)
    fmt.Println("Languages:", strings.Join(response.Languages, ", "))
}
```
````

```text
### Explanation:

1. **Imports**: The code imports necessary packages from the Nova SDK to interact with agents and models.

2. **Country Struct**: A simple struct `Country` is defined to represent the data structure expected by the agent.

3. **Main Function**:
   - A context is created using `context.Background()`.
   - A new Structured Agent is created for generating structured data about countries.
   - The system instructions are set to provide a friendly greeting and instruction on answering questions about countries.
   - The model configuration specifies the engine URL, temperature, and other parameters.

4. **GenerateStructuredData Method**: This method generates structured data based on the user's input. It takes an array of messages as input, where each message has a role (e.g., `User`) and content (e.g., "Tell me about Canada").

5. **Error Handling**: The code includes error handling to ensure that any issues during agent creation or data generation are properly caught and reported.

6. **Output**: The generated country information is printed, including the name, capital, population, and languages.

This example demonstrates how to use a Nova Structured Agent to generate structured data about countries based on user input. You can modify the system instructions and model configuration as needed for your specific use case.
```

Then, I moved on to the next question.

### Nova RAG Agent with vector store: be careful with your similarity settings!

This time, I asked for a "RAG" agent: **"I need a snippet of a Nova RAG agent with a vector store."**

And once again, I got a relevant response.

However, when I tried this question (after restarting the agent to start from a clean base, without conversation history): **"I need a snippet of a Nova RAG agent."**

The **similarity search returned no relevant results** (because, without the words "vector store", the question no longer matched the snippets well enough). The agent responded with generic code that had nothing to do with Nova, or reused code from the Nova Chat Agent snippets.

There are several possible reasons:

- The embedding model is not suitable for my use case,
- The embedding model is not precise enough,
- The splitting of the code snippets file is not optimal (you can add metadata to chunks to improve the similarity search, for example, but don't forget that chunks must not exceed the maximum size the embedding model can ingest).

In that case, there's a simple solution that works quite well: **lower the similarity threshold and/or increase the number of returned similarities**. This gives you more results to construct the user prompt with, but be careful not to exceed the maximum context size of the language model. You can also run tests with other "bigger" LLMs (more parameters and/or a larger context window).
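
With the Compose configuration above, this tuning is just a change to two environment variables of the `golang-expert` service (the values below are illustrative and should be adjusted to your use case):

```yaml
services:
  golang-expert:
    environment:
      HISTORY_MESSAGES: 2
      MAX_SIMILARITIES: 5   # return more candidate snippets (was 3)
      COSINE_LIMIT: 0.30    # accept weaker matches (was 0.45)
```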

In the latest version of the snippets file, I added a `KEYWORDS: ...` line below the markdown titles to help the similarity search, which greatly improved the results obtained.
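
For illustration, a snippet section with such a line could look like this (hypothetical content; the real file is linked above):

````markdown
## Nova RAG Agent with vector store
KEYWORDS: rag, retrieval, vector store, embeddings, similarity search

Creates a RAG agent backed by an in-memory vector store.

```golang
// snippet code goes here
```
````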

## Conclusion

Using "Small Language Models" (SLM) or "Tiny Language Models" (TLM) requires a bit of energy and thought to work around their limitations. But it's possible to build effective solutions for very specific problems. Once again, always think about the context size of the chat model and about how you'll structure the information for the embedding model. And by combining several specialized "small agents", you can achieve very interesting results. This will be the subject of future articles.

Learn more:

- Check out [Docker Model Runner](https://www.docker.com/products/model-runner/)
- Learn more about [Docker Agentic Compose](https://www.docker.com/solutions/docker-ai/)
- Read more about embeddings in our recent blog post [Run Embedding Models and Unlock Semantic Search with Docker Model Runner](https://www.docker.com/blog/run-embedding-models-for-semantic-search/)
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[4],"tags":[],"class_list":["post-3259","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-docker"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3259","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=3259"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/3259\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media\/3260"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=3259"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=3259"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=3259"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}