{"id":2185,"date":"2025-06-30T14:17:04","date_gmt":"2025-06-30T14:17:04","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/30\/tool-calling-with-local-llms-a-practical-evaluation\/"},"modified":"2025-06-30T14:17:04","modified_gmt":"2025-06-30T14:17:04","slug":"tool-calling-with-local-llms-a-practical-evaluation","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/30\/tool-calling-with-local-llms-a-practical-evaluation\/","title":{"rendered":"Tool Calling with Local LLMs: A Practical Evaluation"},"content":{"rendered":"<h2 class=\"wp-block-heading\">Which local model should I use for tool calling?<\/h2>\n<p>When building GenAI and agentic applications, one of the most pressing and persistent questions is: <em>\u201cWhich local model should I use for tool calling?\u201d<\/em>\u00a0 We kept hearing again and again, from colleagues within Docker and the developer community, ever since we started working on <a href=\"https:\/\/www.docker.com\/blog\/introducing-docker-model-runner\/\"><strong>Docker Model Runner<\/strong><\/a>, a local inference engine that helps developers run and experiment with local models.\u00a0<\/p>\n\n<p>It\u2019s a deceptively simple question with a surprisingly nuanced answer. Even when we tried to answer it for a very specific case: <em>\u201cWhat if I just expose 5 simple tools to the model?\u201d<\/em><br \/>We realized we had no definite answer for<strong> <\/strong>that. <a href=\"https:\/\/www.docker.com\/blog\/run-llms-locally\/\">Local LLM models<\/a> offer control, cost-efficiency, and privacy, but when it comes to structured tool use, deciding when and how to act, they can behave very differently. We decided to dig deep and test this properly. We started with manual experimentation, then built a framework to scale our testing. 
This blog documents that journey and shares which models ranked highest on our tool-calling leaderboard.</p>

<h2 class="wp-block-heading">The first attempt: Manual testing</h2>
<p>Our first instinct was to build something quickly and try it out manually.</p>
<p>So we created <a href="https://github.com/ilopezluna/chat2cart" target="_blank"><strong>chat2cart</strong></a>, an AI-powered shopping assistant that lets users interact via chat to build, modify, and check out a shopping cart. Through a natural conversation, users can discover products, add or remove items, and complete or cancel their purchase, all from the chat interface.</p>
<p>To support testing across different LLMs, we added a model selector that makes it easy to switch between local models (via Docker Model Runner or Ollama) and hosted models using the OpenAI API.</p>
<p>OpenAI’s GPT-4 and GPT-3.5 worked as expected, and the experience was fairly smooth. They:</p>
<ul>
<li>Called tools when they were needed</li>
<li>Avoided unnecessary tool usage</li>
<li>Handled tool responses naturally</li>
</ul>
<p>But the local models? That’s where the challenges started to surface.</p>

<h2 class="wp-block-heading">What went wrong with local models</h2>
<p>We started experimenting with some of the local models listed on the <a href="https://huggingface.co/spaces/gorilla-llm/berkeley-function-calling-leaderboard" target="_blank">Berkeley Function-Calling Leaderboard</a>. Our goal was to find smaller models, ideally with fewer than 10 billion parameters, so we tested xLAM-2-8b-fc-r and watt-tool-8B.
We quickly ran into several recurring issues:</p>
<ul>
<li><strong>Eager invocation</strong>: Tools were called even for greeting messages like “Hi there!”</li>
<li><strong>Wrong tool selection</strong>: The model would search when it should have added, or try to remove items when the cart was empty</li>
<li><strong>Invalid arguments</strong>: Parameters like product_name or quantity were missing or malformed</li>
<li><strong>Ignored responses</strong>: The model often failed to respond to tool output, leading to awkward or incomplete conversations</li>
</ul>
<p>At this point, it was clear that manual testing wouldn’t scale. Different models failed in different ways: some struggled with invocation logic, while others mishandled tool arguments or responses. Testing was not only slow but also unreliable. Because these models are non-deterministic, we had to run each scenario multiple times just to get a reliable read on behavior.</p>
<p>We needed a testing setup that was repeatable, measurable, and fast.</p>

<h2 class="wp-block-heading">Our second attempt: A scalable testing tool</h2>
<p>Our goal wasn’t academic rigor. It was: <em>“Give us good-enough answers in 2–3 days, not weeks.”</em></p>
<p>In a couple of days, we created <a href="https://github.com/docker/model-test" target="_blank"><strong>model-test</strong></a>, a flexible project with the following capabilities:</p>
<ul>
<li>Define real-world <strong>test cases</strong> with multiple valid tool-call sequences</li>
<li>Run them against many models (local &amp; hosted)</li>
<li>Track <strong>tool-calling accuracy</strong>, <strong>tool selection</strong>, and <strong>latency</strong></li>
<li>Log <strong>everything</strong> for analysis (or eventual fine-tuning)</li>
</ul>
<h3 class="wp-block-heading">How it works</h3>
<p>The core idea behind model-test is simple: simulate realistic tool-using conversations, give the model room to
reason and act, and check whether its behavior makes sense.</p>
<p>Each test case includes:</p>
<ul>
<li>A <strong>prompt</strong> (e.g. “Add iPhone to cart”)</li>
<li>The <strong>initial cart state</strong> (optional)</li>
<li>One or more <strong>valid tool-call variants</strong>, because there’s often more than one right answer</li>
</ul>
<p>Here’s a typical case:</p>
<div class="wp-block-syntaxhighlighter-code"><pre>
{
  "prompt": "Add iPhone to cart",
  "expected_tools_variants": [
    {
      "name": "direct_add",
      "tools": [{ "name": "add_to_cart", "arguments": { "product_name": "iPhone" } }]
    },
    {
      "name": "search_then_add",
      "tools": [
        { "name": "search_products", "arguments": { "query": "iPhone" } },
        { "name": "add_to_cart", "arguments": { "product_name": "iPhone 15" } }
      ]
    }
  ]
}
</pre></div>

<p>In this case, we consider both <strong>“just add ‘iPhone’”</strong> and <strong>“search first, then add the result”</strong> as acceptable. Even though “iPhone” isn’t a real product name, we’re fine with it.
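</p>
<p>To make the acceptance rule concrete, here is a minimal sketch of the idea, checking a recorded tool-call sequence against the accepted variants. The function and field names mirror the test-case JSON above; this is our illustration, not the actual model-test implementation:</p>

```python
# Hypothetical matcher illustrating the "multiple valid variants" idea.
# Field names follow the test-case JSON above, not a real model-test API.
def matches_any_variant(actual_calls, expected_variants):
    """True if the model's tool-call sequence exactly matches any accepted variant."""
    observed = [(c["name"], c["arguments"]) for c in actual_calls]
    for variant in expected_variants:
        expected = [(t["name"], t["arguments"]) for t in variant["tools"]]
        if observed == expected:
            return True
    return False

variants = [
    {"name": "direct_add",
     "tools": [{"name": "add_to_cart", "arguments": {"product_name": "iPhone"}}]},
    {"name": "search_then_add",
     "tools": [{"name": "search_products", "arguments": {"query": "iPhone"}},
               {"name": "add_to_cart", "arguments": {"product_name": "iPhone 15"}}]},
]

# The model chose the "direct add" path: accepted.
print(matches_any_variant(
    [{"name": "add_to_cart", "arguments": {"product_name": "iPhone"}}], variants))  # True
```

<p>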
We weren’t aiming for overly strict precision, just realistic behavior.</p>
<p>Each test case belongs to a test suite. We provide two built-in suites:</p>
<ul>
<li><strong>Simple</strong>: Greetings, single-step actions</li>
<li><strong>Complex</strong>: Multi-step reasoning and tool chaining</li>
</ul>
<p>You can run an entire suite, individual test cases, or a selection of multiple test cases. You can also create your own custom suites to group tests as needed.</p>
<h3 class="wp-block-heading">The agent loop</h3>
<p>To make tests feel closer to how real agents behave, we simulate an agent loop of up to <strong>5 rounds</strong>.</p>
<p>Example:</p>
<ul>
<li>User: <em>“Add iPhone 5 to cart”</em></li>
<li>Model: <em>“Let me search for iPhone 5…”</em></li>
<li>Tool: <em>(returns product list)</em></li>
<li>Model: <em>“Adding product X to cart…”</em></li>
<li>Tool: <em>(updates cart)</em></li>
<li>Model: <em>“Done”</em> → Great, test passed!</li>
</ul>
<p>But if the model still wants to keep going after round 5?</p>
<p>That’s it, my friend: <strong>test failed</strong>.
Time’s up.</p>

<h3 class="wp-block-heading">Not all-or-nothing</h3>
<p>We deliberately avoided designing tests that require perfect predictions.</p>
<p>We didn’t demand that the model always know the exact product name.</p>
<p>What mattered was: <strong>did the tool sequence make sense</strong> for the intent?</p>
<p>This helped us focus on the kind of reasoning and behavior we actually want in agents, not just perfect token matches.</p>

<h2 class="wp-block-heading">What we measured</h2>
<p>Our test outputs distilled down to a final F1 score, encapsulating three core dimensions:</p>
<div class="wp-block-ponyo-table style__default">
<table>
<thead><tr><th>Metric</th><th>What it tells us</th></tr></thead>
<tbody>
<tr><td><strong>Tool invocation</strong></td><td>Did the model realize a tool was needed?</td></tr>
<tr><td><strong>Tool selection</strong></td><td>Did it choose the right tool(s) and use them correctly?</td></tr>
<tr><td><strong>Parameter accuracy</strong></td><td>Were the tool-call arguments correct?</td></tr>
</tbody>
</table>
</div>
<p>The F1 score is the harmonic mean of two things: precision (how often the model’s tool calls were valid) and recall (how often it made the tool calls it was supposed to).</p>
<p>We also tracked latency, the average runtime in seconds, but that wasn’t part of the F1 calculation; it simply helped us evaluate speed and user experience.</p>
<h2 class="wp-block-heading">21 models and 3,570 tests later: Which models nailed tool calling?</h2>
<p>We tested 21 models across <strong>3,570 test cases</strong> using 210 batch runs.</p>
<p><strong>Hardware</strong>: MacBook Pro M4 Max, 128GB RAM<br /><strong>Runner</strong>: <a href="https://github.com/docker/model-test/blob/main/test-all-models.sh" target="_blank">test-all-models.sh</a></p>

<h3 class="wp-block-heading">Overall rankings (by tool selection F1)</h3>
<div class="wp-block-ponyo-table style__default">
<table>
<thead><tr><th>Model</th><th>F1 Score</th></tr></thead>
<tbody>
<tr><td>gpt-4</td><td>0.974</td></tr>
<tr><td>qwen3:14B-Q4_K_M</td><td>0.971</td></tr>
<tr><td>qwen3:14B-Q6_K</td><td>0.943</td></tr>
<tr><td>claude-3-haiku-20240307</td><td>0.933</td></tr>
<tr><td>qwen3:8B-F16</td><td>0.933</td></tr>
<tr><td>qwen3:8B-Q4_K_M</td><td>0.919</td></tr>
<tr><td>gpt-3.5-turbo</td><td>0.899</td></tr>
<tr><td>gpt-4o</td><td>0.857</td></tr>
<tr><td>gpt-4o-mini</td><td>0.852</td></tr>
<tr><td>claude-3-5-sonnet-20241022</td><td>0.851</td></tr>
<tr><td>llama3.1:8B-F16</td><td>0.835</td></tr>
<tr><td>qwen2.5:14B-Q4_K_M</td><td>0.812</td></tr>
<tr><td>claude-3-opus-20240229</td><td>0.794</td></tr>
<tr><td>llama3.1:8B-Q4_K_M</td><td>0.793</td></tr>
<tr><td>qwen2.5:7B-Q4_K_M</td><td>0.753</td></tr>
<tr><td>gemma3:4B</td><td>0.733</td></tr>
<tr><td>llama3.2:3B-F16</td><td>0.727</td></tr>
<tr><td>llama3-groq:7B-Q4_K_M</td><td>0.723</td></tr>
<tr><td>llama3.3:70B-Q4_K_M</td><td>0.607</td></tr>
<tr><td>llama-xlam:8B-Q4_K_M</td><td>0.570</td></tr>
<tr><td>watt-tool:8B-Q4_K_M</td><td>0.484</td></tr>
</tbody>
</table>
</div>
<h3 class="wp-block-heading">Top performers</h3>
<p>Among all models, OpenAI’s GPT-4 came out on top with a tool selection F1 score of 0.974, completing responses in just under 5 seconds on average. While hosted and not the focus of our local model exploration, it served as a reliable benchmark and provided a useful ground truth.</p>
<p>On the local side, Qwen 3 (14B) delivered outstanding results, nearly matching GPT-4 with a 0.971 F1 score, though with significantly higher latency (~142 seconds per interaction).</p>
<p>If you’re looking for something faster, Qwen 3 (8B) also achieved an F1 score of 0.933 while cutting latency nearly in half (~84 seconds), making it a compelling balance between speed and tool-use accuracy.</p>
<p>Hosted models like Claude 3 Haiku also performed very well, hitting 0.933 F1 with exceptional speed (3.56 seconds average latency), further illustrating the high bar set by cloud-based offerings.</p>

<h3 class="wp-block-heading">Underperformers</h3>
<p>Not all models handled tool calling well.
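</p>
<p>As a reminder of how the headline number above is computed, here is a toy illustration of the F1 definition over tool calls (the counts are made up for the example, not taken from our results):</p>

```python
# F1 = harmonic mean of precision and recall over tool calls (toy numbers).
def f1_score(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)  # valid calls / all calls made
    recall = true_positives / (true_positives + false_negatives)     # valid calls / calls expected
    return 2 * precision * recall / (precision + recall)

# e.g. 9 correct tool calls, 1 spurious call, 1 missed call:
print(round(f1_score(9, 1, 1), 3))  # 0.9
```

<p>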
The quantized Watt 8B model struggled with parameter accuracy and ended up with a tool selection F1 score of just 0.484. Similarly, the LLaMA-based xLAM 8B variant often missed the correct tool path altogether, finishing with an F1 score of 0.570. These models may be suitable for other tasks, but for our structured tool-use tests, they underdelivered.</p>

<h3 class="wp-block-heading">Quantization</h3>
<p>We also experimented with both <strong>quantized</strong> and <strong>non-quantized</strong> variants of some models, and in all cases observed <strong>no significant difference</strong> in tool-calling behavior or performance. This suggests that quantization reduces resource usage without negatively impacting accuracy or reasoning quality, at least for the models and scenarios we tested.</p>

<h3 class="wp-block-heading">Our recommendations</h3>
<p>If your goal is maximum tool-calling accuracy, Qwen 3 (14B) or Qwen 3 (8B) are your best bets: both local, both precise, with the 8B variant being notably faster.</p>
<p>For a good trade-off between speed and performance, Qwen 2.5 stood out as a solid option. It’s fast enough to support real-time experiences while still maintaining decent tool selection accuracy.</p>
<p>If you need something more lightweight, especially for resource-constrained environments, the <a href="https://groq.com/introducing-llama-3-groq-tool-use-models/" target="_blank">LLaMA 3 Groq 7B</a> variant offers modest performance at a much lower compute footprint.</p>

<h2 class="wp-block-heading">What we learned and why this matters</h2>
<p>Our testing confirmed that the Qwen family of models leads the pack among open-source options for tool calling.
But as always, there’s a trade-off: you’ll need to balance accuracy and latency when designing your application.</p>
<ul>
<li><strong>Qwen models dominate</strong>: Even the 8B version of Qwen 3 outperformed every other local model</li>
<li><strong>Reasoning = latency</strong>: Higher-accuracy models take longer, often significantly</li>
</ul>
<p>Tool calling is core to almost every real-world GenAI application. Whether you’re building agents or creating agentic workflows, your LLM must know when to act and how. Thanks to this simple framework, “We don’t know which model to pick” became “We’ve narrowed it down to three great options, each with clear pros and cons.”</p>
<p>If you’re evaluating <a href="https://hub.docker.com/catalogs/models" target="_blank">models for your agentic applications</a>, skip the guesswork. Try <a href="https://github.com/docker/model-test" target="_blank">model-test</a> and make it your own!</p>

<h2 class="wp-block-heading">Learn more</h2>
<p>Get an inside look at the <a href="https://www.docker.com/blog/how-we-designed-model-runner-and-whats-next/">design architecture of Docker Model Runner</a>.</p>
<p>Explore the <a href="https://www.docker.com/blog/oci-artifacts-for-ai-model-packaging/">story</a> behind our model distribution specification.</p>
<p>Read our quickstart guide to <a href="https://www.docker.com/blog/run-llms-locally/">Docker Model Runner</a>.</p>
<p>Find documentation for <a href="https://docs.docker.com/model-runner/" target="_blank">Model Runner</a>.</p>
<p>Subscribe to the <a href="https://www.docker.com/newsletter-subscription/">Docker Navigator Newsletter</a>.</p>
<p>New to Docker? <a 
href=\"https:\/\/hub.docker.com\/signup?_gl=1*1v81gq1*_gcl_au*MTQxNjU3MjYxNS4xNzQyMjI1MTk2*_ga*MTMxODI0ODQ4LjE3NDE4MTI3NTA.*_ga_XJWPQMJYHQ*czE3NDg0NTYyNzIkbzI2JGcxJHQxNzQ4NDU2MzI2JGo2JGwwJGgw\" target=\"_blank\"> Create an account<\/a>.\u00a0<\/p>\n<p>Have questions? The<a href=\"https:\/\/www.docker.com\/community\/\"> Docker community is here to help<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Which local model should I use for tool calling? When building GenAI and agentic applications, one of the most pressing [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[4],"tags":[],"class_list":["post-2185","post","type-post","status-publish","format-standard","hentry","category-docker"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2185","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=2185"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2185\/revisions"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=2185"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=2185"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=2185"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}