{"id":2141,"date":"2025-06-17T20:15:20","date_gmt":"2025-06-17T20:15:20","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/17\/multimodal-vision-intelligence-with-net-maui\/"},"modified":"2025-06-17T20:15:20","modified_gmt":"2025-06-17T20:15:20","slug":"multimodal-vision-intelligence-with-net-maui","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/17\/multimodal-vision-intelligence-with-net-maui\/","title":{"rendered":"Multimodal Vision Intelligence with .NET MAUI"},"content":{"rendered":"<p>Expanding the many ways in which users can interact with our apps is one of the most exciting parts of working with modern AI models and device capabilities. With .NET MAUI, it\u2019s easy to enhance your app from a text-based experience to one that supports <strong>voice<\/strong>, <strong>vision<\/strong>, and more.<\/p>\n<p>Previously I covered adding <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/multimodal-voice-intelligence-with-dotnet-maui\"><strong>voice<\/strong> support<\/a> to the \u201cto do\u201d app from <a href=\"https:\/\/www.youtube.com\/watch?v=tFOFU7LDQlA\">our Microsoft Build 2025 session<\/a>. Now I\u2019ll review the <strong>vision<\/strong> side of multimodal intelligence. I want to let users capture or select an image and have AI extract actionable information from it to create a project and tasks in the <a href=\"https:\/\/github.com\/davidortinau\/telepathy\">Telepathic<\/a> sample app. This goes well beyond OCR scanning: an AI agent uses context and prompting to produce meaningful input.<\/p>\n\n<h2>See what I see<\/h2>\n<p>From the floating action button menu on MainPage, the user selects the camera button, which immediately transitions to the PhotoPage, where MediaPicker takes over. MediaPicker provides a single cross-platform API for working with the photo gallery, picking media, and taking photos. 
It was recently modernized in .NET 10 Preview 4.<\/p>\n<p>The PhotoPageModel handles both photo capture and file picking, starting from the PageAppearing lifecycle event that I\u2019ve easily tapped into using the EventToCommandBehavior <a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/maui\/behaviors\/event-to-command-behavior\">from the Community Toolkit for .NET MAUI<\/a>.<\/p>\n<p>&lt;ContentPage.Behaviors&gt;<br \/>\n    &lt;toolkit:EventToCommandBehavior<br \/>\n        EventName=&quot;Appearing&quot;<br \/>\n        Command=&quot;{Binding PageAppearingCommand}&quot;\/&gt;<br \/>\n&lt;\/ContentPage.Behaviors&gt;<\/p>\n<p>The PageAppearing method is decorated with [RelayCommand] which generates a command thanks to the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/mvvm\/generators\/relaycommand\">Community Toolkit for MVVM<\/a> (yes, toolkits are a recurring theme of adoration that you\u2019ll hear from me). I then check for the type of device being used and choose to pick or take a photo. .NET MAUI\u2019s cross-platform APIs for DeviceInfo and MediaPicker save me a ton of time navigating through platform-specific idiosyncrasies.<\/p>\n<p>if (DeviceInfo.Idiom == DeviceIdiom.Desktop)<br \/>\n{<br \/>\n    result = await MediaPicker.PickPhotoAsync(new MediaPickerOptions<br \/>\n    {<br \/>\n        Title = &quot;Select a photo&quot;<br \/>\n    });<br \/>\n}<br \/>\nelse<br \/>\n{<br \/>\n    if (!MediaPicker.IsCaptureSupported)<br \/>\n    {<br \/>\n        return;<br \/>\n    }<\/p>\n<p>    result = await MediaPicker.CapturePhotoAsync(new MediaPickerOptions<br \/>\n    {<br \/>\n        Title = &quot;Take a photo&quot;<br \/>\n    });<br \/>\n}<\/p>\n<p>Another advantage of using the built-in MediaPicker is giving users the native experience for photo input they are already accustomed to. 
When you\u2019re implementing this, be sure to perform the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/maui\/platform-integration\/device-media\/picker\">necessary platform-specific setup as documented<\/a>.<\/p>\n<h2>Processing the image<\/h2>\n<p>Once an image is received, it\u2019s displayed on screen along with an optional Editor field to capture any additional context and instructions the user might want to provide. I build the prompt with StringBuilder (in other apps I like to use Scriban templates), grab an instance of Microsoft.Extensions.AI\u2019s <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.ichatclient\">IChatClient<\/a> from a service, get the image bytes, and supply everything to the chat client using a <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.chatmessage\">ChatMessage<\/a> that packs <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.textcontent\">TextContent<\/a> and <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/microsoft.extensions.ai.datacontent\">DataContent<\/a>.<\/p>\n<p>private async Task ExtractTasksFromImageAsync()<br \/>\n{<br \/>\n    \/\/ more code<\/p>\n<p>    var prompt = new System.Text.StringBuilder();<br \/>\n    prompt.AppendLine(&quot;# Image Analysis Task&quot;);<br \/>\n    prompt.AppendLine(&quot;Analyze the image for task lists, to-do items, notes, or any content that could be organized into projects and tasks.&quot;);<br \/>\n    prompt.AppendLine();<br \/>\n    prompt.AppendLine(&quot;## Instructions:&quot;);<br \/>\n    prompt.AppendLine(&quot;1. Identify any projects and tasks (to-do items) visible in the image&quot;);<br \/>\n    prompt.AppendLine(&quot;2. Format handwritten text, screenshots, or photos of physical notes into structured data&quot;);<br \/>\n    prompt.AppendLine(&quot;3. 
Group related tasks into projects when appropriate&quot;);<\/p>\n<p>    if (!string.IsNullOrEmpty(AnalysisInstructions))<br \/>\n    {<br \/>\n        prompt.AppendLine($&quot;4. {AnalysisInstructions}&quot;);<br \/>\n    }<br \/>\n    prompt.AppendLine();<br \/>\n    prompt.AppendLine(&quot;If no projects\/tasks are found, return an empty projects array.&quot;);<\/p>\n<p>    var client = _chatClientService.GetClient();<br \/>\n    byte[] imageBytes = File.ReadAllBytes(ImagePath);<\/p>\n<p>    var msg = new Microsoft.Extensions.AI.ChatMessage(ChatRole.User,<br \/>\n    [<br \/>\n        new TextContent(prompt.ToString()),<br \/>\n        new DataContent(imageBytes, mediaType: &quot;image\/png&quot;)<br \/>\n    ]);<\/p>\n<p>    var apiResponse = await client.GetResponseAsync&lt;ProjectsJson&gt;(msg);<\/p>\n<p>    if (apiResponse?.Result?.Projects != null)<br \/>\n    {<br \/>\n        Projects = apiResponse.Result.Projects.ToList();<br \/>\n    }<\/p>\n<p>    \/\/ more code<br \/>\n}<\/p>\n<h2>Human-AI Collaboration<\/h2>\n<p>Just like with the voice experience, the photo flow doesn\u2019t blindly assume the agent got everything right. After processing, the user is shown a proposed set of projects and tasks for review and confirmation.<\/p>\n<p>This ensures users remain in control while benefiting from AI-augmented assistance. 
You can learn more about designing these kinds of flows using best practices in the <a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a>.<\/p>\n<h2>Resources<\/h2>\n<p><a href=\"https:\/\/github.com\/davidortinau\/telepathy\">Telepathic App Source Code<\/a><br \/>\n<a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">Microsoft.Extensions.AI<\/a><br \/>\n<a href=\"https:\/\/learn.microsoft.com\/dotnet\/maui\/platform-integration\/device-media\/picker\">MediaPicker Documentation<\/a><br \/>\n<a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a><br \/>\n<a href=\"https:\/\/aka.ms\/RAI\">Microsoft AI Principles<\/a><br \/>\n<a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">AI for .NET Developers<\/a><\/p>\n<h2>Summary<\/h2>\n<p>We\u2019ve now extended our .NET MAUI app to see as well as hear. With just a few lines of code and a clear UX pattern, the app can take in images, analyze them using vision-capable AI models, and return structured, actionable data like tasks and projects.<\/p>\n<p>Multimodal experiences are more accessible and powerful than ever. 
With cross-platform support from .NET MAUI and the modularity of Microsoft.Extensions.AI, you can rapidly evolve your apps to meet your users where they are, whether that\u2019s typing, speaking, or snapping a photo.<\/p>\n<p>The post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/multimodal-vision-intelligence-with-dotnet-maui\/\">Multimodal Vision Intelligence with .NET MAUI<\/a> appeared first on <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\">.NET Blog<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Expanding the many ways in which users can interact with our apps is one of the most exciting parts of [&hellip;]<\/p>\n","protected":false},"author":0,"featured_media":0,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[7],"tags":[],"class_list":["post-2141","post","type-post","status-publish","format-standard","hentry","category-dotnet"],"_links":{"self":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2141","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/comments?post=2141"}],"version-history":[{"count":0,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/posts\/2141\/revisions"}],"wp:attachment":[{"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/media?parent=2141"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/categories?post=2141"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/wp-json\/wp\/v2\/tags?post=2141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}