{"id":2124,"date":"2025-06-11T17:33:30","date_gmt":"2025-06-11T17:33:30","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/11\/multimodal-voice-intelligence-with-net-maui\/"},"modified":"2025-06-11T17:33:30","modified_gmt":"2025-06-11T17:33:30","slug":"multimodal-voice-intelligence-with-net-maui","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2025\/06\/11\/multimodal-voice-intelligence-with-net-maui\/","title":{"rendered":"Multimodal Voice Intelligence with .NET MAUI"},"content":{"rendered":"<p>One of the most interesting ways to enhance your existing applications with AI is to enable more ways for your users to interact with them. Today you probably handle text input, and perhaps some touch gestures for your power users. Now it\u2019s easier than ever to expand that to voice and vision, especially when your users\u2019 primary input is via mobile device.<\/p>\n<p>At <a href=\"https:\/\/youtu.be\/tFOFU7LDQlA?si=wJy9xZtO2kuw1sq9\">Microsoft Build 2025 I demonstrated<\/a> expanding the .NET MAUI sample \u201cto do\u201d app from text input to supporting voice and vision when those capabilities are detected. Let me show you how .NET MAUI and our fantastic ecosystem of plugins make this rather painless, with a single implementation that works across all platforms. Let\u2019s start with voice.<\/p>\n\n<h2>Talk to me<\/h2>\n<p>Being able to talk to an app isn\u2019t anything revolutionary. We\u2019ve all spoken to Siri, Alexa, and our dear Cortana a time or two, and the key is in knowing the keywords and recipes of things they can comprehend and act on. 
\u201cStart a timer\u201d, \u201cturn down the volume\u201d, \u201ctell me a joke\u201d, and everyone\u2019s favorite \u201cI wasn\u2019t talking to you\u201d.<\/p>\n<p>The new and powerful capability we now have with large language models is having them take our unstructured ramblings and make sense of them, fitting them into the structured format our apps expect and require.<\/p>\n<h3>Listening to audio<\/h3>\n<p>The first thing to do is add the Plugin.Maui.Audio NuGet package, which helps us request permission to use the microphone and start capturing a stream. The plugin is also capable of playback.<\/p>\n<pre><code>dotnet add package Plugin.Maui.Audio --version 4.0.0<\/code><\/pre>\n<p>In MauiProgram.cs, configure the recording settings and add the IAudioService from the plugin to the services container.<\/p>\n<pre><code>public static class MauiProgram\n{\n    public static MauiApp CreateMauiApp()\n    {\n        var builder = MauiApp.CreateBuilder();\n        builder\n            .UseMauiApp&lt;App&gt;()\n            .AddAudio(\n                recordingOptions =&gt;\n                {\n#if IOS || MACCATALYST\n                    recordingOptions.Category = AVFoundation.AVAudioSessionCategory.Record;\n                    recordingOptions.Mode = AVFoundation.AVAudioSessionMode.Default;\n                    recordingOptions.CategoryOptions = AVFoundation.AVAudioSessionCategoryOptions.MixWithOthers;\n#endif\n                });\n\n        builder.Services.AddSingleton&lt;IAudioService, AudioService&gt;();\n        \/\/ more code\n    }\n}<\/code><\/pre>\n<p>Be sure to also review and implement any additional configuration steps in the <a href=\"https:\/\/github.com\/jfversluis\/Plugin.Maui.Audio\/blob\/main\/docs\/audio-recorder.md\">documentation<\/a>.<\/p>\n<p>Now the app is ready to capture some audio. 
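<\/p>\n<p>The transcription and chat services used later in this article can be registered in the same container. The following is only a sketch: the names (ITranscriptionService, WhisperTranscriptionService, ChatClientService) are taken from how the sample app uses them, and the actual registrations and lifetimes may differ.<\/p>\n<pre><code>\/\/ Sketch: service wiring assumed from the sample app\u2019s usage, not copied from its source\nbuilder.Services.AddSingleton&lt;ITranscriptionService, WhisperTranscriptionService&gt;();\nbuilder.Services.AddSingleton&lt;ChatClientService&gt;();<\/code><\/pre>\n<p>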
In VoicePage the user will tap the microphone button, start speaking, and tap again to end the recording.<\/p>\n<p>This is a trimmed version of the <a href=\"https:\/\/github.com\/davidortinau\/telepathy\/blob\/main\/src\/Telepathic\/PageModels\/VoicePageModel.cs#L93\">actual code<\/a> for starting and stopping the recording.<\/p>\n<pre><code>[RelayCommand]\nprivate async Task ToggleRecordingAsync()\n{\n    if (!IsRecording)\n    {\n        var status = await Permissions.CheckStatusAsync&lt;Permissions.Microphone&gt;();\n        if (status != PermissionStatus.Granted)\n        {\n            status = await Permissions.RequestAsync&lt;Permissions.Microphone&gt;();\n            if (status != PermissionStatus.Granted)\n            {\n                \/\/ more code\n                return;\n            }\n        }\n\n        _recorder = _audioManager.CreateRecorder();\n        await _recorder.StartAsync();\n\n        IsRecording = true;\n        RecordButtonText = \"\u23f9 Stop\";\n    }\n    else\n    {\n        _audioSource = await _recorder.StopAsync();\n        IsRecording = false;\n        RecordButtonText = \"\ud83c\udfa4 Record\";\n\n        \/\/ more code\n\n        await TranscribeAsync();\n    }\n}<\/code><\/pre>\n<p>Once it has the audio stream, it can start transcribing and processing it. 
(<a href=\"https:\/\/github.com\/davidortinau\/telepathy\/blob\/main\/src\/Telepathic\/PageModels\/VoicePageModel.cs#L173\">source<\/a>)<\/p>\n<pre><code>private async Task TranscribeAsync()\n{\n    string audioFilePath = Path.Combine(FileSystem.CacheDirectory, $\"recording_{DateTime.Now:yyyyMMddHHmmss}.wav\");\n\n    if (_audioSource != null)\n    {\n        await using (var fileStream = File.Create(audioFilePath))\n        {\n            var audioStream = _audioSource.GetAudioStream();\n            await audioStream.CopyToAsync(fileStream);\n        }\n\n        Transcript = await _transcriber.TranscribeAsync(audioFilePath, CancellationToken.None);\n\n        await ExtractTasksAsync();\n    }\n}<\/code><\/pre>\n<p>In this sample app, I used Microsoft.Extensions.AI with OpenAI to perform the transcription with the whisper-1 <a href=\"https:\/\/openai.com\/index\/whisper\/\">model trained specifically for this use case<\/a>. 
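<\/p>\n<p>The page model depends on a small transcription abstraction rather than on OpenAI directly, which is what makes the backend swappable. A minimal sketch of that interface, inferred from the call above (the sample\u2019s actual definition may declare additional members):<\/p>\n<pre><code>\/\/ Inferred from _transcriber.TranscribeAsync(audioFilePath, CancellationToken.None);\n\/\/ not copied from the sample source.\npublic interface ITranscriptionService\n{\n    Task&lt;string&gt; TranscribeAsync(string path, CancellationToken ct);\n}<\/code><\/pre>\n<p>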
There are certainly other methods of doing this, including on-device with <a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/maui\/essentials\/speech-to-text\">SpeechToText<\/a> in the .NET MAUI Community Toolkit.<\/p>\n<p>By using Microsoft.Extensions.AI I can easily swap in another cloud-based AI service, use a local LLM with <a href=\"https:\/\/onnxruntime.ai\/docs\/tutorials\/mobile\/\">ONNX<\/a>, or later choose another on-device solution.<\/p>\n<pre><code>using Microsoft.Extensions.AI;\nusing OpenAI;\n\nnamespace Telepathic.Services;\n\npublic class WhisperTranscriptionService : ITranscriptionService\n{\n    public async Task&lt;string&gt; TranscribeAsync(string path, CancellationToken ct)\n    {\n        var openAiApiKey = Preferences.Default.Get(\"openai_api_key\", string.Empty);\n        var client = new OpenAIClient(openAiApiKey);\n\n        try\n        {\n            await using var stream = File.OpenRead(path);\n            var result = await client\n                            .GetAudioClient(\"whisper-1\")\n                            .TranscribeAudioAsync(stream, \"file.wav\", cancellationToken: ct);\n\n            return result.Value.Text.Trim();\n        }\n        catch (Exception ex)\n        {\n            \/\/ Will add better error handling in Phase 5\n            throw new Exception($\"Failed to transcribe audio: {ex.Message}\", ex);\n        }\n    }\n}<\/code><\/pre>\n<h3>Making sense and structure<\/h3>\n<p>Once I have the transcript, I can have my AI service make sense of it and return projects and tasks using the same client. This happens in the ExtractTasksAsync method referenced above. The key parts of that method are below. 
(<a href=\"https:\/\/github.com\/davidortinau\/telepathy\/blob\/main\/src\/Telepathic\/PageModels\/VoicePageModel.cs#L250\">source<\/a>)<\/p>\n<pre><code>private async Task ExtractTasksAsync()\n{\n    var prompt = $@\"\n        Extract projects and tasks from this voice memo transcript.\n        Analyze the text to identify actionable tasks I need to keep track of. Use the following instructions:\n        1. Tasks are actionable items that can be completed, such as 'Buy groceries' or 'Call Mom'.\n        2. Projects are larger tasks that may contain multiple smaller tasks, such as 'Plan birthday party' or 'Organize closet'.\n        3. Tasks must be grouped under a project and cannot be grouped under multiple projects.\n        4. Any mentioned due dates use the YYYY-MM-DD format.\n\n        Here's the transcript: {Transcript}\";\n\n    var chatClient = _chatClientService.GetClient();\n    var response = await chatClient.GetResponseAsync&lt;ProjectsJson&gt;(prompt);\n\n    if (response?.Result != null)\n    {\n        Projects = response.Result.Projects;\n    }\n}<\/code><\/pre>\n<p>The _chatClientService is an injected service class that handles the creation and retrieval of the IChatClient instance provided by Microsoft.Extensions.AI. Here I call the GetResponseAsync method, passing a prompt and a strongly typed response type, and the LLM (gpt-4o-mini in this case) returns a ProjectsJson response. The response includes a Projects list with which I can proceed.<\/p>\n<h2>Co-creation<\/h2>\n<p>Now I\u2019ve gone from having an app that only took data entry input via a form to an app that can also take unstructured voice input and produce structured data. 
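<\/p>\n<p>ProjectsJson is the strongly typed shape the model is asked to fill. The sample\u2019s actual type definitions aren\u2019t shown in this article, so the following is a plausible sketch inferred from how the type is used; everything other than ProjectsJson and its Projects list is an assumption.<\/p>\n<pre><code>\/\/ Hypothetical shape; only ProjectsJson and its Projects property are confirmed by the article.\npublic class ProjectsJson\n{\n    public List&lt;ProjectJson&gt; Projects { get; set; } = new();\n}\n\npublic class ProjectJson\n{\n    public string Name { get; set; } = string.Empty;\n    public List&lt;TaskJson&gt; Tasks { get; set; } = new();\n}\n\npublic class TaskJson\n{\n    public string Title { get; set; } = string.Empty;\n    public string? DueDate { get; set; } \/\/ YYYY-MM-DD when a due date is mentioned\n}<\/code><\/pre>\n<p>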
While I was tempted to just insert the results into the database and claim success, there was more to do to make this a truly satisfying experience.<\/p>\n<p>There\u2019s a reasonable chance that the project name needs to be adjusted for clarity, or that some task was misheard or, worse yet, omitted. To address this, I add an approval step where the user can see the projects and tasks as recommendations and choose to accept them as-is or with changes. This is not much different from the experience we have now in Copilot when changes are made and we have the option to iterate further, keep, or discard.<\/p>\n<p>For more guidance like this on designing great AI experiences in your apps, consider checking out the <a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a> and <a href=\"https:\/\/aka.ms\/RAI\">Microsoft AI Principles<\/a>.<\/p>\n<h2>Resources<\/h2>\n<p>Here are key resources mentioned in this article to help you implement multimodal AI capabilities in your .NET MAUI apps:<\/p>\n<p><a href=\"https:\/\/youtu.be\/tFOFU7LDQlA?si=wJy9xZtO2kuw1sq9\">AI infused mobile &amp; desktop app development with .NET MAUI<\/a><br \/>\n<a href=\"https:\/\/github.com\/jfversluis\/Plugin.Maui.Audio\">Plugin.Maui.Audio<\/a> \u2013 NuGet package for handling audio recording and playback in .NET MAUI apps<br \/>\n<a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">Microsoft.Extensions.AI<\/a> \u2013 Framework for integrating AI capabilities into .NET applications<br \/>\n<a href=\"https:\/\/openai.com\/index\/whisper\/\">Whisper Model<\/a> \u2013 OpenAI\u2019s speech-to-text model used for audio transcription<br \/>\n<a href=\"https:\/\/learn.microsoft.com\/dotnet\/communitytoolkit\/maui\/essentials\/speech-to-text\">SpeechToText in .NET MAUI Community Toolkit<\/a> \u2013 On-device alternative for speech recognition<br \/>\n<a href=\"https:\/\/onnxruntime.ai\/docs\/tutorials\/mobile\/\">ONNX Runtime<\/a> \u2013 For running local LLMs on mobile 
devices<br \/>\n<a href=\"https:\/\/www.microsoft.com\/research\/project\/hax-toolkit\">HAX Toolkit<\/a> \u2013 Design guidance for AI experiences in applications<br \/>\n<a href=\"https:\/\/aka.ms\/RAI\">Microsoft AI Principles<\/a> \u2013 Guidelines for responsible AI implementation<br \/>\n<a href=\"https:\/\/github.com\/davidortinau\/telepathy\">Telepathy Sample App Source Code<\/a> \u2013 Complete implementation example referenced in this article<\/p>\n<h2>Summary<\/h2>\n<p>In this article, we explored how to enhance .NET MAUI applications with multimodal AI capabilities, focusing on voice interaction. We covered how to implement audio recording using <a href=\"https:\/\/github.com\/jfversluis\/Plugin.Maui.Audio\">Plugin.Maui.Audio<\/a>, transcribe speech using <a href=\"https:\/\/learn.microsoft.com\/dotnet\/ai\/\">Microsoft.Extensions.AI<\/a> with OpenAI\u2019s <a href=\"https:\/\/openai.com\/index\/whisper\/\">Whisper model<\/a>, and extract structured data from unstructured voice input.<\/p>\n<p>By combining these technologies, you can transform a traditional form-based app into one that accepts voice commands and intelligently processes them into actionable data. 
The implementation works across all platforms with a single codebase, making it accessible for any .NET MAUI developer.<\/p>\n<p>With these techniques, you can significantly enhance user experience by supporting multiple interaction modes, making your applications more accessible and intuitive, especially on mobile devices where voice input can be much more convenient than typing.<\/p>\n<p>The post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/multimodal-voice-intelligence-with-dotnet-maui\/\">Multimodal Voice Intelligence with .NET MAUI<\/a> appeared first on <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\">.NET Blog<\/a>.<\/p>","protected":false}}