{"id":1228,"date":"2024-09-12T11:13:15","date_gmt":"2024-09-12T11:13:15","guid":{"rendered":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/09\/12\/performance-improvements-in-net-9\/"},"modified":"2024-09-12T11:13:15","modified_gmt":"2024-09-12T11:13:15","slug":"performance-improvements-in-net-9","status":"publish","type":"post","link":"https:\/\/rssfeedtelegrambot.bnaya.co.il\/index.php\/2024\/09\/12\/performance-improvements-in-net-9\/","title":{"rendered":"Performance Improvements in .NET 9"},"content":{"rendered":"<p>Each year, summer arrives to find me daunted and excited to write about the performance improvements in the upcoming release of .NET. \u201cDaunted,\u201d because these posts, covering <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">.NET 8<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-7\/\">.NET 7<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-6\">.NET 6<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-5\">.NET 5<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-3-0\">.NET Core 3.0<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core-2-1\">.NET Core 2.1<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-core\">.NET Core 2.0<\/a>, have garnered a bit of a reputation I want to ensure the next iteration lives up to. And \u201cexcited,\u201d because there\u2019s such an abundance of material to cover due to just how much goodness has been packed into the next .NET release, I struggle to get it all written down as quickly as my thoughts whirl.<\/p>\n<p>And so, every year, I start these posts talking about how the next release of .NET is the fastest and best release to date. 
That\u2019s true for .NET 9 as well, of course, but the statement that .NET 9 is the fastest and best release of .NET to date is now a bit\u2026 mundane. So, let\u2019s spice it up a bit. How about\u2026 a haiku?<\/p>\n<p>As the falcon flies,<br \/>\n.NET 9 speeds joy into<br \/>\nDevelopers&#8217; hearts.<\/p>\n<p>Or, maybe a limerick:<\/p>\n<p>In the coding world, there&#8217;s a star,<br \/>\n.NET 9, the best by far.<br \/>\nWith speed that&#8217;s supreme,<br \/>\nIt&#8217;s every coder&#8217;s dream,<br \/>\nTaking development to a new par.<\/p>\n<p>A little gimmicky? Maybe something more classical, a sonnet perhaps:<\/p>\n<p>In realms of code where brilliance finds its way,<br \/>\n.NET 9 shines with an unmatched array.<br \/>\nIts speed and grace, a marvel to behold,<br \/>\nTransforming tasks to treasures, fast and bold.<\/p>\n<p>Developers, with joy, embrace its might,<br \/>\nTheir projects soar, efficiency in sight.<br \/>\nNo longer bound by limits of the past,<br \/>\nIn .NET 9, their dreams are built to last.<\/p>\n<p>Its libs, a symphony of pure delight,<br \/>\nTurning complex to simple, dim to light.<br \/>\nWith every line of code, a masterpiece,<br \/>\nIn .NET 9, dev burdens find release.<\/p>\n<p>Oh, wondrous .NET 9, you light the way,<br \/>\nIn your embrace, our future&#8217;s bright as day.<\/p>\n<p>Ok, so, yeah, I should stick to writing software rather than poetry (something with which my college poetry professor likely agreed). Nevertheless, the sentiment remains: .NET 9 is an incredibly exciting release. More than 7,500 pull requests (PRs) have merged into <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> in the last year, of which a significant percentage have touched on performance in one way, shape, or form. In this post, we\u2019ll take a tour through over 350 PRs that have all found their way into packing .NET 9 full of performance yumminess. 
Please grab a large cup of your favorite hot beverage, sit back, settle in, and enjoy.<\/p>\n<h2>Benchmarking Setup<\/h2>\n<p>In this post, I\u2019ve included micro-benchmarks to showcase various performance improvements. Most of these benchmarks are implemented using <a href=\"https:\/\/github.com\/dotnet\/benchmarkdotnet\">BenchmarkDotNet<\/a> <a href=\"https:\/\/www.nuget.org\/packages\/BenchmarkDotNet\/0.14.0\">v0.14.0<\/a>, and, unless otherwise noted, there is a simple setup for each.<\/p>\n<p>To follow along, first make sure you have <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/8.0\">.NET 8<\/a> and <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/9.0\">.NET 9<\/a> installed. The numbers I share were gathered using the .NET 9 Release Candidate.<\/p>\n<p>Once you have the appropriate prerequisites installed, create a new C# project in a new benchmarks directory:<\/p>\n<p>dotnet new console -o benchmarks<br \/>\ncd benchmarks<\/p>\n<p>The resulting directory will contain two files: benchmarks.csproj, which is the project file with information about how the application should be compiled, and Program.cs, which contains the code for the application. 
Replace the entire contents of benchmarks.csproj with this:<\/p>\n<p>&lt;Project Sdk=&quot;Microsoft.NET.Sdk&quot;&gt;<\/p>\n<p>  &lt;PropertyGroup&gt;<br \/>\n    &lt;OutputType&gt;Exe&lt;\/OutputType&gt;<br \/>\n    &lt;TargetFrameworks&gt;net9.0;net8.0&lt;\/TargetFrameworks&gt;<br \/>\n    &lt;LangVersion&gt;Preview&lt;\/LangVersion&gt;<br \/>\n    &lt;ImplicitUsings&gt;enable&lt;\/ImplicitUsings&gt;<br \/>\n    &lt;Nullable&gt;enable&lt;\/Nullable&gt;<br \/>\n    &lt;AllowUnsafeBlocks&gt;true&lt;\/AllowUnsafeBlocks&gt;<br \/>\n    &lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt;<br \/>\n  &lt;\/PropertyGroup&gt;<\/p>\n<p>  &lt;ItemGroup&gt;<br \/>\n    &lt;PackageReference Include=&quot;BenchmarkDotNet&quot; Version=&quot;0.14.0&quot; \/&gt;<br \/>\n  &lt;\/ItemGroup&gt;<\/p>\n<p>&lt;\/Project&gt;<\/p>\n<p>The preceding project file tells the build system we want:<\/p>\n<p>to build a runnable application, as opposed to a library.<br \/>\nto be able to run on both .NET 8 and .NET 9, so that BenchmarkDotNet can build multiple versions of the application, one to run on each version, in order to compare the results.<br \/>\nto be able to use all of the latest features from the C# language even though C# 13 hasn\u2019t officially shipped yet.<br \/>\nto automatically import common namespaces.<br \/>\nto be able to use nullable reference type annotations in the code.<br \/>\nto be able to use the unsafe keyword in the code.<br \/>\nto configure the garbage collector (GC) into its \u201cserver\u201d configuration, which impacts the trade-offs it makes between memory consumption and throughput. 
This isn\u2019t required, but it\u2019s how most services are configured.<br \/>\nto pull in BenchmarkDotNet v0.14.0 from NuGet so that we\u2019re able to use the library in Program.cs.<\/p>\n<p>For each benchmark, I\u2019ve then included the full Program.cs source; to test it, just replace the entire contents of your Program.cs with the shown benchmark. Each test may be configured slightly differently from others, in order to highlight the key aspects being shown. For example, some tests include the [MemoryDiagnoser(false)] attribute, which tells BenchmarkDotNet to track allocation-related metrics, or the [DisassemblyDiagnoser] attribute, which tells BenchmarkDotNet to find and share the assembly code for the test, or the [HideColumns] attribute, which removes some output columns that BenchmarkDotNet might otherwise emit but that are unnecessary clutter for our needs in this post.<\/p>\n<p>Running the benchmarks is then simple. Each test includes a comment at its top for the dotnet command to use to run the benchmark. It\u2019s typically something like this:<\/p>\n<p>dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>That:<\/p>\n<p>builds the benchmarks in a Release build. Compiling for Release is important as both the C# compiler and the JIT compiler have optimizations that are disabled for Debug. Thankfully, BenchmarkDotNet warns if Debug is accidentally used:<br \/>\n\/\/ Validating benchmarks:<br \/>\n\/\/    * Assembly Benchmarks which defines benchmarks is non-optimized<br \/>\nBenchmark was built without optimization enabled (most probably a DEBUG configuration). Please, build it in RELEASE.<br \/>\nIf you want to debug the benchmarks, please see https:\/\/benchmarkdotnet.org\/articles\/guides\/troubleshooting.html#debugging-benchmarks.<\/p>\n<p>targets .NET 8 for the host project. 
There are multiple builds involved here: the \u201chost\u201d application you run with the above command, which uses BenchmarkDotNet, which will in turn generate and build an application per target runtime. Because the code for the benchmark is compiled into all of these, you typically want the host project to target the oldest runtime you\u2019ll be testing, so that building the host application will fail if you try to use an API that\u2019s not available in all of the target runtimes.<br \/>\nruns all of the benchmarks in the whole program. If you don\u2019t specify the --filter argument, BenchmarkDotNet will prompt you to ask which benchmarks to run. By specifying \u201c*\u201d, we\u2019re saying \u201cdon\u2019t prompt, just run \u2019em all.\u201d You can also specify an expression to filter down which subset of the tests you want invoked.<br \/>\nruns the tests on both .NET 8 and .NET 9.<\/p>\n<p>Throughout the post, I\u2019ve shown many benchmarks and the results I received from running them. Unless otherwise stated (e.g. because I\u2019m demonstrating an OS-specific improvement), the results shown for benchmarks are from running them on Linux (Ubuntu 22.04) on an x64 processor.<\/p>\n<p>BenchmarkDotNet v0.14.0, Ubuntu 22.04.3 LTS (Jammy Jellyfish) WSL<br \/>\n11th Gen Intel Core i9-11950H 2.60GHz, 1 CPU, 16 logical and 8 physical cores<br \/>\n.NET SDK 9.0.100-rc.1.24452.12<br \/>\n  [Host]     : .NET 9.0.0 (9.0.24.43107), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI<\/p>\n<p>My standard caveat: these are micro-benchmarks, often measuring operations that take very short periods of time, but where improvements to those times add up to be impactful when executed over and over and over. Different hardware, different operating systems, what other processes might be running on your machine, who you had breakfast with this morning, and the alignment of the planets can all impact the numbers you get out. 
In short, the numbers you see are unlikely to match exactly the numbers I share here; however, I\u2019ve chosen benchmarks that should be broadly repeatable.<\/p>\n<p>With all that out of the way, let\u2019s do this!<\/p>\n<h2>JIT<\/h2>\n<p>Improvements in .NET show up at all levels of the stack. Some changes result in large improvements in one specific area. Other changes result in small improvements across many things. When it comes to broad-reaching impact, there are few areas of .NET where changes are more broadly impactful than those made to the Just-In-Time (JIT) compiler. Code generation improvements help make everything better, and it\u2019s where we\u2019ll start our journey.<\/p>\n<h3>PGO<\/h3>\n<p>In <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">Performance Improvements in .NET 8<\/a>, I called out the enabling of dynamic profile-guided optimization (PGO) as my favorite feature in the release, so PGO seems like a good place to start for .NET 9.<\/p>\n<p>As a brief refresher, dynamic PGO is a feature that enables the JIT to profile code and use what it learns from that profiling to help it generate more efficient code based on the exact usage patterns of the application. The JIT utilizes tiered compilation, which allows code to be compiled and then re-compiled, possibly multiple times, achieving something new each time the code is compiled. For example, a typical method might start out at \u201ctier 0,\u201d where the JIT applies very few optimizations and has a goal of simply getting to functional assembly as quickly as possible. This helps with startup performance, as optimizations are one of the most costly things a compiler does. 
Then the runtime tracks the number of times the method is invoked, and if the number of invocations trips over a particular threshold, such that it seems like performance could actually matter, the JIT will re-generate code for it, still at tier 0, but this time with a bunch of additional instrumentation injected into the method, tracking all manner of things that could help the JIT better optimize, e.g. for a given virtual dispatch, what is the most common type on which the call is being performed. Then after enough data has been gathered, the JIT can compile the method yet again, this time at \u201ctier 1,\u201d fully optimized, also incorporating all of the learnings from that profile data. This same flow is relevant as well for code that\u2019s already been pre-compiled with ReadyToRun (R2R), except instead of instrumenting tier 0 code, the JIT will generate optimized, instrumented code on its way to generating a re-optimized implementation.<\/p>\n<p>In .NET 8, the JIT in particular paid attention to PGO data about types and methods involved in virtual, interface, and delegate dispatch. In .NET 9, it\u2019s also able to use PGO data to optimize casts. Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90594\">dotnet\/runtime#90594<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90735\">dotnet\/runtime#90735<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96597\">dotnet\/runtime#96597<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96731\">dotnet\/runtime#96731<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97773\">dotnet\/runtime#97773<\/a>, dynamic PGO is now able to track the most common input types to cast operations (castclass\/isinst, e.g. what you get from doing operations like (T)obj or obj is T), and then when generating the optimized code, emit special checks that add fast paths for the most common types. 
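The shape of the guarded check that PGO produces can be sketched in source. The following is a hand-written approximation (the A\/B\/C type names mirror the benchmark that follows; the real JIT compares method tables directly rather than calling GetType, so treat this as an illustration only):

```csharp
public class A { }
public class B : A { }
public class C : B { }

public static class GuardedCast
{
    // Sketch of the guarded type check dynamic PGO enables. C is assumed
    // (from profiling) to be the most common concrete type of obj.
    public static bool IsB(A? obj)
    {
        // Fast path: an exact type check against the profiled type C.
        // A match proves the answer, because C derives from B.
        if (obj is not null && obj.GetType() == typeof(C))
        {
            return true;
        }

        // Fallback: the general, more expensive type check.
        return obj is B;
    }
}
```

For an A, a B, and a C instance, IsB returns false, true, and true respectively, matching the semantics of plain obj is B; the win is that the common case is answered with a single equality comparison.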
For example, in the following benchmark, we have a field of type A initialized to a type C that\u2019s derived from both B and A. Then the benchmark is type checking the instance stored in that A field to see whether it\u2019s a B or anything derived from B.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private A _obj = new C();<\/p>\n<p>    [Benchmark]<br \/>\n    public bool IsInstanceOf() =&gt; _obj is B;<\/p>\n<p>    public class A { }<br \/>\n    public class B : A { }<br \/>\n    public class C : B { }<br \/>\n}<\/p>\n<p>That IsInstanceOf benchmark results in the following disassembly on .NET 8:<\/p>\n<p>; Tests.IsInstanceOf()<br \/>\n       push      rax<br \/>\n       mov       rsi,[rdi+8]<br \/>\n       mov       rdi,offset MT_Tests+B<br \/>\n       call      qword ptr [7F3D91524360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)<br \/>\n       test      rax,rax<br \/>\n       setne     al<br \/>\n       movzx     eax,al<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 35<\/p>\n<p>but now on .NET 9, it produces this:<\/p>\n<p>; Tests.IsInstanceOf()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rsi,[rdi+8]<br \/>\n       mov       rcx,rsi<br \/>\n       test      rcx,rcx<br \/>\n       je        short M00_L00<br \/>\n       mov       rax,offset MT_Tests+C<br \/>\n       cmp       [rcx],rax<br \/>\n       jne       short M00_L01<br \/>\nM00_L00:<br \/>\n       test      rcx,rcx<br \/>\n       setne     al<br \/>\n      
 movzx     eax,al<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       mov       rdi,offset MT_Tests+B<br \/>\n       call      System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)<br \/>\n       mov       rcx,rax<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 62<\/p>\n<p>On .NET 8, it\u2019s loading the reference to the object and the desired method table for B, and calling the CastHelpers.IsInstanceOfClass JIT helper to do the type check. On .NET 9, instead it\u2019s loading the method table for C, which it saw during profiling to be the most common type used, and then comparing that against the actual object\u2019s method table. If they match, since the JIT knows that C derives from B, it then knows the object is in fact a B. If they don\u2019t match, then it jumps down to the fallback path where it does the same thing that was being done on .NET 8, loading the reference and the desired method table for B and calling IsInstanceOfClass.<\/p>\n<p>It\u2019s also capable of optimizing for the negative case where the cast most often fails. 
Consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private object _obj = &quot;hello&quot;;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool IsInstanceOf() =&gt; _obj is Tests;<br \/>\n}<\/p>\n<p>On .NET 9, we get this assembly:<\/p>\n<p>; Tests.IsInstanceOf()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rsi,[rdi+8]<br \/>\n       mov       rcx,rsi<br \/>\n       test      rcx,rcx<br \/>\n       je        short M00_L00<br \/>\n       mov       rax,offset MT_System.String<br \/>\n       cmp       [rcx],rax<br \/>\n       jne       short M00_L01<br \/>\n       xor       ecx,ecx<br \/>\nM00_L00:<br \/>\n       test      rcx,rcx<br \/>\n       setne     al<br \/>\n       movzx     eax,al<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       mov       rdi,offset MT_Tests<br \/>\n       call      System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)<br \/>\n       mov       rcx,rax<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 64<\/p>\n<p>Here the incoming object is always a string and never the Tests class that\u2019s being tested for. 
The generated code is comparing the incoming object against string, and then, assuming the types match, the JIT knows the object is not a Tests.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96311\">dotnet\/runtime#96311<\/a> also breaks new ground with dynamic PGO, by teaching it how to profile integers and paying attention to their most common values. Then in conjunction with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96571\">dotnet\/runtime#96571<\/a>, it uses this super power to optimize Buffer.Memmove (which is the workhorse behind methods like Span&lt;T&gt;.CopyTo) and SpanHelpers.SequenceEqual (which is the implementation behind methods like string.Equals). Previously, the JIT was taught how to unroll such operations, where if a constant length was provided, the JIT could generate the exact code sequence to implement the operation for that length. Now with this capability, the JIT can track the most common lengths provided to these methods, and if there\u2019s one length that really stands out, it can special-case it, unrolling and vectorizing the operation when the length matches and falling back to calling the original when it doesn\u2019t. While this is expected to improve in the future, for .NET 9 this set of length-profiling optimizations only kicks in when R2R is disabled, as the JIT is otherwise unable to do the exact profiling required. 
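To make the length specialization concrete, here is a hypothetical source-level rendering of what the JIT effectively produces for a sequence-equality check when profiling shows one dominant length. Length 4 is assumed purely for illustration, and the actual specialization is emitted as inline machine code rather than C#:

```csharp
using System.Runtime.InteropServices;

public static class LengthSpecialized
{
    // Sketch: a SequenceEqual with a guarded fast path for the profiled
    // hot length. For length 4, a single unrolled 32-bit comparison
    // replaces the general-purpose comparison loop.
    public static bool SequenceEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }

        if (a.Length == 4)
        {
            // Unrolled fast path for the common length: compare all four
            // bytes at once as a single 32-bit load on each side.
            return MemoryMarshal.Read<uint>(a) == MemoryMarshal.Read<uint>(b);
        }

        // Fall back to the general implementation for all other lengths.
        return a.SequenceEqual(b);
    }
}
```

When callers overwhelmingly pass the profiled length, almost every invocation takes the cheap unrolled path, while correctness for all other lengths is preserved by the fallback.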
Disabling R2R is something services can do when startup performance isn\u2019t a big concern and they instead care about maximum throughput at run-time.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>var config = DefaultConfig.Instance<br \/>\n    .AddJob(Job.Default.WithId(&quot;.NET 8&quot;).WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(&quot;DOTNET_ReadyToRun&quot;, &quot;0&quot;))<br \/>\n    .AddJob(Job.Default.WithId(&quot;.NET 9&quot;).WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable(&quot;DOTNET_ReadyToRun&quot;, &quot;0&quot;));<br \/>\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;, &quot;a&quot;, &quot;b&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&quot;abcd&quot;, &quot;abcg&quot;)]<br \/>\n    public bool Equals(string a, string b) =&gt; a == b;<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Code Size<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Equals<\/td><td>.NET 8.0<\/td><td>2.8592 ns<\/td><td>78 B<\/td><\/tr>\n<tr><td>Equals<\/td><td>.NET 9.0<\/td><td>0.6754 ns<\/td><td>87 B<\/td><\/tr>\n<\/tbody>\n<\/table>\n<h3>Tier 0<\/h3>\n<p>Tier 0 is all about getting to functioning code quickly, and as such most optimizations are disabled. However, every now and then there\u2019s a reason to do a bit more optimization in tier 0, in situations where the benefits of doing so outweigh the cons. Several of those occurred in .NET 9.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104815\">dotnet\/runtime#104815<\/a> is a simple example. 
The ArgumentNullException.ThrowIfNull method is now used in thousands upon thousands of places for doing argument validation. It\u2019s a non-generic method, accepting an object argument and checking to see whether it\u2019s null. That non-genericity causes some friction for folks when it\u2019s used with value types. It\u2019s rare for someone to directly call ThrowIfNull with a value type (other than maybe with a Nullable&lt;T&gt;), and in fact if they do, thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6815\">dotnet\/roslyn-analyzers#6815<\/a> from <a href=\"https:\/\/github.com\/CollinAlpert\">@CollinAlpert<\/a>, there\u2019s now the <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca2264\">CA2264<\/a> analyzer that will warn that what\u2019s being done is nonsensical.<\/p>\n<p>Instead, the common case is when the argument being validated is an unconstrained generic. In such cases, if the generic argument ends up being a value type, it\u2019ll be boxed in the call to ThrowIfNull. That boxing allocation gets removed in tier 1, because the ThrowIfNull call gets inlined and the JIT can see at the call site that the boxing was unnecessary. But, because inlining doesn\u2019t happen in tier 0, such boxing has remained in tier 0. As the API is so ubiquitous, this caused developers to fret that there was something bad happening, and it caused enough consternation that the JIT now special-cases ArgumentNullException.ThrowIfNull and avoids the boxing, even in tier 0. 
This is easy to see with a little test console app:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot;<br \/>\n\/\/ dotnet run -c Release -f net9.0 --filter &quot;*&quot;<\/p>\n<p>using System.Runtime.CompilerServices;<\/p>\n<p>while (true)<br \/>\n{<br \/>\n    Test();<br \/>\n}<\/p>\n<p>[MethodImpl(MethodImplOptions.NoInlining)]<br \/>\nstatic void Test()<br \/>\n{<br \/>\n    long gc = GC.GetAllocatedBytesForCurrentThread();<br \/>\n    for (int i = 0; i &lt; 100; i++)<br \/>\n    {<br \/>\n        ThrowIfNull(i);<br \/>\n    }<br \/>\n    gc = GC.GetAllocatedBytesForCurrentThread() - gc;<\/p>\n<p>    Console.WriteLine(gc);<br \/>\n    Thread.Sleep(1000);<br \/>\n}<\/p>\n<p>static void ThrowIfNull&lt;T&gt;(T value) =&gt; ArgumentNullException.ThrowIfNull(value);<\/p>\n<p>When I run that on .NET 8, I get results like this:<\/p>\n<p>2400<br \/>\n2400<br \/>\n2400<br \/>\n0<br \/>\n0<br \/>\n0<\/p>\n<p>The first few iterations are invoking Test() at tier 0, such that each call to ArgumentNullException.ThrowIfNull boxes the input int. Then when the method gets recompiled at tier 1, the boxing gets elided, and we stabilize at zero allocation. Now on .NET 9, I get results like this:<\/p>\n<p>0<br \/>\n0<br \/>\n0<br \/>\n0<br \/>\n0<br \/>\n0<\/p>\n<p>With these tweaks to tier 0, the boxing is also elided in tier 0, and so starts out without any allocation.<\/p>\n<p>Another tier 0 boxing example is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90496\">dotnet\/runtime#90496<\/a>. There\u2019s a hot path method in the async\/await machinery: AsyncTaskMethodBuilder&lt;TResult&gt;.AwaitUnsafeOnCompleted (see <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/how-async-await-really-works\/\">How Async\/Await Really Works in C#<\/a> for all the details). It\u2019s really important that this method be optimized well, but it performs various type tests that can end up boxing in tier 0. 
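The problematic shape is a type test on an unconstrained (or value-type) generic. A reduced, hypothetical sketch of such a pattern, where ISpecialAwaiter is a made-up stand-in for the runtime's internal awaiter interfaces:

```csharp
// Hypothetical stand-in for the runtime's internal awaiter interfaces;
// the name is invented for illustration.
public interface ISpecialAwaiter { }

// A sample value-type awaiter implementing that interface.
public struct BoxyAwaiter : ISpecialAwaiter { }

public static class AwaiterChecks
{
    // Type testing a generic: in unoptimized tier 0 code, the "is" test
    // on a value-type TAwaiter historically required boxing the awaiter;
    // optimized code (and now tier 0 as well) can elide that box.
    public static bool IsSpecial<TAwaiter>(in TAwaiter awaiter)
        => awaiter is ISpecialAwaiter;
}
```

The test itself is semantically trivial; the interesting part is that, before this change, each tier 0 invocation with a struct awaiter paid for a boxing allocation just to perform it.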
In a previous release, that boxing was deemed too impactful to startup for async methods invoked early in an application\u2019s lifetime, so [MethodImpl(MethodImplOptions.AggressiveOptimization)] was used to opt the method out of tiering, such that it gets optimized from the get-go. But that itself has downsides, because if it skips tiering up, it also skips dynamic PGO, and thus the optimized code isn\u2019t as good as it possibly could be. So, this PR specifically addresses those type test patterns that box, removing the boxing in tier 0, enabling the removal of that AggressiveOptimization from AwaitUnsafeOnCompleted, and thereby enabling better optimized code generation for it.<\/p>\n<p>Optimizations are avoided in tier 0 because they might slow down compilation. If there are really cheap optimizations, though, and they can have a meaningful impact, they can be worth enabling. That\u2019s especially true if the optimizations can actually help to make compilations and startup faster, such as by minimizing calls to helpers that may take locks, trigger certain kinds of loading, etc. And that\u2019s what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105190\">dotnet\/runtime#105190<\/a> does, enabling some constant folding in tier 0 at relatively little cost. Even with the low cost, though, there were still concerns about possible impact to JIT throughput, and the PR was fast-followed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105250\">dotnet\/runtime#105250<\/a> which optimized some JIT code paths to make up for any impact from the former change.<\/p>\n<p>Another similar case is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91403\">dotnet\/runtime#91403<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a>, which allows optimizations around RuntimeHelpers.CreateSpan to kick in for tier 0. 
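For context, RuntimeHelpers.CreateSpan is the helper the C# compiler emits when constant array data of a multi-byte element type is exposed as a ReadOnlySpan. A minimal illustration (the type and member names here are made up for the example):

```csharp
public static class Tables
{
    // The C# compiler lowers this constant array-to-ReadOnlySpan<int>
    // initialization into a RuntimeHelpers.CreateSpan<int> call over a
    // data blob baked into the assembly, avoiding an array allocation.
    public static ReadOnlySpan<int> Primes => new[] { 2, 3, 5, 7, 11 };

    public static int SumPrimes()
    {
        int sum = 0;
        foreach (int p in Primes)
        {
            sum += p;
        }
        return sum;
    }
}
```

Because this pattern is so common in startup paths (lookup tables, parsers, and the like), letting the CreateSpan optimizations apply in tier 0 pays off before methods ever tier up.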
Without that, the runtime can end up allocating many field stubs, which themselves add overhead to the startup path.<\/p>\n<h3>Loops<\/h3>\n<p>Applications spend a lot of time iterating through loops, and finding ways to reduce the overheads of loops has been a key focus for .NET 9. It\u2019s also been quite successful.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102261\">dotnet\/runtime#102261<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103181\">dotnet\/runtime#103181<\/a> help to remove some instructions from even the tightest of loops by converting upward counting loops into downward counting loops. Consider a loop like the following:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int UpwardCounting()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        for (int i = 0; i &lt; 100; i++)<br \/>\n        {<br \/>\n            count++;<br \/>\n        }<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Here\u2019s what the generated assembly code for that core loop looks like on .NET 8:<\/p>\n<p>M00_L00:<br \/>\n       inc       eax<br \/>\n       inc       ecx<br \/>\n       cmp       ecx,64<br \/>\n       jl        short M00_L00<\/p>\n<p>It\u2019s incrementing eax, which is storing count. And it\u2019s incrementing ecx, which is storing i. 
It\u2019s then comparing ecx against 100 (0x64) to see if it\u2019s reached the end of the loop, and jumping back up to the beginning of the loop if it hasn\u2019t.<\/p>\n<p>Now let\u2019s manually rewrite the loop to be downward counting:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int DownwardCounting()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        for (int i = 99; i &gt;= 0; i--)<br \/>\n        {<br \/>\n            count++;<br \/>\n        }<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>And here\u2019s what the generated assembly code for that core loop looks like there:<\/p>\n<p>M00_L00:<br \/>\n       inc       eax<br \/>\n       dec       ecx<br \/>\n       jns       short M00_L00<\/p>\n<p>The key observation here is that by counting down, we can replace the cmp\/jl pair that compares against a specific bound with just a jns that jumps if the value isn\u2019t negative. We\u2019ve thus removed an instruction from a tight loop that only had four to begin with. 
With the aforementioned PRs, the JIT can now do that transformation automatically where it\u2019s applicable and deemed valuable, such that the loop in UpwardCounting now results in the same assembly code on .NET 9 as does the loop in DownwardCounting.<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>UpwardCounting<\/td><td>.NET 8.0<\/td><td>30.27 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>UpwardCounting<\/td><td>.NET 9.0<\/td><td>26.52 ns<\/td><td>0.88<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>However, the JIT is only able to do this transformation if the iteration variable (i) isn\u2019t used in the body of the loop, and obviously there are many loops where it is, such as by indexing into an array being iterated over. Thankfully, other optimizations in .NET 9 are able to reduce the actual reliance on the iteration variable, such that this optimization now kicks in frequently.<\/p>\n<p>One such optimization is strength reduction in loops. In compilers, \u201cstrength reduction\u201d is the simple idea of taking something relatively expensive and replacing it with something cheaper. In the context of loops, that typically means introducing more \u201cinduction variables\u201d (variables whose values change in a predictable pattern on each iteration, such as being incremented by a constant amount). 
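Before looking at what this does to generated code, the transformation can be illustrated purely in source. A hypothetical before and after, where a per-iteration multiply is replaced by a new induction variable that is merely incremented:

```csharp
public static class StrengthReduction
{
    // Before: the value i * 4 is recomputed from i on every iteration.
    public static int SumWithMultiply(int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
        {
            sum += i * 4; // multiply inside the loop
        }
        return sum;
    }

    // After: a new induction variable (offset) tracks i * 4 directly,
    // so a cheap add replaces the multiply, and the loop body no longer
    // reads i at all.
    public static int SumWithInduction(int n)
    {
        int sum = 0;
        int offset = 0; // extra induction variable, always equal to i * 4
        for (int i = 0; i < n; i++)
        {
            sum += offset;
            offset += 4;
        }
        return sum;
    }
}
```

Both methods compute the same result; the second form is what the JIT now synthesizes internally, and because the body no longer depends on i, the downward-counting rewrite above can apply on top of it.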
For example, consider a simple loop that sums all of the elements of an array:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _array = Enumerable.Range(0, 1000).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Sum()<br \/>\n    {<br \/>\n        int[] array = _array;<br \/>\n        int sum = 0;<\/p>\n<p>        for (int i = 0; i &lt; array.Length; i++)<br \/>\n        {<br \/>\n            sum += array[i];<br \/>\n        }<\/p>\n<p>        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>We get the following assembly on .NET 8:<\/p>\n<p>; Tests.Sum()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rax,[rdi+8]<br \/>\n       xor       ecx,ecx<br \/>\n       xor       edx,edx<br \/>\n       mov       edi,[rax+8]<br \/>\n       test      edi,edi<br \/>\n       jle       short M00_L01<br \/>\nM00_L00:<br \/>\n       mov       esi,edx<br \/>\n       add       ecx,[rax+rsi*4+10]<br \/>\n       inc       edx<br \/>\n       cmp       edi,edx<br \/>\n       jg        short M00_L00<br \/>\nM00_L01:<br \/>\n       mov       eax,ecx<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 35<\/p>\n<p>The interesting part is the loop starting at M00_L00. i is being stored in edx (though it gets copied into esi), and as part of adding the next element from the array to sum (which is stored in ecx), we\u2019re loading that next value from the array with the address rax+rsi*4+10. 
A strength reduction view of this would say \u201crather than re-computing the address on each iteration, we can instead have another induction variable and increment it by 4 on each iteration.\u201d A key benefit of that is it then removes a dependency on i from inside of the loop, which then means the iteration variable is no longer used in the loop, enabling the aforementioned downward counting optimization to kick in. That leads to the following assembly on .NET 9:<\/p>\n<p>; Tests.Sum()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rax,[rdi+8]<br \/>\n       xor       ecx,ecx<br \/>\n       mov       edx,[rax+8]<br \/>\n       test      edx,edx<br \/>\n       jle       short M00_L01<br \/>\n       add       rax,10<br \/>\nM00_L00:<br \/>\n       add       ecx,[rax]<br \/>\n       add       rax,4<br \/>\n       dec       edx<br \/>\n       jne       short M00_L00<br \/>\nM00_L01:<br \/>\n       mov       eax,ecx<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 35<\/p>\n<p>Note the loop at M00_L00: it\u2019s now downward counting, reading the next value from the array is simply dereferencing the address in rax, and the address in rax is incremented by 4 each go around.<\/p>\n<p>A lot of work went into enabling this strength reduction, including providing the basic implementation (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104243\">dotnet\/runtime#104243<\/a>), enabling it by default (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105131\">dotnet\/runtime#105131<\/a>), finding more opportunities to apply it (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105169\">dotnet\/runtime#105169<\/a>), and using it to enable post-indexed addressing (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105181\">dotnet\/runtime#105181<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105185\">dotnet\/runtime#105185<\/a>), which is an Arm addressing mode where the 
address stored in the base register is used but then that register is updated to point to the next target memory location. A new phase was also added to the JIT to help with optimizing such induction variables (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97865\">dotnet\/runtime#97865<\/a>), and in particular, to do induction variable widening where 32-bit induction variables (think of every loop you\u2019ve ever written that starts with for (int i = &#8230;)) are widened to 64-bit induction variables. This widening can help to avoid zero extensions that might otherwise occur on every iteration of the loop.<\/p>\n<p>These optimizations are all new, but of course there are also many loop optimizations already present in the JIT compiler, from loop unrolling to loop cloning to loop hoisting. In order to apply such loop optimizations, though, the JIT first needs to recognize loops, and that can sometimes be more challenging than it would seem (<a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/43713#issue-727046316\">dotnet\/runtime#43713<\/a> describes a case where the JIT was failing to do so). Historically, the JIT\u2019s loop recognition was based on a relatively simplistic lexical analysis. In .NET 8, as part of the work to improve dynamic PGO, a more powerful graph-based loop analyzer was added that was able to recognize many more loops. For .NET 9 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95251\">dotnet\/runtime#95251<\/a>, that analyzer was factored out so that it could be used for generalized loop reasoning. 
And then with PRs like <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96756\">dotnet\/runtime#96756<\/a> for loop alignment, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96754\">dotnet\/runtime#96754<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96553\">dotnet\/runtime#96553<\/a> for loop cloning, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96752\">dotnet\/runtime#96752<\/a> for loop unrolling, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96751\">dotnet\/runtime#96751<\/a> for loop canonicalization, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96753\">dotnet\/runtime#96753<\/a> for loop hoisting, many of these loop-related optimizations have now been moved to the better scheme. All of that means that more loops get optimized.<\/p>\n<h3>Bounds Checks<\/h3>\n<p>.NET code is, by default, \u201cmemory safe.\u201d Unlike in C, where you can iterate through an array and easily walk off the end of it, by default accesses to arrays, strings, and spans are \u201cbounds checked\u201d to ensure you can\u2019t walk off the end or before the beginning. Of course, such bounds checking adds overhead, and so wherever the JIT can prove that adding such checks would be unnecessary, it\u2019ll elide the bounds check, knowing that it\u2019s impossible for the guarded accesses to be problematic. The quintessential example of this is a loop over an array from 0 to array.Length. 
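Conceptually, every such guarded access behaves as if it went through a helper like the following C sketch (illustrative only; the JIT inlines the check, and a failure throws an IndexOutOfRangeException rather than returning false):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of a bounds-checked element read: a single unsigned compare covers
   both "index < 0" and "index >= length", mirroring the cmp/jae pair the
   JIT emits before an array or span access. On failure the JIT's version
   jumps to CORINFO_HELP_RNGCHKFAIL; here we just report it. */
static bool try_get(const int32_t *data, int32_t length, int32_t index,
                    int32_t *value)
{
    if ((uint32_t)index >= (uint32_t)length)
        return false;
    *value = data[index];
    return true;
}
```

Keeping this shape in mind makes the cmp/jae pairs in the assembly listings below easy to spot.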
Let\u2019s look at the same benchmark we just looked at, summing all the elements of an integer array:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _array = new int[1000];<\/p>\n<p>    [Benchmark]<br \/>\n    public int Test()<br \/>\n    {<br \/>\n        int[] array = _array;<\/p>\n<p>        int sum = 0;<br \/>\n        for (int i = 0; i &lt; array.Length; i++)<br \/>\n        {<br \/>\n            sum += array[i];<br \/>\n        }<br \/>\n        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>That Test benchmark results in this assembly code on .NET 8:<\/p>\n<p>; Tests.Test()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rax,[rdi+8]<br \/>\n       xor       ecx,ecx<br \/>\n       xor       edx,edx<br \/>\n       mov       edi,[rax+8]<br \/>\n       test      edi,edi<br \/>\n       jle       short M00_L01<br \/>\nM00_L00:<br \/>\n       mov       esi,edx<br \/>\n       add       ecx,[rax+rsi*4+10]<br \/>\n       inc       edx<br \/>\n       cmp       edi,edx<br \/>\n       jg        short M00_L00<br \/>\nM00_L01:<br \/>\n       mov       eax,ecx<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 35<\/p>\n<p>The key part to pay attention to is the loop at M00_L00, for which the only branch is the one comparing edx (which is tracking i) to edi (which was earlier on initialized to the length of the array, [rax+8]) as part of knowing when it\u2019s done iterating. 
There\u2019s no additional check required to make this safe, as the JIT knows the loop started at 0 (and thus isn\u2019t walking off the beginning of the array) and the JIT knows iteration ends at the array length, which the JIT is already checking for, so it\u2019s safe to index into the array without additional checks.<\/p>\n<p>Now, let\u2019s tweak the benchmark ever so slightly. In the above, I was copying the _array field to a local array and then doing all accesses against that array; this is critical, because there\u2019s nothing else that could be changing that local out from under the loop. But if we instead change the code to refer to the field directly:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _array = new int[1000];<\/p>\n<p>    [Benchmark]<br \/>\n    public int Test()<br \/>\n    {<br \/>\n        int sum = 0;<br \/>\n        for (int i = 0; i &lt; _array.Length; i++)<br \/>\n        {<br \/>\n            sum += _array[i];<br \/>\n        }<br \/>\n        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>now we get this on .NET 8:<\/p>\n<p>; Tests.Test()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\n       xor       ecx,ecx<br \/>\n       mov       rdx,[rdi+8]<br \/>\n       cmp       dword ptr [rdx+8],0<br \/>\n       jle       short M00_L01<br \/>\n       nop       dword ptr [rax]<br \/>\n       nop       dword ptr [rax]<br \/>\nM00_L00:<br \/>\n       mov       rdi,rdx<br \/>\n       cmp       ecx,[rdi+8]<br \/>\n       jae       short M00_L02<br \/>\n       mov       
esi,ecx<br \/>\n       add       eax,[rdi+rsi*4+10]<br \/>\n       inc       ecx<br \/>\n       cmp       [rdx+8],ecx<br \/>\n       jg        short M00_L00<br \/>\nM00_L01:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L02:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 61<\/p>\n<p>That\u2019s a whole lot worse. Note how much that loop starting at M00_L00 has grown, and in particular note that instead of just having the one cmp\/jg pair at the end, there\u2019s another cmp\/jae pair in the middle, just before it accesses the array element. Since the code is reading from the field on every access, the JIT needs to accommodate the fact that the reference could change between any two accesses; thus, even though the JIT is comparing against _array.Length as part of the loop bounds, it also needs to ensure that the subsequent reference to _array[i] is still in bounds, since by then _array may be an entirely different object. That\u2019s a \u201cbounds check,\u201d which is obvious from the tell-tale sign that immediately after the cmp, there\u2019s a conditional jump to code that unconditionally calls CORINFO_HELP_RNGCHKFAIL; that\u2019s the helper function that\u2019s called to throw an IndexOutOfRangeException when you try to walk off the end of one of these data structures.<\/p>\n<p>Every release the JIT gets better at removing more and more bounds checks where it can prove they\u2019re superfluous. One of my favorite such improvements in .NET 9 is there on my favorites list because I\u2019ve historically expected the optimization to \u201cjust work\u201d, for various reasons it didn\u2019t, and now it does (it also shows up in a fair amount of real code, which is why I\u2019ve bumped up against it). 
In this benchmark, the function is handed an offset and a span, and its job is to sum all of the numbers from that offset to the end of the span.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int Test() =&gt; M(0, &quot;1234567890abcdefghijklmnopqrstuvwxyz&quot;);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    public static int M(int i, ReadOnlySpan&lt;char&gt; src)<br \/>\n    {<br \/>\n        int sum = 0;<\/p>\n<p>        while (true)<br \/>\n        {<br \/>\n            if ((uint)i &gt;= src.Length)<br \/>\n            {<br \/>\n                break;<br \/>\n            }<\/p>\n<p>            sum += src[i++];<br \/>\n        }<\/p>\n<p>        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>By casting i to uint as part of the comparison to src.Length, the JIT knows that i is in bounds of src by the time i is used to index into src, because if i were negative, the cast to uint would have made it larger than int.MaxValue and thus also larger than src.Length (which can\u2019t possibly be larger than int.MaxValue). 
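The arithmetic behind that trick is easy to convince yourself of. A small C sketch (illustrative, using C's equivalent unsigned reinterpretation of a 32-bit int):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reinterpreting a negative 32-bit int as unsigned yields a value above
   INT32_MAX, so it can never be less than a span length (which is at most
   INT32_MAX). One unsigned compare therefore checks both bounds at once. */
static bool in_bounds(int32_t i, int32_t length)
{
    return (uint32_t)i < (uint32_t)length;
}
```

This is exactly the predicate the benchmark's (uint)i &gt;= src.Length test is the negation of.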
The .NET 8 assembly shows the bounds check has been elided (note the lack of CORINFO_HELP_RNGCHKFAIL):<\/p>\n<p>; Tests.M(Int32, System.ReadOnlySpan`1&lt;Char&gt;)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\nM01_L00:<br \/>\n       cmp       edi,edx<br \/>\n       jae       short M01_L01<br \/>\n       lea       ecx,[rdi+1]<br \/>\n       mov       edi,edi<br \/>\n       movzx     edi,word ptr [rsi+rdi*2]<br \/>\n       add       eax,edi<br \/>\n       mov       edi,ecx<br \/>\n       jmp       short M01_L00<br \/>\nM01_L01:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 27<\/p>\n<p>But, this is a fairly awkward way to write such a condition. A more natural way would be to have that check as part of the loop condition:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int Test() =&gt; M(0, &quot;1234567890abcdefghijklmnopqrstuvwxyz&quot;);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    public static int M(int i, ReadOnlySpan&lt;char&gt; src)<br \/>\n    {<br \/>\n        int sum = 0;<\/p>\n<p>        for (; (uint)i &lt; src.Length; i++)<br \/>\n        {<br \/>\n            sum += src[i];<br \/>\n        }<\/p>\n<p>        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>Unfortunately, as a result of my code cleanup here to make the code more canonical, the JIT in .NET 8 fails to see that the bounds check can be elided\u2026 note the 
CORINFO_HELP_RNGCHKFAIL at the end:<\/p>\n<p>; Tests.M(Int32, System.ReadOnlySpan`1&lt;Char&gt;)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\n       cmp       edi,edx<br \/>\n       jae       short M01_L01<br \/>\nM01_L00:<br \/>\n       cmp       edi,edx<br \/>\n       jae       short M01_L02<br \/>\n       mov       ecx,edi<br \/>\n       movzx     ecx,word ptr [rsi+rcx*2]<br \/>\n       add       eax,ecx<br \/>\n       inc       edi<br \/>\n       cmp       edi,edx<br \/>\n       jb        short M01_L00<br \/>\nM01_L01:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L02:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 36<\/p>\n<p>But in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100777\">dotnet\/runtime#100777<\/a>, the JIT is better able to track the knowledge about guarantees made by the loop condition and is able to elide the bounds check on this variation as well.<\/p>\n<p>; Tests.M(Int32, System.ReadOnlySpan`1&lt;Char&gt;)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\n       cmp       edi,edx<br \/>\n       jae       short M01_L01<br \/>\n       mov       ecx,edi<br \/>\nM01_L00:<br \/>\n       movzx     edi,word ptr [rsi+rcx*2]<br \/>\n       add       eax,edi<br \/>\n       inc       ecx<br \/>\n       cmp       ecx,edx<br \/>\n       jb        short M01_L00<br \/>\nM01_L01:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 26<\/p>\n<p>Yay!<\/p>\n<p>Now consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br 
\/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(3)]<br \/>\n    public int Test(int i)<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;byte&gt; rva = [1, 2, 3, 5, 8, 13, 21, 34];<br \/>\n        return rva[7 &#8211; (i &amp; 7)];<br \/>\n    }<br \/>\n}<\/p>\n<p>The test method here has a span of data initialized in a way where the JIT is able to see how long it is. It\u2019s then indexing into the span, using the supplied index to read not from the start but from the end (the (i &amp; 7) is there to ensure the JIT can see that the value will always be in range); if it were reading from the start, this was already optimized, but from the end, the JIT hadn\u2019t previously been taught how to reason about the bounds checks. On .NET 8, it can\u2019t prove the access is always in-bounds, and we can see the bounds check in place:<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       push      rax<br \/>\n       and       esi,7<br \/>\n       mov       eax,esi<br \/>\n       neg       eax<br \/>\n       add       eax,7<br \/>\n       cmp       eax,8<br \/>\n       jae       short M00_L00<br \/>\n       mov       rcx,7FC98A741EC8<br \/>\n       movzx     eax,byte ptr [rax+rcx]<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 41<\/p>\n<p>But, now on .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96123\">dotnet\/runtime#96123<\/a>, the bounds check gets elided.<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       and       esi,7<br \/>\n       mov       eax,esi<br \/>\n       neg       eax<br \/>\n       add       eax,7<br \/>\n       mov       rcx,7F39B8724EC8<br \/>\n       movzx     eax,byte ptr [rax+rcx]<br \/>\n       ret<br \/>\n; Total bytes of code 25<\/p>\n<p>Here\u2019s another 
case. We\u2019re special-casing spans of lengths less than or equal to 1, returning string.Empty if the span is of length 0 or returning the first string if the span is of length 1:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public string? Test() =&gt; M([&quot;123&quot;]);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static string? M(ReadOnlySpan&lt;string&gt; values)<br \/>\n    {<br \/>\n        if (values.Length &lt;= 1)<br \/>\n        {<br \/>\n            return values.Length == 0 ?<br \/>\n                string.Empty :<br \/>\n                values[0];<br \/>\n        }<\/p>\n<p>        return null;<br \/>\n    }<br \/>\n}<\/p>\n<p>You and I can see that the access to values[0] will always succeed, but on .NET 8 we get this:<\/p>\n<p>; Tests.M(System.ReadOnlySpan`1&lt;System.String&gt;)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,1<br \/>\n       jg        short M01_L01<br \/>\n       test      esi,esi<br \/>\n       je        short M01_L00<br \/>\n       test      esi,esi<br \/>\n       je        short M01_L02<br \/>\n       mov       rax,[rdi]<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L00:<br \/>\n       mov       rax,7FB62147C008<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L01:<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L02:<br \/>\n       call      
CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 44<\/p>\n<p>The JIT keeps track of what it knows about the lengths of various things, what conditions it\u2019s proved, but here it\u2019s lost track of the fact that, for the else branch of the ternary, values is guaranteed to be of length 1, and thus indexing at index 0 is safe. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101323\">dotnet\/runtime#101323<\/a> improves the JIT\u2019s range tracking ability, such that on .NET 9, the bounds check is successfully elided:<\/p>\n<p>; Tests.M(System.ReadOnlySpan`1&lt;System.String&gt;)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,1<br \/>\n       jg        short M01_L01<br \/>\n       test      esi,esi<br \/>\n       je        short M01_L00<br \/>\n       mov       rax,[rdi]<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L00:<br \/>\n       mov       rax,7F5700FB1008<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L01:<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 34<\/p>\n<p>Most if not all of these bounds check elimination improvements come about because someone is optimizing something and sees a bounds check that could have been eliminated but wasn\u2019t. In the case that inspired the improvement in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101352\">dotnet\/runtime#101352<\/a>, that someone was me, while working on improving Enum for .NET 8. 
Enums can be backed by various numerical types, including ulong, and there\u2019s a code path in Enum.GetName that\u2019s effectively this:<\/p>\n<p>if (ulongValue &lt; (ulong)names.Length)<br \/>\n{<br \/>\n    return names[(uint)ulongValue];<br \/>\n}<\/p>\n<p>That bounds check wasn\u2019t previously being removed, but now in .NET 9, it is:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private readonly string[] _names = Enum.GetNames&lt;MyEnum&gt;();<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(2)]<br \/>\n    public string? GetNameOrNull(ulong ulValue)<br \/>\n    {<br \/>\n        string[] names = _names;<br \/>\n        return ulValue &lt; (ulong)names.Length ?<br \/>\n            names[(uint)ulValue] :<br \/>\n            null;<br \/>\n    }<\/p>\n<p>    public enum MyEnum : ulong { A, B, C, D }<br \/>\n}<br \/>\n\/\/ .NET 8<br \/>\n; Tests.GetNameOrNull(UInt64)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       rax,[rdi+8]<br \/>\n       mov       ecx,[rax+8]<br \/>\n       mov       edx,ecx<br \/>\n       cmp       rdx,rsi<br \/>\n       jbe       short M00_L00<br \/>\n       cmp       esi,ecx<br \/>\n       jae       short M00_L01<br \/>\n       mov       ecx,esi<br \/>\n       mov       rax,[rax+rcx*8+10]<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 41<\/p>\n<p>\/\/ .NET 9<br 
\/>\n; Tests.GetNameOrNull(UInt64)<br \/>\n       mov       rax,[rdi+8]<br \/>\n       mov       ecx,[rax+8]<br \/>\n       cmp       rcx,rsi<br \/>\n       jbe       short M00_L00<br \/>\n       mov       ecx,esi<br \/>\n       mov       rax,[rax+rcx*8+10]<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       xor       eax,eax<br \/>\n       ret<br \/>\n; Total bytes of code 23<\/p>\n<p>Sometimes eliding bounds checks is about learning new tricks; other times, it\u2019s about fixing old ones. Consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static ReadOnlySpan&lt;int&gt; Lookup =&gt; [1, 2, 3, 5, 8, 13, 21];<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(3)]<br \/>\n    public int Test1(int i) =&gt; (uint)i &lt; 7 ? Lookup[i] : -1;<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(3)]<br \/>\n    public int Test2(int i) =&gt; (uint)i &lt;= 6 ? Lookup[i] : -1;<br \/>\n}<\/p>\n<p>Test1 and Test2 are effectively the same thing, both guarding a lookup table by a known length and only accessing the table if we know the index to be in bounds. The bounds check will then be elided by the JIT in both cases, right? Wrong. 
On .NET 8, we get this:<\/p>\n<p>; Tests.Test1(Int32)<br \/>\n       cmp       esi,7<br \/>\n       jae       short M00_L00<br \/>\n       mov       eax,esi<br \/>\n       mov       rcx,7F6D40064030<br \/>\n       mov       eax,[rcx+rax*4]<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,0FFFFFFFF<br \/>\n       ret<br \/>\n; Total bytes of code 27<\/p>\n<p>; Tests.Test2(Int32)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,6<br \/>\n       ja        short M00_L00<br \/>\n       cmp       esi,7<br \/>\n       jae       short M00_L01<br \/>\n       mov       eax,esi<br \/>\n       mov       rcx,7F8D11621030<br \/>\n       mov       eax,[rcx+rax*4]<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,0FFFFFFFF<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 44<\/p>\n<p>Note the bounds check in Test2. 
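The two guards really do admit exactly the same indices, as a quick exhaustive check in C confirms (a sanity check written for this post's argument, not something the benchmarks run); the difference above is purely one of pattern recognition in the JIT:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The guard from Test1 and the guard from Test2, side by side. Over
   unsigned values, x < 7 and x <= 6 are the same predicate. */
static bool guard_lt(int32_t i) { return (uint32_t)i < 7u; }
static bool guard_le(int32_t i) { return (uint32_t)i <= 6u; }
```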
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97908\">dotnet\/runtime#97908<\/a> fixes this, such that on .NET 9 Test2 now successfully elides the bounds check as well:<\/p>\n<p>; Tests.Test1(Int32)<br \/>\n       cmp       esi,7<br \/>\n       jae       short M00_L00<br \/>\n       mov       eax,esi<br \/>\n       mov       rcx,7F5B9DC5E030<br \/>\n       mov       eax,[rcx+rax*4]<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,0FFFFFFFF<br \/>\n       ret<br \/>\n; Total bytes of code 27<\/p>\n<p>; Tests.Test2(Int32)<br \/>\n       cmp       esi,6<br \/>\n       ja        short M00_L00<br \/>\n       mov       eax,esi<br \/>\n       mov       rcx,7F7FDE2C9030<br \/>\n       mov       eax,[rcx+rax*4]<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,0FFFFFFFF<br \/>\n       ret<br \/>\n; Total bytes of code 27<\/p>\n<p>Interestingly, sometimes even if we can\u2019t elide a bounds check, we can learn things from the fact that one occurred, and then use that knowledge to optimize subsequent things. Consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private readonly int[] _x = new int[10];<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(2)]<br \/>\n    public int Add(int y) =&gt; _x[y] + (y % 8);<br \/>\n}<\/p>\n<p>There\u2019s nothing the JIT can do here to elide the bounds check on _x[y]; it has no information about the value of y or the length of _x. 
As such, as shown in the .NET 8 assembly here, we see a bounds check:<\/p>\n<p>; Tests.Add(Int32)<br \/>\n       push      rax<br \/>\n       mov       rax,[rdi+8]<br \/>\n       cmp       esi,[rax+8]<br \/>\n       jae       short M00_L00<br \/>\n       mov       ecx,esi<br \/>\n       mov       edx,esi<br \/>\n       sar       edx,1F<br \/>\n       and       edx,7<br \/>\n       add       edx,esi<br \/>\n       and       edx,0FFFFFFF8<br \/>\n       mov       edi,esi<br \/>\n       sub       edi,edx<br \/>\n       add       edi,[rax+rcx*4+10]<br \/>\n       mov       eax,edi<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 46<\/p>\n<p>However, all is not lost. After indexing into the array, we proceed to use y as the numerator of a % operation. C#\u2019s % operator supports both int and uint numerators, but it has to do a little more work for int in case the value is negative. However, by the time we get to that % operation, we <em>know<\/em> that y is not negative, as if it were negative, the _x[y] would have thrown and we\u2019d never end up here. 
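The payoff of that knowledge is that the remainder can collapse to a single AND: for non-negative values the two forms agree, and negative values are exactly what the extra sar/and/add/sub sequence exists to handle. A small C sketch (C's % truncates toward zero, like C#'s):

```c
#include <assert.h>
#include <stdint.h>

/* For non-negative y, y % 8 equals y & 7, so a compiler that can prove
   y >= 0 may emit a single AND. The forms differ for negative y, because
   truncating division makes the remainder negative (e.g. -3 % 8 == -3),
   while -3 & 7 keeps only the low bits and yields 5. */
static int32_t mod8(int32_t y) { return y % 8; }
static int32_t and7(int32_t y) { return y & 7; }
```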
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102089\">dotnet\/runtime#102089<\/a> teaches the JIT how to learn such non-negative information from such bounds checks, such that in .NET 9, we get code generation equivalent to if we\u2019d explicitly cast y to uint.<\/p>\n<p>; Tests.Add(Int32)<br \/>\n       push      rax<br \/>\n       mov       rax,[rdi+8]<br \/>\n       cmp       esi,[rax+8]<br \/>\n       jae       short M00_L00<br \/>\n       mov       ecx,esi<br \/>\n       and       esi,7<br \/>\n       add       esi,[rax+rcx*4+10]<br \/>\n       mov       eax,esi<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 32<\/p>\n<h3>Arm64<\/h3>\n<p>Making .NET on Arm an awesome and fast experience has been a critical, multi-year investment. You can read about it in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/Arm64-performance-in-net-5\/\">Arm64 Performance Improvements in .NET 5<\/a>, <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/Arm64-performance-improvements-in-dotnet-7\/\">Arm64 Performance Improvements in .NET 7<\/a>, and <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/this-Arm64-performance-in-dotnet-8\/\">Arm64 Performance Improvements in .NET 8<\/a>. And things continue to improve even further in .NET 9. Here are some examples:<\/p>\n<p><strong>Better barriers.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91553\">dotnet\/runtime#91553<\/a> implements volatile writes via using the stlur (Store-Release Register) instruction rather than a dmb (Data Memory Barrier) \/ str (Store) pair of instructions (stlur is generally cheaper). Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101359\">dotnet\/runtime#101359<\/a> eliminates full memory barriers when dealing with volatile reads and writes on floats. 
For example, code that would previously have produced a ldr (Load Register) \/ dmb pair may now produce a ldar (Load-Acquire Register) \/ fmov (Floating-point Move) pair.<br \/>\n<strong>Better switches.<\/strong> Depending on the shape of a switch statement, the C# compiler may generate a variety of IL patterns, one of which is to use a switch IL instruction. Normally for a switch IL instruction, the JIT will generate a jump table, but for some forms, it has an optimization to instead rely on a bit test. Thus far this optimization only existed for x86\/64, with the bt (Bit Test) instruction. Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91811\">dotnet\/runtime#91811<\/a>, it also exists for Arm, with the tbz (Test bit and Branch if Zero) instruction.<br \/>\n<strong>Better conditionals.<\/strong> Arm has conditional instructions that logically contain a branch albeit without any branching, e.g. <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">Performance Improvements in .NET 8<\/a> talked about the csel (Conditional Select) instruction that \u201cconditionally selects\u201d a value from one of two registers based on some condition. Another such instruction is csinc (Conditional Select Increment), which conditionally selects either the value from one register or the value from another register incremented by one. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91262\">dotnet\/runtime#91262<\/a> from <a href=\"https:\/\/github.com\/c272\">@c272<\/a> enables the JIT to utilize csinc, so that a statement like x = condition ? x + 1 : y; will be able to compile down to a csinc rather than to a branching construct. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92810\">dotnet\/runtime#92810<\/a> also improves the custom comparison operation the JIT emits for some SequenceEqual operations (e.g. 
&#8220;hello, there&#8221;u8.SequenceEqual(spanOfBytes)) to be able to use ccmp (Conditional Compare).<br \/>\n<strong>Better multiplies.<\/strong> Arm has single instructions that represent doing a multiply followed by an addition, subtraction, or negation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91886\">dotnet\/runtime#91886<\/a> from <a href=\"https:\/\/github.com\/c272\">@c272<\/a> finds such sequences of multiplies followed by one of those operations and consolidates them to use the single combined instruction.<br \/>\n<strong>Better loads.<\/strong> Arm has instructions for loading a value from memory into a single register, but it also has instructions for loading multiple values into multiple registers. When the JIT emits a customized memory copy (such as for byteArray.AsSpan(0, 32).SequenceEqual(otherByteArray)), it may emit multiple ldr instructions for loading a value into a register. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92704\">dotnet\/runtime#92704<\/a> enables consolidating pairs of those into ldp (Load Pair of Registers) instructions, which load two values into two registers.<\/p>\n<h3>ARM SVE<\/h3>\n<p>Bringing up a new instruction set is a huge deal and a huge undertaking. I\u2019ve mentioned in the past my process for gearing up to write one of these \u201cPerformance Improvements in .NET X\u201d posts, including that throughout the year I keep a running list of the PRs I might want to talk about when it comes time to actually put pen to paper. Just for \u201cSVE\u201d, I found myself with over 200 links. 
I\u2019m not going to bore you with such a laundry list; if you\u2019re interested, you can search for <a href=\"https:\/\/github.com\/dotnet\/runtime\/pulls?q=is%3Apr+SVE+merged%3A2023-10-01..2024-08-31+\">SVE PRs<\/a>, which includes PRs from <a href=\"https:\/\/github.com\/a74nh\">@a74nh<\/a>, from <a href=\"https:\/\/github.com\/ebepho\">@ebepho<\/a>, from <a href=\"https:\/\/github.com\/mikabl-arm\">@mikabl-arm<\/a>, from <a href=\"https:\/\/github.com\/snickolls-arm\">@snickolls-arm<\/a>, and from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a>. But, we can still talk a bit about what it is and what it means for .NET.<\/p>\n<p>Single instruction, multiple data (SIMD) is a kind of parallel processing where one instruction performs the same operation on multiple pieces of data at the same time, rather than one instruction manipulating just a single piece of data. For example, the add instruction on x86\/64 can add together one pair of 32-bit integers, whereas the paddd (Add Packed Doubleword Integers) instruction that\u2019s part of Intel\u2019s SSE2 (Streaming SIMD Extensions 2) instruction set operates on a pair of xmm registers that can each store four 32-bit integer values at once. Many such instructions have been added to many different hardware platforms over the years, coming in groups referred to as instruction set architectures (ISA), where an ISA defines what the instructions are, what registers they interact with, how memory is accessed, and so on. Even if you\u2019re not steeped in this stuff, you\u2019ve likely heard names of these ISAs mentioned, like Intel\u2019s SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions), or Arm\u2019s Advanced SIMD (also known as Neon). In general, the instructions in all of these ISAs operate on a fixed number of values of a fixed size, e.g. the paddd previously mentioned only works with 128-bits at a time, no more, no less. 
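As a concrete (non-.NET) model of that paddd example, here's a minimal Python sketch of four independent 32-bit lane additions happening as one operation:

```python
MASK32 = (1 << 32) - 1

def paddd(a, b):
    # Simulate SSE2's paddd: add four independent 32-bit lanes,
    # each wrapping modulo 2**32, as a single operation.
    assert len(a) == len(b) == 4  # one 128-bit xmm register = four lanes
    return [(x + y) & MASK32 for x, y in zip(a, b)]

# The last lane wraps around rather than carrying into its neighbor:
assert paddd([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1]) == [11, 22, 33, 0]
```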
Different instructions exist for 256 bits at a time or 512 bits at a time.<\/p>\n<p>SVE, or \u201cScalable Vector Extensions,\u201d is an ISA from Arm that\u2019s a bit different. The instructions in SVE don\u2019t operate on a fixed size. Rather, the specification allows for them to operate on sizes from 128 bits up to 2048 bits, and the specific hardware can choose which size to use (allowed sizes are multiples of 128, and with SVE 2 further constrained to be powers of 2). The same assembly code using these instructions might operate on 128 bits at a time on one piece of hardware and 256 bits at a time on another piece of hardware.<\/p>\n<p>There are multiple ways such an ISA impacts .NET, and in particular the JIT. The JIT needs to be able to work with the ISA, understand the associated registers and be able to do register allocation, be taught about encoding and emitting the instructions, and so on. The JIT needs to be taught when and where it\u2019s appropriate to use these instructions, so that as part of compiling IL down to assembly, if operating on a machine that supports SVE, the JIT might be able to pick SVE instructions for use in the generated assembly. And the JIT needs to be taught how to represent this data, these vectors, to user code. All of that is a huge amount of work, especially when you consider that there are thousands of operations represented. What makes it even more work is hardware intrinsics.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/hardware-intrinsics-in-net-core\/\">Hardware intrinsics<\/a> are a feature of .NET where, effectively, each of these instructions shows up as its own dedicated .NET method, such as <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/system.runtime.intrinsics.x86.sse2.add?view=net-8.0#system-runtime-intrinsics-x86-sse2-add\">Sse2.Add<\/a>, and the JIT emits use of that method as the underlying instruction to which it maps. 
If you look at <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/30eaaf2415b8facf0ef3180c005e27132e334611\/src\/libraries\/System.Private.CoreLib\/src\/System\/Runtime\/Intrinsics\/Arm\/Sve.cs\">Sve.cs<\/a> in dotnet\/runtime, you\u2019ll see the System.Runtime.Intrinsics.Arm.Sve type, which already exposes more than 1400 public methods (that number is not a typo).<\/p>\n<p>Two interesting things to notice if you open that file (beyond its sheer length):<\/p>\n<p><strong>The use of Vector&lt;T&gt;.<\/strong> .NET\u2019s foray into SIMD <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/the-jit-finally-proposed-jit-and-simd-are-getting-married\/\">started in 2014<\/a> and was accompanied by the Vector&lt;T&gt; type. Vector&lt;T&gt; represents a single vector (list) of the T numeric type. To provide a platform-agnostic representation, since different platforms were capable of different vector widths, Vector&lt;T&gt; was defined to be variable in size, so for example on x86\/x64 hardware that supported AVX2, Vector&lt;T&gt; might be 256 bits wide, whereas on an Arm machine that supported Neon, Vector&lt;T&gt; might be 128 bits wide. If the hardware supported both 128 bits and 256 bits, Vector&lt;T&gt; would map to the larger. Since the introduction of Vector&lt;T&gt;, various fixed-width vector types have been introduced, like Vector64&lt;T&gt;, Vector128&lt;T&gt;, Vector256&lt;T&gt;, and Vector512&lt;T&gt;, and the hardware intrinsics for most of the other ISAs are all in terms of those fixed-width vector sizes, since the instructions themselves are fixed width. But SVE is not; its instructions might be 128 bits here and 512 bits there, thus it\u2019s not possible to use those same fixed-width vector types in the Sve definition\u2026 but it makes a lot of sense to use the variable-width Vector&lt;T&gt;. 
What\u2019s old is new again.<br \/>\n<strong>The Sve class is tagged as [Experimental].<\/strong> The <a href=\"https:\/\/learn.microsoft.com\/dotnet\/api\/system.diagnostics.codeanalysis.experimentalattribute?view=net-8.0\">[Experimental]<\/a> attribute was introduced in .NET 8 and C# 12. The intent is it can be used to indicate that some functionality in an otherwise stable assembly is not yet stable and may change in the future. If code tries to use such a member, by default the C# compiler will issue an error telling the developer they\u2019re using something that could break in the future. As long as the developer is willing to accept such breaking change risk, they can then suppress the error. Designing and enabling the SVE support is a monstrous, multi-year effort, and while the support is functional and folks are encouraged to take it for a spin, it\u2019s not yet baked enough for us to be 100% confident the shape won\u2019t need to evolve (for .NET 9, it\u2019s also restricted to hardware with a vector width of 128 bits, but that restriction will be removed subsequently). Hence, [Experimental].<\/p>\n<h3>AVX10.1<\/h3>\n<p>Even with the size of the SVE effort, it\u2019s not the only new ISA available in .NET 9. Thanks in large part to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99784\">dotnet\/runtime#99784<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101938\">dotnet\/runtime#101938<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a>, .NET 9 now also supports AVX10.1 (AVX10 version 1). 
AVX10.1 provides everything AVX512 provides, all of the base support, the updated encodings, support for embedded broadcasts, masking, and so on, but it only requires 256-bit support in the hardware (with 512-bits being optional, whereas AVX512 requires 512-bit support), and it does so in a much less incremental manner (AVX512 has multiple instruction sets like \u201cF\u201d, \u201cDQ\u201d, \u201cVbmi\u201d, etc.). That\u2019s modeled in the .NET APIs as well, where you can check Avx10v1.IsSupported as well as Avx10v1.V512.IsSupported, both of which govern more than 500 new APIs available for consumption. (Note that at the time of this writing, there aren\u2019t actually any chips on the market that support AVX10.1, but they\u2019re expected in the foreseeable future.)<\/p>\n<h3>AVX512<\/h3>\n<p>On the subject of ISAs, it\u2019s worth mentioning AVX512. .NET 8 added broad support for AVX512, including support in the JIT and employment of it throughout the libraries. Both of those improve further in .NET 9. We\u2019ll talk more about places it\u2019s better used in the libraries later. For now, here are some JIT-specific improvements.<\/p>\n<p>One of the things the JIT needs to generate code for is zeroing, e.g. by default all locals in a method need to be set to zero, and even if [SkipLocalsInit] is employed, references still need to be zeroed (otherwise, when the GC does a pass through all of the locals looking for references to objects to see what\u2019s no longer referenced, it could see the references as being whatever garbage happened to be in that location in memory and end up making bad choices). Such zeroing of locals is overhead that occurs on every invocation of that method, so obviously it\u2019s valuable for that to be as efficient as possible. 
Rather than zeroing out each word with a single instruction, if the current hardware supports the appropriate SIMD instructions, the JIT can instead emit code to use those instructions, so that it can zero out more per instruction. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91166\">dotnet\/runtime#91166<\/a>, it\u2019s now able to use AVX512 instructions if available to zero out 512 bits per instruction, rather than \u201conly\u201d 256 bits or 128 bits using other ISAs. As an example, here\u2019s a benchmark that needs to zero out 256 bytes:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<br \/>\nusing System.Runtime.InteropServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic unsafe class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public void Sum()<br \/>\n    {<br \/>\n        Bytes values;<br \/>\n        Nop(&amp;values);<br \/>\n    }<\/p>\n<p>    [SkipLocalsInit]<br \/>\n    [Benchmark]<br \/>\n    public void SumSkipLocalsInit()<br \/>\n    {<br \/>\n        Bytes values;<br \/>\n        Nop(&amp;values);<br \/>\n    }<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static void Nop(Bytes* value) { }<\/p>\n<p>    [StructLayout(LayoutKind.Sequential, Size = 256)]<br \/>\n    private struct Bytes { }<br \/>\n}<\/p>\n<p>Here\u2019s the assembly for Sum on .NET 8:<\/p>\n<p>; Tests.Sum()<br \/>\n       sub       rsp,108<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsp+8],rax<br \/>\n       vxorps    xmm8,xmm8,xmm8<br \/>\n       mov       rax,0FFFFFFFFFFFFFF10<br \/>\nM00_L00:<br \/>\n       vmovdqa   xmmword 
ptr [rsp+rax+100],xmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+rax+110],xmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+rax+120],xmm8<br \/>\n       add       rax,30<br \/>\n       jne       short M00_L00<br \/>\n       mov       [rsp+100],rax<br \/>\n       lea       rdi,[rsp+8]<br \/>\n       call      qword ptr [7F6B56B85CB0]; Tests.Nop(Bytes*)<br \/>\n       nop<br \/>\n       add       rsp,108<br \/>\n       ret<br \/>\n; Total bytes of code 90<\/p>\n<p>This is on a machine with AVX512 hardware support, but we can see the zeroing is happening using a loop (M00_L00 through to the jne that jumps back to it), as with only 256-bit instructions, this was deemed by the JIT\u2019s heuristics too large to unroll completely. Now, here\u2019s .NET 9:<\/p>\n<p>; Tests.Sum()<br \/>\n       sub       rsp,108<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsp+8],rax<br \/>\n       vxorps    xmm8,xmm8,xmm8<br \/>\n       vmovdqu32 [rsp+10],zmm8<br \/>\n       vmovdqu32 [rsp+50],zmm8<br \/>\n       vmovdqu32 [rsp+90],zmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+0D0],xmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+0E0],xmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+0F0],xmm8<br \/>\n       mov       [rsp+100],rax<br \/>\n       lea       rdi,[rsp+8]<br \/>\n       call      qword ptr [7F4D3D3A44C8]; Tests.Nop(Bytes*)<br \/>\n       nop<br \/>\n       add       rsp,108<br \/>\n       ret<br \/>\n; Total bytes of code 107<\/p>\n<p>Now there\u2019s no loop, because vmovdqu32 (Move unaligned packed doubleword integer values) can be used to zero twice as much at a time (64 bytes) as vmovdqa (Move aligned packed integer values), and thus the zeroing can be done in a number of instructions small enough that the JIT considers full unrolling reasonable.<\/p>\n<p>Zeroing also shows up elsewhere, such as when initializing structs. Those have also previously employed SIMD instructions where relevant, e.g. 
this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public MyStruct Init() =&gt; new();<\/p>\n<p>    public struct MyStruct<br \/>\n    {<br \/>\n        public Int128 A, B, C, D;<br \/>\n    }<br \/>\n}<\/p>\n<p>produces this assembly today on .NET 8:<\/p>\n<p>; Tests.Init()<br \/>\n       vzeroupper<br \/>\n       vxorps    ymm0,ymm0,ymm0<br \/>\n       vmovdqu32 [rsi],zmm0<br \/>\n       mov       rax,rsi<br \/>\n       ret<br \/>\n; Total bytes of code 17<\/p>\n<p>But, if we tweak MyStruct to add a field of a reference type anywhere in the struct (e.g. add public string Oops; as the first line of the struct above), it knocks the initialization off this optimized path, and we end up with initialization like this on .NET 8:<\/p>\n<p>; Tests.Init()<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsi],rax<br \/>\n       mov       [rsi+8],rax<br \/>\n       mov       [rsi+10],rax<br \/>\n       mov       [rsi+18],rax<br \/>\n       mov       [rsi+20],rax<br \/>\n       mov       [rsi+28],rax<br \/>\n       mov       [rsi+30],rax<br \/>\n       mov       [rsi+38],rax<br \/>\n       mov       [rsi+40],rax<br \/>\n       mov       [rsi+48],rax<br \/>\n       mov       rax,rsi<br \/>\n       ret<br \/>\n; Total bytes of code 45<\/p>\n<p>This is due to alignment requirements in order to provide necessary atomicity guarantees. 
But rather than giving up wholesale, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102132\">dotnet\/runtime#102132<\/a> allows the SIMD zeroing to be used for the contiguous portions that don\u2019t contain GC references, so now on .NET 9 we get this:<\/p>\n<p>; Tests.Init()<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsi],rax<br \/>\n       vxorps    xmm0,xmm0,xmm0<br \/>\n       vmovdqu32 [rsi+8],zmm0<br \/>\n       mov       [rsi+48],rax<br \/>\n       mov       rax,rsi<br \/>\n       ret<br \/>\n; Total bytes of code 27<\/p>\n<p>This optimization isn\u2019t specific to AVX512, but it includes the ability to use AVX512 instructions when available. (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99140\">dotnet\/runtime#99140<\/a> provides similar support for Arm64.)<\/p>\n<p>Other optimizations improve the JIT\u2019s ability to select AVX512 instructions as part of generating code. One neat example of this is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91227\">dotnet\/runtime#91227<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a>, which utilizes the cool vpternlog (Bitwise Ternary Logic) instruction. Imagine you have three bools (a, b, and c), and you want to perform a series of Boolean operations on them, e.g. a ? (b ^ c) : (b &amp; c). If you were to naively compile that down, you\u2019d end up with branches. We could make it branchless by distributing the a to both sides of the ternary, e.g. (a &amp; (b ^ c)) | (!a &amp; (b &amp; c)), but now we\u2019ve gone from one branch and one Boolean operation to six Boolean operations. What if instead we could do all of that in a single instruction <em>and<\/em> do it for all of the lanes in a vector at the same time so it could be applied to multiple values as part of a SIMD operation? That\u2019d be cool, right? That\u2019s what vpternlog enables. 
Try running this:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0<\/p>\n<p>internal class Program<br \/>\n{<br \/>\n    private static bool Exp(bool a, bool b, bool c) =&gt; (a &amp; (b ^ c)) | (!a &amp; b &amp; c);<\/p>\n<p>    private static void Main()<br \/>\n    {<br \/>\n        Console.WriteLine(&#8220;a b c result&#8221;);<br \/>\n        Console.WriteLine(&#8220;&#8212;&#8212;&#8212;&#8212;&#8220;);<br \/>\n        int control = 0;<br \/>\n        foreach (var (a, b, c, result) in from a in new[] { true, false }<br \/>\n                                          from b in new[] { true, false }<br \/>\n                                          from c in new[] { true, false }<br \/>\n                                          select (a, b, c, Exp(a, b, c)))<br \/>\n        {<br \/>\n            Console.WriteLine($&#8221;{Convert.ToInt32(a)} {Convert.ToInt32(b)} {Convert.ToInt32(c)} {Convert.ToInt32(result)}&#8221;);<br \/>\n            control = control &lt;&lt; 1 | Convert.ToInt32(result);<br \/>\n        }<br \/>\n        Console.WriteLine(&#8220;&#8212;&#8212;&#8212;&#8212;&#8220;);<br \/>\n        Console.WriteLine($&#8221;Control: {control:b8} == 0x{control:X2}&#8221;);<br \/>\n    }<br \/>\n}<\/p>\n<p>Here we\u2019ve put our Boolean operation into an Exp function, which is then being invoked for all 8 possible combinations of inputs (each of the three bools each having two possible values). We\u2019re then printing out the resulting \u201ctruth table,\u201d that details the Boolean output for each possible input. 
With this particular Boolean expression, that yields this truth table being output:<\/p>\n<p>a b c result<br \/>\n&#8212;&#8212;&#8212;&#8212;<br \/>\n1 1 1 0<br \/>\n1 1 0 1<br \/>\n1 0 1 1<br \/>\n1 0 0 0<br \/>\n0 1 1 1<br \/>\n0 1 0 0<br \/>\n0 0 1 0<br \/>\n0 0 0 0<br \/>\n&#8212;&#8212;&#8212;&#8212;<\/p>\n<p>We then take that last result column and we treat it as a binary number:<\/p>\n<p>Control: 01101000 == 0x68<\/p>\n<p>So the values are 0 1 1 0 1 0 0 0, which we read as the binary 0b01101000, which is 0x68. That byte is used as a \u201ccontrol code\u201d to the vpternlog instruction to encode which of the 256 possible truth tables that exist for any possible (deterministic) Boolean combination of those inputs is being chosen. This PR then teaches the JIT how to analyze the tree structures produced by the JIT to recognize such sequences of Boolean operations, compute the control code, and substitute in the use of the better instruction. Of course, the JIT isn\u2019t going to do the enumeration I did above; turns out there\u2019s a more efficient way to compute the control code, performing the same sequence of operations but on specific byte values instead of Booleans, e.g. this:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0<\/p>\n<p>Console.WriteLine($&#8221;0x{Exp(0xF0, 0xCC, 0xAA):X2}&#8221;);<br \/>\nstatic int Exp(int a, int b, int c) =&gt; (a &amp; (b ^ c)) | (~a &amp; b &amp; c);<\/p>\n<p>also yields:<\/p>\n<p>0x68<\/p>\n<p>Why those specific three values of 0xF0, 0xCC, and 0xAA? Let\u2019s expand them from hex to binary: 0b11110000, 0b11001100, 0b10101010. Look familiar? They\u2019re the columns for a, b, and c in the earlier truth table, so we\u2019re really just running this expression over each of the 8 rows in the table at the same time. 
Fun.<\/p>\n<p>Another neat example is in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92017\">dotnet\/runtime#92017<\/a> from <a href=\"https:\/\/github.com\/Ruihan-Yin\">@Ruihan-Yin<\/a>, which optimizes 512-bit vector constants via broadcast. \u201cbroadcast\u201d is a fancy way of saying \u201creplicate,\u201d or \u201ccopy to each.\u201d The instruction is used to take a single value and duplicate it to be used for each element of a vector. If, for example, I write:<\/p>\n<p>Vector512&lt;int&gt; vector = Vector512.Create(42);<\/p>\n<p>that\u2019s broadcasting the single value 42, replicating it 16 times to fill up the 512-bit vector. Now imagine I have the following C# code, which is creating a Vector512&lt;byte&gt; composed of the byte sequence for the hex digits, but manually replicated four times, to fill up the 64 bytes that compose a 512-bit vector.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.Intrinsics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public Vector512&lt;byte&gt; HexLookupTable() =&gt;<br \/>\n        Vector512.Create(&#8220;0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef&#8221;u8);<br \/>\n}<\/p>\n<p>This would result in that whole byte sequence being stored in the assembly data section, and then the JIT would emit the code to load that data into the appropriate registers; no broadcasting. 
But instead, the JIT should be able to recognize that this is actually the same 16-byte sequence repeated four times, store the sequence once, and then use a broadcast to load and replicate that value to fill out the vector. With this PR, that\u2019s exactly what happens.<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.HexLookupTable()<br \/>\n       push      rax<br \/>\n       vzeroupper<br \/>\n       vmovups   zmm0,[7FB205399700]<br \/>\n       vmovups   [rsi],zmm0<br \/>\n       mov       rax,rsi<br \/>\n       vzeroupper<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 31<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.HexLookupTable()<br \/>\n       push      rax<br \/>\n       vbroadcasti32x4 zmm0,xmmword ptr [7F78F75290F0]<br \/>\n       vmovups   [rsi],zmm0<br \/>\n       mov       rax,rsi<br \/>\n       vzeroupper<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 28<\/p>\n<p>This is beneficial for a variety of reasons, including less data to store, less data to load, and if the register containing this state needed to be spilled (meaning something else needs to be put into the register, so the value currently in the register is temporarily stored in memory), reloading it is similarly cheaper.<\/p>\n<p>Two of the more far-reaching changes related to AVX512, though, come from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97675\">dotnet\/runtime#97675<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101886\">dotnet\/runtime#101886<\/a>, which do the work to enable the JIT to utilize AVX512 \u201cembedded masking.\u201d Masking is a commonly needed solution when writing SIMD code; anywhere you see a ConditionalSelect, that\u2019s masking. Consider again a ternary operation, e.g. a ? (b + c) : (b &#8211; c). Here, a would be considered the \u201cmask\u201d: anywhere it\u2019s true, the value of b + c is used, and anywhere it\u2019s false, the value of b &#8211; c is used. 
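The per-lane blend that ConditionalSelect describes can be sketched in scalar Python (an illustrative model, assuming 8-bit lanes where each mask lane is all-ones or all-zeros, as a vector comparison would produce):

```python
def conditional_select(mask, if_true, if_false):
    # Branchless per-lane blend: (m & t) | (~m & f), the same Boolean
    # shape the vpternlogd 0xCA control code encodes.
    return [((m & t) | (~m & f)) & 0xFF
            for m, t, f in zip(mask, if_true, if_false)]

# Lanes where the mask is all-ones take if_true; the rest take if_false:
assert conditional_select([0xFF, 0x00, 0xFF], [1, 2, 3], [9, 9, 9]) == [1, 9, 3]
```

Note that (m & t) | (~m & f) is exactly the a ? b : c truth table, which is why 0xCA shows up as the control code in the assembly below.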
If each of these were Vector512&lt;byte&gt;, for example, it would look like this in C#:<\/p>\n<p>public static Vector512&lt;byte&gt; Exp(Vector512&lt;byte&gt; a, Vector512&lt;byte&gt; b, Vector512&lt;byte&gt; c) =&gt;<br \/>\n    Vector512.ConditionalSelect(a, b + c, b &#8211; c);<\/p>\n<p>And guess what I\u2019d get for assembly? You guessed it, our good friend vpternlogd:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<br \/>\nusing System.Runtime.Intrinsics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public Vector512&lt;byte&gt; Test() =&gt; Exp(default, default, default);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    public static Vector512&lt;byte&gt; Exp(Vector512&lt;byte&gt; a, Vector512&lt;byte&gt; b, Vector512&lt;byte&gt; c) =&gt;<br \/>\n        Vector512.ConditionalSelect(a, b + c, b &#8211; c);<br \/>\n}<br \/>\n; Tests.Exp(System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;, System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;, System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;)<br \/>\n       vzeroupper<br \/>\n       vmovups   zmm0,[rsp+48]<br \/>\n       vmovups   zmm1,[rsp+88]<br \/>\n       vpaddb    zmm2,zmm0,zmm1<br \/>\n       vpsubb    zmm0,zmm0,zmm1<br \/>\n       vmovups   zmm1,[rsp+8]<br \/>\n       vpternlogd zmm1,zmm2,zmm0,0CA<br \/>\n       vmovups   [rdi],zmm1<br \/>\n       mov       rax,rdi<br \/>\n       vzeroupper<br \/>\n       ret<br \/>\n; Total bytes of code 68<\/p>\n<p>We can see it\u2019s computing both the b + c (vpaddb zmm2,zmm0,zmm1) and the b &#8211; c (vpsubb 
zmm0,zmm0,zmm1), and it\u2019s then choosing between them based on the mask ([rsp+8], aka the a parameter). In this example, the mask a was being passed in and computed in a manner unknown to the ConditionalSelect. A more common scheme, however, is that the mask is computed as an argument to the ConditionalSelect. Let\u2019s say for example that instead of passing in a as a mask, we pass in Vector512.LessThan(b, c) as the mask:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<br \/>\nusing System.Runtime.Intrinsics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public Vector512&lt;byte&gt; Test() =&gt; Exp(default, default);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    public static Vector512&lt;byte&gt; Exp(Vector512&lt;byte&gt; b, Vector512&lt;byte&gt; c) =&gt;<br \/>\n        Vector512.ConditionalSelect(Vector512.LessThan(b, c), b + c, b &#8211; c);<br \/>\n}<\/p>\n<p>AVX512 supports this implicitly via embedded masking, which means that instructions can include the masking operation as part of them rather than performing the operation separately and then doing the masking via vpternlogd. Instructions like the comparison operation employed by LessThan can target storing the result into a new kind of register defined by AVX512, a mask register, and then that mask register can be used as part of other compound operations to incorporate the mask into them. 
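The effect of such a merge-masked instruction can be sketched lane-by-lane in Python (an illustrative model, assuming 8-bit lanes and a one-bit-per-lane mask; the helper name is hypothetical):

```python
def masked_add(dest, a, b, k):
    # AVX512 merge-masking: lanes whose mask bit is 1 receive a+b
    # (wrapping at 8 bits); lanes whose bit is 0 keep dest's old value.
    return [((x + y) & 0xFF) if bit else d
            for d, x, y, bit in zip(dest, a, b, k)]

# With dest already holding another result per lane, the masked add
# overwrites only the lanes the mask selects:
assert masked_add([9, 9, 9], [1, 2, 3], [10, 20, 30], [1, 0, 1]) == [11, 9, 33]
```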
.NET developers don\u2019t need to do anything to take advantage of this support, though: the JIT just uses the specialized masking instructions where it sees an opportunity to do so. For the previous example, on .NET 8, we\u2019d get this:<\/p>\n<p>; Tests.Exp(System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;, System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;)<br \/>\n       vzeroupper<br \/>\n       vmovups   zmm0,[rsp+8]<br \/>\n       vmovups   zmm1,[rsp+48]<br \/>\n       vpcmpltub k1,zmm0,zmm1<br \/>\n       vpmovm2b  zmm2,k1<br \/>\n       vpaddb    zmm3,zmm0,zmm1<br \/>\n       vpsubb    zmm0,zmm0,zmm1<br \/>\n       vpternlogd zmm2,zmm3,zmm0,0CA<br \/>\n       vmovups   [rdi],zmm2<br \/>\n       mov       rax,rdi<br \/>\n       vzeroupper<br \/>\n       ret<br \/>\n; Total bytes of code 70<\/p>\n<p>Here we still have a vpternlogd. But, with the aforementioned PRs, now here\u2019s what we get on .NET 9:<\/p>\n<p>; Tests.Exp(System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;, System.Runtime.Intrinsics.Vector512`1&lt;Byte&gt;)<br \/>\n       vmovups   zmm0,[rsp+8]<br \/>\n       vmovups   zmm1,[rsp+48]<br \/>\n       vpcmpltub k1,zmm0,zmm1<br \/>\n       vpsubb    zmm2,zmm0,zmm1<br \/>\n       vpaddb    zmm2{k1},zmm0,zmm1<br \/>\n       vmovups   [rdi],zmm2<br \/>\n       mov       rax,rdi<br \/>\n       vzeroupper<br \/>\n       ret<br \/>\n; Total bytes of code 54<\/p>\n<p>That vpcmpltub instruction is doing the LessThan between b and c and storing the result as a mask in the k1 masking register. The vpsubb for the b &#8211; c still happens as it did before. But now the b + c operation is significantly different, and note there\u2019s no vpternlogd anymore. The vpternlogd and the vpaddb we previously saw have now effectively been folded into a single vpaddb instruction <em>with<\/em> the mask. The result of the b &#8211; c is sitting in the zmm2 register. 
The vpaddb instruction then performs the addition between zmm0 (b) and zmm1 (c), and uses the mask k1 to decide whether to use that addition result or the existing subtraction result in zmm2. (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97468\">dotnet\/runtime#97468<\/a> also enables some such usage of vpternlogd to instead use vblendmps. vblendmps is similar to vpternlogd except that it\u2019s specific to floating-point and works with one of the dedicated mask registers.)<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97529\">dotnet\/runtime#97529<\/a> also improved casting from double and float to integer types, in particular when AVX512 is available such that it can benefit from dedicated AVX512 instructions for the purpose, e.g. the VCVTTSD2USI (Convert With Truncation Scalar Double Precision Floating-Point Value to Unsigned Integer) instruction.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Linq;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private double[] _doubles = Enumerable.Range(0, 1024).Select(i =&gt; (double)i).ToArray();<br \/>\n    private ulong[] _ulongs = new ulong[1024];<\/p>\n<p>    [Benchmark]<br \/>\n    public void DoubleToUlong()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;double&gt; doubles = _doubles;<br \/>\n        Span&lt;ulong&gt; ulongs = _ulongs;<br \/>\n        for (int i = 0; i &lt; doubles.Length; i++)<br \/>\n        {<br \/>\n            ulongs[i] = (ulong)doubles[i];<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<\/p>\n<p>DoubleToUlong<br \/>\n.NET 8.0<br
\/>\n1,386.5 ns<br \/>\n1.00<br \/>\n135 B<\/p>\n<p>DoubleToUlong<br \/>\n.NET 9.0<br \/>\n461.4 ns<br \/>\n0.33<br \/>\n102 B<\/p>\n<h3>Vectorization<\/h3>\n<p>In addition to improvements that teach the JIT about entirely new architectures, there have also been a plethora of improvements that simply help the JIT to better employ SIMD in general.<\/p>\n<p>One of my favorites is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92852\">dotnet\/runtime#92852<\/a>, which merges consecutive stores into a single operation. Consider wanting to implement a method like bool.TryFormat:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private bool _value;<br \/>\n    private char[] _destination = new char[10];<\/p>\n<p>    [Benchmark]<br \/>\n    public bool TryFormat() =&gt; TryFormat(_destination, out _);<\/p>\n<p>    private bool TryFormat(char[] destination, out int charsWritten)<br \/>\n    {<br \/>\n        if (_value)<br \/>\n        {<br \/>\n            if (destination.Length &gt;= 4)<br \/>\n            {<br \/>\n                destination[0] = &#8216;T&#8217;;<br \/>\n                destination[1] = &#8216;r&#8217;;<br \/>\n                destination[2] = &#8216;u&#8217;;<br \/>\n                destination[3] = &#8216;e&#8217;;<br \/>\n                charsWritten = 4;<br \/>\n                return true;<\/p>\n<p>            }<br \/>\n        }<br \/>\n        else<br \/>\n        {<br \/>\n            if (destination.Length &gt;= 5)<br \/>\n            {<br \/>\n                destination[0] = &#8216;F&#8217;;<br \/>\n                
destination[1] = &#8216;a&#8217;;<br \/>\n                destination[2] = &#8216;l&#8217;;<br \/>\n                destination[3] = &#8216;s&#8217;;<br \/>\n                destination[4] = &#8216;e&#8217;;<br \/>\n                charsWritten = 5;<br \/>\n                return true;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        charsWritten = 0;<br \/>\n        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>Pretty simple: we\u2019re writing out each individual value. That\u2019s a bit unfortunate, though, in that we\u2019re then naively spending several movs to write each character individually, when instead we could pack all of those values into a single value to write. In fact, that\u2019s exactly what the real bool.TryFormat does. Here is its handling of the true case today:<\/p>\n<p>if (destination.Length &gt; 3)<br \/>\n{<br \/>\n    ulong true_val = BitConverter.IsLittleEndian ? 0x65007500720054ul : 0x54007200750065ul; \/\/ &#8220;True&#8221;<br \/>\n    MemoryMarshal.Write(MemoryMarshal.AsBytes(destination), in true_val);<br \/>\n    charsWritten = 4;<br \/>\n    return true;<br \/>\n}<\/p>\n<p>The developer has manually done the work of computing the value of the merged writes, e.g.<\/p>\n<p>ulong true_val = (((ulong)&#8217;e&#8217; &lt;&lt; 48) | ((ulong)&#8217;u&#8217; &lt;&lt; 32) | ((ulong)&#8217;r&#8217; &lt;&lt; 16) | (ulong)&#8217;T&#8217;);<br \/>\nAssert.Equal(0x65007500720054ul, true_val);<\/p>\n<p>in order to be able to perform a single write rather than doing four individual ones. For this particular case, now in .NET 9, the JIT can automatically do this merging so the developer doesn\u2019t have to. 
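As a quick sanity check of that constant (a standalone C sketch, not part of the library code): packing the four UTF-16 code units of "True" in little-endian order produces the same 64-bit value.

```c
#include <stdint.h>

/* Pack the four UTF-16 code units of "True" into one little-endian
 * 64-bit value, so that all four characters can be written with a
 * single 8-byte store: 'T' in the low 16 bits, then 'r', 'u', 'e'. */
uint64_t pack_true_utf16(void)
{
    return ((uint64_t)'e' << 48) |
           ((uint64_t)'u' << 32) |
           ((uint64_t)'r' << 16) |
           (uint64_t)'T';
}
```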
The developer just writes the code that\u2019s natural to write, and the JIT does the heavy lifting of optimizing its output (note below the mov rax, 65007500720054 instruction, loading the same value we manually computed above).<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.TryFormat(Char[], Int32 ByRef)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       byte ptr [rdi+10],0<br \/>\n       jne       short M01_L01<br \/>\n       mov       ecx,[rsi+8]<br \/>\n       cmp       ecx,5<br \/>\n       jl        short M01_L00<br \/>\n       mov       word ptr [rsi+10],46<br \/>\n       mov       word ptr [rsi+12],61<br \/>\n       mov       word ptr [rsi+14],6C<br \/>\n       mov       word ptr [rsi+16],73<br \/>\n       mov       word ptr [rsi+18],65<br \/>\n       mov       dword ptr [rdx],5<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L00:<br \/>\n       xor       eax,eax<br \/>\n       mov       [rdx],eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L01:<br \/>\n       mov       ecx,[rsi+8]<br \/>\n       cmp       ecx,4<br \/>\n       jl        short M01_L00<br \/>\n       mov       word ptr [rsi+10],54<br \/>\n       mov       word ptr [rsi+12],72<br \/>\n       mov       word ptr [rsi+14],75<br \/>\n       mov       word ptr [rsi+16],65<br \/>\n       mov       dword ptr [rdx],4<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 112<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.TryFormat(Char[], Int32 ByRef)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       byte ptr [rdi+10],0<br \/>\n       jne       short M01_L00<br \/>\n       mov       ecx,[rsi+8]<br \/>\n       cmp       ecx,5<br \/>\n       jl        short M01_L01<br \/>\n       mov       rax,73006C00610046<br \/>\n       mov       [rsi+10],rax<br \/>\n       mov       word ptr [rsi+18],65<br \/>\n       mov       dword ptr [rdx],5<br 
\/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L00:<br \/>\n       mov       ecx,[rsi+8]<br \/>\n       cmp       ecx,4<br \/>\n       jl        short M01_L01<br \/>\n       mov       rax,65007500720054<br \/>\n       mov       [rsi+10],rax<br \/>\n       mov       dword ptr [rdx],4<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM01_L01:<br \/>\n       xor       eax,eax<br \/>\n       mov       [rdx],eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 92<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92939\">dotnet\/runtime#92939<\/a> improves this further by enabling longer sequences to similarly be merged using SIMD instructions.<\/p>\n<p>Of course, you may then wonder, why wasn\u2019t bool.TryFormat reverted to use the simpler code? The unfortunate answer is that this optimization only currently applies to array targets rather than span targets. That\u2019s because there are alignment requirements for performing these kinds of writes, and whereas the JIT can make certain assumptions about the alignment of arrays, it can\u2019t make those same assumptions about spans, which can represent slices of something else at unaligned boundaries. This is now one of the few cases where arrays are better than spans; typically span is as good or better. But I\u2019m hopeful it will be improved in the future.<\/p>\n<p>Another nice improvement is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86811\">dotnet\/runtime#86811<\/a> from <a href=\"https:\/\/github.com\/BladeWise\">@BladeWise<\/a>, which adds SIMD support for multiplying two vectors of bytes or sbytes. Previously this would end up falling back to a software implementation, which is very slow compared to true SIMD operations. 
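For reference, the operation being vectorized is just an independent multiply in each 8-bit lane, with each product truncated back to a byte; here's a scalar C model of those semantics (the SIMD version computes all 16 lanes at once):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of byte-vector multiplication: each lane is multiplied
 * independently, and each product wraps modulo 256. */
void mul_bytes(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(a[i] * b[i]);
}
```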
Now, the code is much faster and much more compact.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.Intrinsics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Vector128&lt;byte&gt; _v1 = Vector128.Create((byte)0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);<\/p>\n<p>    [Benchmark]<br \/>\n    public Vector128&lt;byte&gt; Square() =&gt; _v1 * _v1;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Square<br \/>\n.NET 8.0<br \/>\n15.4731 ns<br \/>\n1.000<\/p>\n<p>Square<br \/>\n.NET 9.0<br \/>\n0.0284 ns<br \/>\n0.002<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103555\">dotnet\/runtime#103555<\/a> (x64, when AVX512 isn\u2019t available) and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104177\">dotnet\/runtime#104177<\/a> (Arm64) also improve vector multiplication, this time for long\/ulong. 
This can be seen with a simple micro-benchmark (and because I\u2019m running on a machine that supports AVX512, the benchmark is explicitly disabling it):<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.Intrinsics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance<br \/>\n    .AddJob(Job.Default.WithId(&#8220;.NET 8&#8221;).WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(&#8220;DOTNET_EnableAVX512F&#8221;, &#8220;0&#8221;).AsBaseline())<br \/>\n    .AddJob(Job.Default.WithId(&#8220;.NET 9&#8221;).WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable(&#8220;DOTNET_EnableAVX512F&#8221;, &#8220;0&#8221;)));<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Vector256&lt;long&gt; _a = Vector256.Create(1, 2, 3, 4);<br \/>\n    private Vector256&lt;long&gt; _b = Vector256.Create(5, 6, 7, 8);<\/p>\n<p>    [Benchmark]<br \/>\n    public Vector256&lt;long&gt; Multiply() =&gt; Vector256.Multiply(_a, _b);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Multiply<br \/>\n.NET 8.0<br \/>\n9.5448 ns<br \/>\n1.00<\/p>\n<p>Multiply<br \/>\n.NET 9.0<br \/>\n0.3868 ns<br \/>\n0.04<\/p>\n<p>It\u2019s also evident, however, on higher-level benchmarks, for example on this benchmark for XxHash128, an implementation that makes heavy use of multiplication of such vectors.<\/p>\n<p>\/\/ Add a &lt;PackageReference Include=&#8221;System.IO.Hashing&#8221; Version=&#8221;8.0.0&#8243; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using 
BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing System.IO.Hashing;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, DefaultConfig.Instance<br \/>\n    .AddJob(Job.Default.WithId(&#8220;.NET 8&#8221;).WithRuntime(CoreRuntime.Core80).WithEnvironmentVariable(&#8220;DOTNET_EnableAVX512F&#8221;, &#8220;0&#8221;).AsBaseline())<br \/>\n    .AddJob(Job.Default.WithId(&#8220;.NET 9&#8221;).WithRuntime(CoreRuntime.Core90).WithEnvironmentVariable(&#8220;DOTNET_EnableAVX512F&#8221;, &#8220;0&#8221;)));<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _data;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _data = new byte[1024 * 1024];<br \/>\n        new Random(42).NextBytes(_data);<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public UInt128 Hash() =&gt; XxHash128.HashToUInt128(_data);<br \/>\n}<\/p>\n<p>This benchmark references the System.IO.Hashing nuget package. 
Note that we\u2019re explicitly adding in a reference to the 8.0.0 version; that means that even when running on .NET 9, we\u2019re using the .NET 8 version of the hashing code, yet it\u2019s still significantly faster, because of these runtime improvements.<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Hash<br \/>\n.NET 8.0<br \/>\n40.49 us<br \/>\n1.00<\/p>\n<p>Hash<br \/>\n.NET 9.0<br \/>\n26.40 us<br \/>\n0.65<\/p>\n<p>Some other notable examples:<\/p>\n<p><strong>Improved SIMD comparisons.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104944\">dotnet\/runtime#104944<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104215\">dotnet\/runtime#104215<\/a> improve how vector comparisons are handled.<br \/>\n<strong>Improved ConditionalSelects.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104092\">dotnet\/runtime#104092<\/a> from <a href=\"https:\/\/github.com\/ezhevita\">@ezhevita<\/a> improves the generated code for ConditionalSelects when the condition is a set of constants.<br \/>\n<strong>Better Const Handling.<\/strong> Certain operations are only optimized when one of their arguments is a constant, otherwise falling back to a much slower software emulation implementation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102827\">dotnet\/runtime#102827<\/a> enables such instructions (like for shuffling) to continue to be treated as optimized operations if the non-const argument becomes a constant as part of other optimizations (like inlining).<br \/>\n<strong>Unblocking other optimizations.<\/strong> Some changes don\u2019t themselves introduce optimizations, but instead make tweaks that enable other optimizations to do a better job. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104517\">dotnet\/runtime#104517<\/a> decomposes some bitwise operations (e.g. 
replacing a unified \u201cand not\u201d operation with an \u201cand\u201d and a \u201cnot\u201d), which in turn enables other existing optimizations like common sub-expression elimination (CSE) to kick in more often. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104214\">dotnet\/runtime#104214<\/a> normalized various negation patterns, which similarly enables other optimizations to apply in more places.<\/p>\n<h3>Branching<\/h3>\n<p>Just like the JIT tries to elide redundant bounds checking, where it can prove the bounds check is unnecessary, it similarly does so for branching.<\/p>\n<p>The ability to handle the relationships between branches is improved in .NET 9. Consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(50)]<br \/>\n    public void Test(int x)<br \/>\n    {<br \/>\n        if (x &gt; 100)<br \/>\n        {<br \/>\n            Helper(x);<br \/>\n        }<br \/>\n    }<\/p>\n<p>    private void Helper(int x)<br \/>\n    {<br \/>\n        if (x &gt; 10)<br \/>\n        {<br \/>\n            Console.WriteLine(&#8220;Hello!&#8221;);<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>The Helper function is simple enough to be inlined, and in .NET 8 we end up with this assembly:<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,64<br \/>\n       jg        short M00_L01<br \/>\nM00_L00:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       cmp       esi,0A<br \/>\n       jle       
short M00_L00<br \/>\n       mov       rdi,7F35E44C7E18<br \/>\n       pop       rbp<br \/>\n       jmp       qword ptr [7F35E914C7C8]<br \/>\n; Total bytes of code 33<\/p>\n<p>We can see in the original code that the branch within the inlined Helper is entirely unnecessary: we\u2019re only there if x is greater than 100, so it\u2019s definitely greater than 10, yet in the assembly code, we have both comparisons happening (notice the two cmps). Now in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95234\">dotnet\/runtime#95234<\/a> which improves the JIT\u2019s ability to reason about the relationship between two ranges and whether one is implied by the other, we get this instead:<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       cmp       esi,64<br \/>\n       jg        short M00_L00<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       rdi,7F81C120EE20<br \/>\n       jmp       qword ptr [7F8148626628]<br \/>\n; Total bytes of code 22<\/p>\n<p>Just the one outer cmp. The same thing happens for the negative case: if we tweak the x &gt; 10 to instead be x &lt; 10, we end up with this:<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.Test(Int32)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,64<br \/>\n       jg        short M00_L01<br \/>\nM00_L00:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       cmp       esi,0A<br \/>\n       jge       short M00_L00<br \/>\n       mov       rdi,7F6138428DE0<br \/>\n       pop       rbp<br \/>\n       jmp       qword ptr [7FA1DDD4C7C8]<br \/>\n; Total bytes of code 33<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.Test(Int32)<br \/>\n       ret<br \/>\n; Total bytes of code 1<\/p>\n<p>Similar to the x &gt; 10 case, on .NET 8 the JIT retained both branches. 
But on .NET 9, it recognized that not only was the inner conditional redundant, it was redundant in a way that would make it always false, which then allowed it to dead-code eliminate the body of that if, leaving the whole method a nop. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/94689\">dotnet\/runtime#94689<\/a> makes this kind of information flow by enabling the JIT\u2019s support for \u201ccross-block local assertion prop\u201d.<\/p>\n<p>Another PR that eliminated some redundant branches is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/94563\">dotnet\/runtime#94563<\/a>, which feeds information from value numbering (a technique used to eliminate redundant expressions by giving every unique expression its own unique identifier) into the building of PHIs (a kind of node in the JIT\u2019s intermediate representation of the code that aids in determining a variable\u2019s value based on control flow). Consider this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic unsafe class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(50)]<br \/>\n    public void Test(int x)<br \/>\n    {<br \/>\n        byte[] data = new byte[128];<br \/>\n        fixed (byte* ptr = data)<br \/>\n        {<br \/>\n            Nop(ptr);<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static void Nop(byte* ptr) { }<br \/>\n}<\/p>\n<p>This is allocating a byte[] and then pinning it in order to use it with a method that requires a pointer. 
The C# specification for fixed with arrays states \u201cIf the array expression is null or if the array has zero elements, the initializer computes an address equal to zero,\u201d and as such if you look at the IL for this code, you\u2019ll see that it\u2019s checking the length and setting the pointer equal to 0 if the length is 0. You can see this same behavior explicitly implemented as well for spans if you look at Span&lt;T&gt;\u2018s GetPinnableReference implementation:<\/p>\n<p>public ref T GetPinnableReference()<br \/>\n{<br \/>\n    ref T ret = ref Unsafe.NullRef&lt;T&gt;();<br \/>\n    if (_length != 0) ret = ref _reference;<br \/>\n    return ref ret;<br \/>\n}<\/p>\n<p>As such, there\u2019s actually an extra branch not visible in the Tests.Test test. But, in this particular case, that branch is also redundant, because we can very clearly see (and the JIT should be able to as well) that the length of the array is non-0. On .NET 8, we still get that branch:<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       push      rbp<br \/>\n       sub       rsp,10<br \/>\n       lea       rbp,[rsp+10]<br \/>\n       xor       eax,eax<br \/>\n       mov       [rbp-8],rax<br \/>\n       mov       rdi,offset MT_System.Byte[]<br \/>\n       mov       esi,80<br \/>\n       call      CORINFO_HELP_NEWARR_1_VC<br \/>\n       mov       [rbp-8],rax<br \/>\n       mov       rdi,[rbp-8]<br \/>\n       cmp       dword ptr [rdi+8],0<br \/>\n       je        short M00_L01<br \/>\n       mov       rdi,[rbp-8]<br \/>\n       cmp       dword ptr [rdi+8],0<br \/>\n       jbe       short M00_L02<br \/>\n       mov       rdi,[rbp-8]<br \/>\n       add       rdi,10<br \/>\nM00_L00:<br \/>\n       call      qword ptr [7F3F99B45C98]; Tests.Nop(Byte*)<br \/>\n       xor       eax,eax<br \/>\n       mov       [rbp-8],rax<br \/>\n       add       rsp,10<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       xor       edi,edi<br \/>\n       jmp       short M00_L00<br 
\/>\nM00_L02:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 96<\/p>\n<p>But now on .NET 9, that branch (in fact, multiple redundant branches) is removed:<\/p>\n<p>; Tests.Test(Int32)<br \/>\n       push      rax<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsp],rax<br \/>\n       mov       rdi,offset MT_System.Byte[]<br \/>\n       mov       esi,80<br \/>\n       call      CORINFO_HELP_NEWARR_1_VC<br \/>\n       mov       [rsp],rax<br \/>\n       add       rax,10<br \/>\n       mov       rdi,rax<br \/>\n       call      qword ptr [7F22DAC844C8]; Tests.Nop(Byte*)<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsp],rax<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 55<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87656\">dotnet\/runtime#87656<\/a> is another nice example and addition to the JIT\u2019s optimization repertoire. As was discussed earlier, branches have costs associated with them. A hardware\u2019s branch predictor can often do a very good job of mitigating the bulk of those costs, but there\u2019s still some, and even if it were fully mitigated in the common case, a branch prediction failure can be relatively very costly. As such, minimizing branches can be very helpful, and if nothing else, turning branch-based operations into branchless ones leads to more consistent and predictable throughput, as it\u2019s then less subject to the nature of the data being processed. 
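A classic way to make a small set-membership check branchless is a precomputed bitmask with one bit per member. Here's a hand-written C sketch of the technique for a whitespace set with code points 9, 10, 13, and 32 (illustrative only; as shown next, the JIT can derive an equivalent mask automatically):

```c
#include <stdbool.h>
#include <stdint.h>

/* Branchless membership test: one mask bit per member of the set
 * { '\t' (9), '\n' (10), '\r' (13), ' ' (32) }. A single range check
 * plus a shift-and-mask replaces a chain of equality branches. */
bool is_json_whitespace(int c)
{
    const uint64_t mask = (1ULL << '\t') | (1ULL << '\n') |
                          (1ULL << '\r') | (1ULL << ' ');
    return (unsigned)c <= ' ' && ((mask >> c) & 1) != 0;
}
```

The cast to unsigned makes negative inputs fail the range check, so the shift amount is always valid.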
Consider the following function that\u2019s used to determine whether a character is part of a particular subset of whitespace characters:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&#8216;s&#8217;)]<br \/>\n    public bool IsJsonWhitespace(int c)<br \/>\n    {<br \/>\n        if (c == &#8216; &#8217; || c == &#8216;\\t&#8217; || c == &#8216;\\r&#8217; || c == &#8216;\\n&#8217;)<br \/>\n        {<br \/>\n            return true;<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>On .NET 8, we get what you\u2019d probably expect, a series of cmps followed by conditional jumps, one for each character:<\/p>\n<p>; Tests.IsJsonWhitespace(Int32)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,20<br \/>\n       je        short M00_L00<br \/>\n       cmp       esi,9<br \/>\n       je        short M00_L00<br \/>\n       cmp       esi,0D<br \/>\n       je        short M00_L00<br \/>\n       cmp       esi,0A<br \/>\n       je        short M00_L00<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 35<\/p>\n<p>On .NET 9, though, we now get this:<\/p>\n<p>; Tests.IsJsonWhitespace(Int32)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       cmp       esi,20<br \/>\n       ja        short M00_L00<br \/>\n       mov       eax,0FFFFD9FF<br \/>\n       bt        rax,rsi<br \/>\n       jae      
 short M00_L01<br \/>\nM00_L00:<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 31<\/p>\n<p>It\u2019s now using a bt instruction (a bit test) against a pattern where there\u2019s a bit set for each of the characters being tested against, consolidating most of the branches down to just this one.<\/p>\n<p>Unfortunately, this also highlights that such optimizations, which are looking for a particular pattern, can get knocked off their golden path, at which point the optimization won\u2019t kick in. In this case, there are several ways it can get knocked off. The most obvious is if there are too many values or if they\u2019re too spread out, such that they can\u2019t fit into the 32-bit or 64-bit bit mask. More interestingly, if you switch it to instead use C# pattern matching (e.g. c is &#8216; &#8217; or &#8216;\\t&#8217; or &#8216;\\r&#8217; or &#8216;\\n&#8217;), it also doesn\u2019t kick in. Why? Because the C# compiler itself is trying to optimize, and the pattern it ends up generating in the IL is different from what this optimization is expecting. I expect this\u2019ll get better in the future, but it\u2019s a good reminder that these kinds of optimizations are useful when they make arbitrary code better, but if you\u2019re coding to the exact nature of the optimization and relying on it happening, you really need to be paying attention.<\/p>\n<p>A related optimization was added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93521\">dotnet\/runtime#93521<\/a>. 
Consider a function like the following, which is checking to see whether a character is a lower-case hexadecimal char:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&#8216;s&#8217;)]<br \/>\n    public bool IsHexLower(char c)<br \/>\n    {<br \/>\n        if ((c &gt;= &#8216;0&#8217; &amp;&amp; c &lt;= &#8216;9&#8217;) || (c &gt;= &#8216;a&#8217; &amp;&amp; c &lt;= &#8216;f&#8217;))<br \/>\n        {<br \/>\n            return true;<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>On .NET 8, we get a comparison against &#8216;0&#8217;, a comparison against &#8216;9&#8217;, a comparison against &#8216;a&#8217;, and a comparison against &#8216;f&#8217;, with a conditional branch for each:<\/p>\n<p>; Tests.IsHexLower(Char)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       movzx     eax,si<br \/>\n       cmp       eax,30<br \/>\n       jl        short M00_L00<br \/>\n       cmp       eax,39<br \/>\n       jle       short M00_L02<br \/>\nM00_L00:<br \/>\n       cmp       eax,61<br \/>\n       jl        short M00_L01<br \/>\n       cmp       eax,66<br \/>\n       jle       short M00_L02<br \/>\nM00_L01:<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L02:<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 38<\/p>\n<p>But on .NET 9, we instead get this:<\/p>\n<p>; Tests.IsHexLower(Char)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n      
 movzx     eax,si<br \/>\n       mov       ecx,eax<br \/>\n       sub       ecx,30<br \/>\n       cmp       ecx,9<br \/>\n       jbe       short M00_L00<br \/>\n       sub       eax,61<br \/>\n       cmp       eax,5<br \/>\n       jbe       short M00_L00<br \/>\n       xor       eax,eax<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       mov       eax,1<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 36<\/p>\n<p>Effectively the JIT has rewritten the condition as if I\u2019d written it like this:<\/p>\n<p>(((uint)c &#8211; &#8216;0&#8217;) &lt;= (&#8216;9&#8217; &#8211; &#8216;0&#8217;)) || (((uint)c &#8211; &#8216;a&#8217;) &lt;= (&#8216;f&#8217; &#8211; &#8216;a&#8217;))<\/p>\n<p>which is nice, because it\u2019s replaced two of the conditional branches with two (cheaper) subtractions.<\/p>\n<h3>Write Barriers<\/h3>\n<p>The .NET garbage collector (GC) is a generational collector. That means it divides the heap up logically by object age, where \u201cgeneration 0\u201d (or \u201cgen0\u201d) are the newest objects that haven\u2019t been around for very long, \u201cgen2\u201d are the objects that have been around for a while, and \u201cgen1\u201d are in the middle. This approach is based on the theory (that also generally plays out in practice) that most objects end up being very short-lived, created for some task and then quickly dropped, and conversely that if an object has been around for a while, there\u2019s a really good chance it\u2019ll continue to be around for a while. By partitioning up objects like this, the GC can reduce the amount of work it needs to do when it scans for objects to be collected. It can do a scan focused only on gen0 objects, allowing it to ignore anything in gen1 or gen2 and thereby make its scan much faster. Or at least, that\u2019s the goal. 
If it were to only scan gen0 objects, though, it could easily think a gen0 object wasn\u2019t referenced because it couldn\u2019t find any references to one from other gen0 objects\u2026 but there may have been a reference from a gen1 or gen2 object. That would be bad. How does the GC deal with this then, having its cake and eating it, too? It colludes with the rest of the runtime to track any time its generational assumptions might be violated. The GC maintains a table (called the \u201ccard table\u201d) that indicates whether an object in a higher generation <em>might<\/em> contain a reference to a lower generation object, and any time a reference is written such that there could end up being a reference from a higher generation to a lower one, this table is updated. Then when the GC does its scan, it only needs to examine higher generation objects if the relevant bit in the table is set (the table doesn\u2019t track individual objects, just ranges of them, so it\u2019s similar to a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bloom_filter\">\u201cBloom filter\u201d<\/a>, where the lack of a bit means there\u2019s definitely not a reference but the presence of a bit only means there <em>might<\/em> be a reference).<\/p>\n<p>The code that\u2019s executed to track the reference write and possibly update the card table is referred to as a GC write barrier. And, obviously, if that code is happening every time a reference is written to an object, you really, really, really want that code to be efficient. There are actually multiple different forms of GC write barriers, all specialized for slightly different purposes.<\/p>\n<p>The standard GC write barrier is CORINFO_HELP_ASSIGN_REF. However, there\u2019s another one called CORINFO_HELP_CHECKED_ASSIGN_REF that needs to do a bit more work. 
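<\/p>\n<p>To make the card table mechanism concrete, here is a hypothetical, much-simplified C# simulation of what a write barrier conceptually does when it marks a card. All of the names here are invented for illustration; the real barriers are hand-tuned routines inside the runtime, not managed code.<\/p>

```csharp
using System;

class CardTableSketch
{
    // One card covers a 512-byte region of the simulated heap (2^9 bytes).
    const int CardShift = 9;
    readonly byte[] _cards;

    public CardTableSketch(int heapSize) => _cards = new byte[(heapSize >> CardShift) + 1];

    // What a write barrier conceptually does after storing a reference at
    // targetAddress: mark the card so the GC knows this region *might* now
    // contain a reference from a higher generation to a lower one.
    public void OnReferenceWritten(int targetAddress) => _cards[targetAddress >> CardShift] = 1;

    // Unmarked means definitely no such reference in that region;
    // marked only means there might be one.
    public bool MustScanRegion(int address) => _cards[address >> CardShift] != 0;

    static void Main()
    {
        var table = new CardTableSketch(heapSize: 4096);
        table.OnReferenceWritten(targetAddress: 1234);
        Console.WriteLine(table.MustScanRegion(1234)); // True: same 512-byte region
        Console.WriteLine(table.MustScanRegion(3000)); // False: untouched region
    }
}
```

<p>The Bloom-filter-like property is visible in the sketch: a cleared card proves the region holds no cross-generational reference, while a set card only says one might exist there.<\/p>\n<p>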
The JIT is the one deciding which of these to use, and it uses the latter when it\u2019s possible the target isn\u2019t on the heap, in which case the barrier needs to do a little more work to figure that out.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98166\">dotnet\/runtime#98166<\/a> helps the JIT do better in a certain case. If you have a static field of a value type:<\/p>\n<p>static SomeStruct s_someField;<br \/>\n&#8230;<br \/>\nstruct SomeStruct<br \/>\n{<br \/>\n    public object Obj;<br \/>\n}<\/p>\n<p>the runtime implements that by having a box associated with that field for storing that struct. Such static boxes are always on the heap, so if you then do:<\/p>\n<p>static void Store(object o) =&gt; s_someField.Obj = o;<\/p>\n<p>the JIT can prove that the cheaper unchecked write barrier may be used, and this PR teaches it that. Previously sometimes the JIT would be able to figure it out, but this effectively ensures it.<\/p>\n<p>Another similar improvement comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97953\">dotnet\/runtime#97953<\/a>. 
Here\u2019s an example based on ConcurrentQueue&lt;T&gt;, which maintains arrays of elements, each of which is the actual item tagged with a sequence number that\u2019s used by the implementation to maintain correctness in the face of concurrency.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Slot&lt;object&gt;[] _arr = new Slot&lt;object&gt;[1];<br \/>\n    private object _obj = new object();<\/p>\n<p>    [Benchmark]<br \/>\n    public void Test() =&gt; Store(_arr, _obj);<\/p>\n<p>    private static void Store&lt;T&gt;(Slot&lt;T&gt;[] arr, T o)<br \/>\n    {<br \/>\n        arr[0].Item = o;<br \/>\n        arr[0].SequenceNumber = 1;<br \/>\n    }<\/p>\n<p>    private struct Slot&lt;T&gt;<br \/>\n    {<br \/>\n        public T Item;<br \/>\n        public int SequenceNumber;<br \/>\n    }<br \/>\n}<\/p>\n<p>Here as well, we can see that on .NET 8 it\u2019s using the more expensive checked write barrier, but on .NET 9 the JIT has recognized it can use the cheaper unchecked write barrier:<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.Test()<br \/>\n       push      rbx<br \/>\n       mov       rbx,[rdi+8]<br \/>\n       mov       rsi,[rdi+10]<br \/>\n       cmp       dword ptr [rbx+8],0<br \/>\n       jbe       short M00_L00<br \/>\n       add       rbx,10<br \/>\n       mov       rdi,rbx<br \/>\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF<br \/>\n       mov       dword ptr [rbx+8],1<br \/>\n       pop       rbx<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of 
code 42<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.Test()<br \/>\n       push      rbx<br \/>\n       mov       rbx,[rdi+8]<br \/>\n       mov       rsi,[rdi+10]<br \/>\n       cmp       dword ptr [rbx+8],0<br \/>\n       jbe       short M00_L00<br \/>\n       add       rbx,10<br \/>\n       mov       rdi,rbx<br \/>\n       call      CORINFO_HELP_ASSIGN_REF<br \/>\n       mov       dword ptr [rbx+8],1<br \/>\n       pop       rbx<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 42<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101761\">dotnet\/runtime#101761<\/a> actually introduces a new form of write barrier. Consider this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private MyStruct _value;<br \/>\n    private Wrapper _wrapper = new();<\/p>\n<p>    [Benchmark]<br \/>\n    public void Store() =&gt; _wrapper.Value = _value;<\/p>\n<p>    private record struct MyStruct(string a1, string a2, string a3, string a4);<\/p>\n<p>    private class Wrapper<br \/>\n    {<br \/>\n        public MyStruct Value;<br \/>\n    }<br \/>\n}<\/p>\n<p>Previously, as part of copying that struct, each of those fields (represented by a1 through a4) would individually incur a write barrier:<\/p>\n<p>; Tests.Store()<br \/>\n       push      rax<br \/>\n       mov       [rsp],rdi<br \/>\n       mov       rax,[rdi+8]<br \/>\n       lea       rdi,[rax+8]<br \/>\n       mov       rsi,[rsp]<br \/>\n       add       rsi,10<br \/>\n       call      
CORINFO_HELP_ASSIGN_BYREF<br \/>\n       call      CORINFO_HELP_ASSIGN_BYREF<br \/>\n       call      CORINFO_HELP_ASSIGN_BYREF<br \/>\n       call      CORINFO_HELP_ASSIGN_BYREF<br \/>\n       nop<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 47<\/p>\n<p>Now in .NET 9, this PR added a new bulk write barrier, which can implement the operation more efficiently.<\/p>\n<p>; Tests.Store()<br \/>\n       push      rax<br \/>\n       mov       rsi,[rdi+8]<br \/>\n       add       rsi,8<br \/>\n       cmp       [rsi],sil<br \/>\n       add       rdi,10<br \/>\n       mov       [rsp],rdi<br \/>\n       cmp       [rdi],dil<br \/>\n       mov       rdi,rsi<br \/>\n       mov       rsi,[rsp]<br \/>\n       mov       edx,20<br \/>\n       call      qword ptr [7F5831BC5740]; System.Buffer.BulkMoveWithWriteBarrier(Byte ByRef, Byte ByRef, UIntPtr)<br \/>\n       nop<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 47<\/p>\n<p>Making GC write barriers faster is good; after all, they\u2019re used <em>a lot<\/em>. However, switching from the checked write barrier to the non-checked write barrier is a very micro optimization; the extra overhead of the checked variant is often just a couple of comparisons. A better optimization is avoiding the need for a barrier entirely! 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103503\">dotnet\/runtime#103503<\/a> recognizes that ref structs can\u2019t possibly be on the GC heap by their very nature, and as such, write barriers can be entirely elided when writing into the fields of a ref struct.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public void Store()<br \/>\n    {<br \/>\n        MyRefStruct s = default;<br \/>\n        Test(ref s, new object(), new object());<br \/>\n    }<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private void Test(ref MyRefStruct s, object o1, object o2)<br \/>\n    {<br \/>\n        s.Obj1 = o1;<br \/>\n        s.Obj2 = o2;<br \/>\n    }<\/p>\n<p>    private ref struct MyRefStruct<br \/>\n    {<br \/>\n        public object Obj1;<br \/>\n        public object Obj2;<br \/>\n    }<br \/>\n}<\/p>\n<p>On .NET 8, we have two barriers; on .NET 9, zero:<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)<br \/>\n       push      r15<br \/>\n       push      rbx<br \/>\n       mov       rbx,rsi<br \/>\n       mov       r15,rcx<br \/>\n       mov       rdi,rbx<br \/>\n       mov       rsi,rdx<br \/>\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF<br \/>\n       lea       rdi,[rbx+8]<br \/>\n       mov       rsi,r15<br \/>\n       call      CORINFO_HELP_CHECKED_ASSIGN_REF<br \/>\n       nop<br \/>\n       pop       rbx<br \/>\n       pop       r15<br \/>\n       ret<br \/>\n; Total bytes of code 
37<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.Test(MyRefStruct ByRef, System.Object, System.Object)<br \/>\n       mov       [rsi],rdx<br \/>\n       mov       [rsi+8],rcx<br \/>\n       ret<br \/>\n; Total bytes of code 8<\/p>\n<p>Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102084\">dotnet\/runtime#102084<\/a> is able to remove some barriers on Arm64 as part of ref struct copies.<\/p>\n<h3>Object Stack Allocation<\/h3>\n<p>For years, .NET has explored the possibility of stack-allocating managed objects. It\u2019s something that other managed languages like Java are already capable of doing, but it\u2019s also more critical in Java, which lacks the equivalent of value types (e.g. if you want a list of integers, that\u2019d most likely be List&lt;Integer&gt;, which will box each integer value added to the list, similar to if List&lt;object&gt; were used in .NET). In .NET 9, object stack allocation starts to happen. Before you get too excited, it\u2019s limited in scope right now, but in the future it\u2019s likely to expand out further.<\/p>\n<p>The hardest part of stack allocating objects is ensuring that it\u2019s safe. If a reference to the object were to escape and end up being stored somewhere that outlived the stack frame containing the stack-allocated object, that would be very bad; when the method returned, those outstanding references would be pointing to garbage. So, the JIT needs to perform escape analysis to ensure that never happens, and doing that well is extremely challenging. For .NET 9, the support was introduced in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103361\">dotnet\/runtime#103361<\/a> (and brought to Native AOT in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104411\">dotnet\/runtime#104411<\/a>), and it doesn\u2019t do any interprocedural analysis, which means it\u2019s limited to only handling cases where it can easily prove the object reference doesn\u2019t leave the current frame. 
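<\/p>\n<p>To illustrate what \u201cescaping\u201d means here, consider a hypothetical pair of methods (the names are invented for illustration, not taken from the runtime). In the first, the allocated object can never be observed outside its frame, making it a candidate for stack allocation; in the second, the reference is stored into a static field and outlives the frame, so the object must stay on the GC heap:<\/p>

```csharp
using System;

class EscapeAnalysisSketch
{
    static int[] s_cache;

    // The array reference never leaves this frame, so an escape analysis
    // could prove it safe to promote the allocation to the stack frame.
    public static int DoesNotEscape()
    {
        int[] numbers = { 40, 2 };
        return numbers[0] + numbers[1];
    }

    // The reference is written to a static field and outlives the frame,
    // so this object must remain a heap allocation.
    public static int Escapes()
    {
        int[] numbers = { 40, 2 };
        s_cache = numbers;
        return numbers[0] + numbers[1];
    }

    static void Main()
    {
        Console.WriteLine(DoesNotEscape()); // 42
        Console.WriteLine(Escapes());       // 42
    }
}
```

<p>Both methods compute the same result; the difference is purely in where the object's lifetime can be proven to end.<\/p>\n<p>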
Even so, there are plenty of situations where this will help to eliminate allocations, and I expect it\u2019ll be expanded to handle more and more cases in the future. When the JIT does choose to allocate an object on the stack, it effectively promotes the fields of the object to be individual variables in the stack frame.<\/p>\n<p>Here\u2019s a very simple example of the mechanism in action:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[DisassemblyDiagnoser]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int GetValue() =&gt; new MyObj(42).Value;<\/p>\n<p>    private class MyObj<br \/>\n    {<br \/>\n        public MyObj(int value) =&gt; Value = value;<br \/>\n        public int Value { get; }<br \/>\n    }<br \/>\n}<\/p>\n<p>On .NET 8, the generated code for GetValue looks like this:<\/p>\n<p>; Tests.GetValue()<br \/>\n       push      rax<br \/>\n       mov       rdi,offset MT_Tests+MyObj<br \/>\n       call      CORINFO_HELP_NEWSFAST<br \/>\n       mov       dword ptr [rax+8],2A<br \/>\n       mov       eax,[rax+8]<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 31<\/p>\n<p>The generated code is allocating a new object, populating that object\u2019s Value, and then reading that Value as the value to return. 
On .NET 9, we instead end up with this picture of simplicity:<\/p>\n<p>; Tests.GetValue()<br \/>\n       mov       eax,2A<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>The JIT has inlined the constructor, inlined accesses to the Value property, promoted the field backing that property to be a variable, and in effect optimized the entire operation simply to be return 42;.<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Mean<\/th>\n<th>Ratio<\/th>\n<th>Code Size<\/th>\n<th>Allocated<\/th>\n<th>Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 8.0<\/td>\n<td>3.6037 ns<\/td>\n<td>1.00<\/td>\n<td>31 B<\/td>\n<td>24 B<\/td>\n<td>1.00<\/td>\n<\/tr>\n<tr>\n<td>GetValue<\/td>\n<td>.NET 9.0<\/td>\n<td>0.0519 ns<\/td>\n<td>0.01<\/td>\n<td>6 B<\/td>\n<td>\u2013<\/td>\n<td>0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Here\u2019s another more impactful example. When it comes to performance optimization, it\u2019s really nice when the right things just happen; otherwise, developers need to learn the minute differences between performing an operation this way or that way. Every programming language and platform has non-trivial amounts of such things, but we really want to drive the number of them down. One interesting case for .NET has had to do with structs and casting. 
Consider these two Dispose1 and Dispose2 methods:<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[MemoryDiagnoser(false)]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public void Test()<br \/>\n    {<br \/>\n        Dispose1&lt;MyStruct&gt;(default);<br \/>\n        Dispose2&lt;MyStruct&gt;(default);<br \/>\n    }<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private bool Dispose1&lt;T&gt;(T o)<br \/>\n    {<br \/>\n        bool disposed = false;<br \/>\n        if (o is IDisposable disposable)<br \/>\n        {<br \/>\n            disposable.Dispose();<br \/>\n            disposed = true;<br \/>\n        }<br \/>\n        return disposed;<br \/>\n    }<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private bool Dispose2&lt;T&gt;(T o)<br \/>\n    {<br \/>\n        bool disposed = false;<br \/>\n        if (o is IDisposable)<br \/>\n        {<br \/>\n            ((IDisposable)o).Dispose();<br \/>\n            disposed = true;<br \/>\n        }<br \/>\n        return disposed;<br \/>\n    }<\/p>\n<p>    private struct MyStruct : IDisposable<br \/>\n    {<br \/>\n        public void Dispose() { }<br \/>\n    }<br \/>\n}<\/p>\n<p>Ideally, if you call them with a value type T, there wouldn\u2019t be any allocation, but unfortunately, in Dispose1 because of how things line up here, the JIT would end up needing to box o to produce the IDisposable. Interestingly, due to optimizations several years ago, in Dispose2 the JIT is in fact able to elide the boxing. 
On .NET 8, we get this:<\/p>\n<p>; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)<br \/>\n       push      rbx<br \/>\n       mov       rdi,offset MT_Tests+MyStruct<br \/>\n       call      CORINFO_HELP_NEWSFAST<br \/>\n       add       rax,8<br \/>\n       mov       ebx,[rsp+10]<br \/>\n       mov       [rax],bl<br \/>\n       mov       eax,1<br \/>\n       pop       rbx<br \/>\n       ret<br \/>\n; Total bytes of code 33<\/p>\n<p>; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)<br \/>\n       mov       eax,1<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>This is one of those things that a developer would have to \u201cjust know,\u201d and also fight against tooling like <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/style-rules\/ide0020-ide0038\">IDE0038<\/a> that pushes developers to write this code as in my first version, whereas for structs the latter ends up being more efficient. This work on stack allocation makes that difference go away, because the boxing that occurs as part of the first version is a quintessential example of the allocation the compiler is now able to stack allocate. On .NET 9, we now end up with this:<\/p>\n<p>; Tests.Dispose1[[Tests+MyStruct, benchmarks]](MyStruct)<br \/>\n       mov       eax,1<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>; Tests.Dispose2[[Tests+MyStruct, benchmarks]](MyStruct)<br \/>\n       mov       eax,1<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Mean<\/th>\n<th>Ratio<\/th>\n<th>Code Size<\/th>\n<th>Allocated<\/th>\n<th>Alloc Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 8.0<\/td>\n<td>5.726 ns<\/td>\n<td>1.00<\/td>\n<td>94 B<\/td>\n<td>24 B<\/td>\n<td>1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td>2.095 ns<\/td>\n<td>0.37<\/td>\n<td>45 B<\/td>\n<td>\u2013<\/td>\n<td>0.00<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Inlining<\/h2>\n<p>Improvements in inlining were a major focus of previous releases, and will likely be a major focus again in the future. 
For .NET 9, there weren\u2019t a ton of changes, but there was one particularly impactful improvement.<\/p>\n<p>As a motivating example, consider ArgumentNullException.ThrowIfNull again. It is defined like this:<\/p>\n<p>public static void ThrowIfNull(object? arg, [CallerArgumentExpression(nameof(arg))] string? paramName = null);<\/p>\n<p>Notably, it\u2019s non-generic, and that\u2019s a question we get asked with some frequency. We chose not to make it generic for three reasons:<\/p>\n<ol>\n<li>The main benefit of making it generic would be to avoid boxing structs, but the JIT already eliminated said boxing in tier 1, and as was highlighted earlier in this post, it\u2019s possible for it to eliminate it in tier 0 as well (and now does).<\/li>\n<li>Every generic instantiation (using a generic with a different type) adds runtime overhead. We didn\u2019t want to bloat a process with such additional metadata and runtime data structures just to support argument validation that should rarely if ever fail in production.<\/li>\n<li>When used with reference types (which is its raison d\u2019\u00eatre), it would not play well with inlining, but inlining of such a \u201cthrow helper\u201d is critical for performance. Generic methods with coreclr and Native AOT work in one of two ways. For value types, every time a generic is used with a different value type, an entire copy of the generic method is made and specialized for that parameter type; it\u2019s as if you wrote a dedicated version of that generic code that wasn\u2019t generic and was instead customized specifically for that type. For reference types, there\u2019s only one copy of the code that\u2019s then shared across all reference types, and it\u2019s parameterized at run-time based on the actual type being used. When you access such a shared generic, at run-time it ends up looking up in a dictionary the information about the generic argument and using the discovered information to inform the rest of the method. Historically, this has not been conducive to inlining.<\/li>\n<\/ol>\n<p>So, ThrowIfNull is non-generic. But there are other throw helpers, and many of them are generic. That\u2019s because a) they\u2019re primarily expected to work with value types, and b) we had no choice, given the nature of the methods. So, for example, ArgumentOutOfRangeException.ThrowIfEqual is generic on T, accepting two values of T to compare and throwing if they\u2019re the same. And if T is a reference type, on .NET 8 it may not successfully inline if the caller is a shared generic as well. With this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>namespace Benchmarks;<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic unsafe class Tests<br \/>\n{<br \/>\n    private static void Main(string[] args) =&gt;<br \/>\n        BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>    [Benchmark]<br \/>\n    public void Test() =&gt; ThrowOrDispose(new Version(1, 0), new Version(1, 1));<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static void ThrowOrDispose&lt;T&gt;(T value, T invalid) where T : IEquatable&lt;T&gt;<br \/>\n    {<br \/>\n        ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);<br \/>\n        if (value is IDisposable disposable)<br \/>\n        {<br \/>\n            disposable.Dispose();<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>on .NET 8 we get this for the ThrowOrDispose method (this benchmark has a slightly different shape from the previous examples, and its output is from Windows, for reasons that will become clear shortly):<\/p>\n<p>; Benchmarks.Tests.ThrowOrDispose[[System.__Canon, System.Private.CoreLib]](System.__Canon, 
System.__Canon)<br \/>\n       push      rsi<br \/>\n       push      rbx<br \/>\n       sub       rsp,28<br \/>\n       mov       [rsp+20],rcx<br \/>\n       mov       rbx,rdx<br \/>\n       mov       rsi,r8<br \/>\n       mov       rdx,[rcx+10]<br \/>\n       mov       rax,[rdx+10]<br \/>\n       test      rax,rax<br \/>\n       je        short M01_L00<br \/>\n       mov       rcx,rax<br \/>\n       jmp       short M01_L01<br \/>\nM01_L00:<br \/>\n       mov       rdx,7FF996A8B170<br \/>\n       call      CORINFO_HELP_RUNTIMEHANDLE_METHOD<br \/>\n       mov       rcx,rax<br \/>\nM01_L01:<br \/>\n       mov       rdx,rbx<br \/>\n       mov       r8,rsi<br \/>\n       mov       r9,1DB81B20390<br \/>\n       call      qword ptr [7FF996AC5BC0]; System.ArgumentOutOfRangeException.ThrowIfEqual[[System.__Canon, System.Private.CoreLib]](System.__Canon, System.__Canon, System.String)<br \/>\n       mov       rdx,rbx<br \/>\n       mov       rcx,offset MT_System.IDisposable<br \/>\n       call      qword ptr [7FF996664348]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfInterface(Void*, System.Object)<br \/>\n       test      rax,rax<br \/>\n       jne       short M01_L03<br \/>\nM01_L02:<br \/>\n       add       rsp,28<br \/>\n       pop       rbx<br \/>\n       pop       rsi<br \/>\n       ret<br \/>\nM01_L03:<br \/>\n       mov       rcx,rax<br \/>\n       mov       r11,7FF9965204F8<br \/>\n       call      qword ptr [r11]<br \/>\n       jmp       short M01_L02<br \/>\n; Total bytes of code 124<\/p>\n<p>Two things in particular to notice here. First, we see there\u2019s a call to CORINFO_HELP_RUNTIMEHANDLE_METHOD; that\u2019s the helper being used to obtain information about the actual type T being used. Second, ThrowIfEqual is not being inlined; if that were being inlined, we wouldn\u2019t see that call to ThrowIfEqual here but instead we\u2019d see the actual code for ThrowIfEqual. 
We can confirm why it\u2019s not being inlined via another BenchmarkDotNet diagnoser: [InliningDiagnoser]. The JIT is capable of emitting events for much of its activity, including reporting on any successful or failed inlining operations, and [InliningDiagnoser] listens to those events and reports them as part of the benchmarking results. This particular diagnoser is in a separate BenchmarkDotNet.Diagnostics.Windows package and only works today when running on Windows, because it relies on ETW, which is why, for comparison purposes, I made the previous benchmark a Windows one as well. When I add:<\/p>\n<p>[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])]<\/p>\n<p>to my Tests class, and run the benchmarks for .NET 8:<\/p>\n<p>\/\/ Add a &lt;PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="9.0.0" \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Diagnostics.Windows.Configs;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>namespace Benchmarks;<\/p>\n<p>[InliningDiagnoser(allowedNamespaces: ["Benchmarks"])]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static void Main(string[] args) =&gt;<br \/>\n        BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>    [Benchmark]<br \/>\n    public void Test() =&gt; ThrowOrDispose(new Version(1, 0), new Version(1, 1));<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static void ThrowOrDispose&lt;T&gt;(T value, T invalid) where T : IEquatable&lt;T&gt;<br \/>\n    {<br \/>\n        ArgumentOutOfRangeException.ThrowIfEqual(value, invalid);<br \/>\n        if (value is IDisposable 
disposable)<br \/>\n        {<br \/>\n            disposable.Dispose();<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>I see this as part of the output:<\/p>\n<p>Inliner: Benchmarks.Tests.ThrowOrDispose - generic void  (!!0,!!0)<br \/>\nInlinee: System.ArgumentOutOfRangeException.ThrowIfEqual - generic void  (!!0,!!0,class System.String)<br \/>\nFail Reason: runtime dictionary lookup<\/p>\n<p>In other words, ThrowOrDispose called ThrowIfEqual but couldn\u2019t inline it because ThrowIfEqual contained a \u201cruntime dictionary lookup\u201d; that is, it\u2019s a shared generic method.<\/p>\n<p>Now on .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99265\">dotnet\/runtime#99265<\/a>, it is inlined! The resulting assembly is too large for me to show here, but we can see the impact in the benchmark results:<\/p>\n<table>\n<thead>\n<tr>\n<th>Method<\/th>\n<th>Runtime<\/th>\n<th>Mean<\/th>\n<th>Ratio<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Test<\/td>\n<td>.NET 8.0<\/td>\n<td>17.54 ns<\/td>\n<td>1.00<\/td>\n<\/tr>\n<tr>\n<td>Test<\/td>\n<td>.NET 9.0<\/td>\n<td>12.76 ns<\/td>\n<td>0.73<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>and the inlining report now shows it being successfully inlined.<\/p>\n<h2>GC<\/h2>\n<p>Applications end up having different needs when it comes to memory management. Would you be willing to throw more memory at maximizing throughput, or do you care more about minimizing working set? How important is it that unused memory be returned to the system aggressively? Is your expected workload constant or ebbing and flowing in nature? The GC has long had lots of knobs for configuring behavior based on these kinds of questions, but none more apparent than the choice of whether to use the \u201cworkstation GC\u201d or \u201cserver GC\u201d.<\/p>\n<p>By default, an application uses the workstation GC, though some environments (like ASP.NET) opt in to using server GC automatically. 
You can opt in explicitly in a variety of ways, including by adding &lt;ServerGarbageCollection&gt;true&lt;\/ServerGarbageCollection&gt; into your project file (as we did in the <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/#benchmarking-setup1\">Benchmarking Setup<\/a> section of this post). Workstation GC optimizes for reduced memory consumption, while server GC optimizes for maximum throughput. Historically, workstation has employed a single heap, whereas server employs a heap per core. That typically represents a tradeoff between the amount of memory consumed and the overhead of accessing a heap, such as the cost of allocating. If a bunch of threads are all trying to allocate at the same time, with server GC they\u2019re very likely to all be accessing different heaps, thereby reducing contention, whereas with workstation GC, they\u2019re all going to be fighting for access. Conversely, more heaps generally means more memory consumed (even though each heap could be smaller than the single one), especially in lull periods where the system might not be fully loaded, yet is paying in working set for those extra heaps.<\/p>\n<p>The decision for which to use isn\u2019t always so clear. Especially in the presence of containers, you frequently still care about really good throughput, but also don\u2019t want to be spending memory uselessly. Enter <a href=\"https:\/\/maoni0.medium.com\/dynamically-adapting-to-application-sizes-2d72fcb6f1ea\">\u201cDATAS,\u201d or \u201cDynamically Adapting To Application Sizes\u201d<\/a>. DATAS was introduced in .NET 8 and serves to narrow the gap between workstation and server GC, bringing server GC closer to workstation memory consumption. It dynamically scales how much memory is being consumed by server GC, such that in times of less load, less memory is being used. While DATAS shipped in .NET 8, it was only on by default for Native AOT-based projects, and even there it still had some issues to be sorted. 
Those issues have now been sorted (e.g. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98743\">dotnet\/runtime#98743<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100390\">dotnet\/runtime#100390<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102368\">dotnet\/runtime#102368<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105545\">dotnet\/runtime#105545<\/a>), such that in .NET 9, as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103374\">dotnet\/runtime#103374<\/a>, DATAS is now enabled by default for server GC.<\/p>\n<p>If you have a workload where absolute best possible throughput is paramount and you\u2019re ok with additional memory being consumed to enable that, you should feel free to disable DATAS, e.g. by adding this to your project file:<\/p>\n<p>&lt;GarbageCollectionAdaptationMode&gt;0&lt;\/GarbageCollectionAdaptationMode&gt;<\/p>\n<p>While DATAS being enabled by default is a very impactful improvement for .NET 9, there are other GC-related improvements in the release as well. For example, when compacting heaps, the GC may end up sorting objects by addresses. For large numbers of objects, this sort can be relatively expensive, and it behooves the GC to accelerate the sorting operation. For this purpose, several releases ago the GC incorporated a vectorized sorting algorithm called vxsort, which is effectively a quicksort with a vectorized partitioning step. However, it was only enabled for Windows (and only on x64). In .NET 9, it\u2019s enabled for Linux as well as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98712\">dotnet\/runtime#98712<\/a>. This helps to reduce GC pause times.<\/p>\n<h2>VM<\/h2>\n<p>The .NET runtime provides many services to managed code. 
There\u2019s the GC, of course, and the JIT compiler, and then there\u2019s a whole bunch of functionality around things like assembly and type loading, exception handling, configuration management, virtual dispatch, interop infrastructure, stub management, and so on. All of that functionality is generally referred to as being part of the coreclr virtual machine (VM).<\/p>\n<p>Many performance changes in this area are hard to demonstrate, but they\u2019re still worth mentioning. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101580\">dotnet\/runtime#101580<\/a> lazily allocates some information related to method entrypoints, resulting in smaller heap sizes and less work on startup. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96857\">dotnet\/runtime#96857<\/a> also removed some unnecessary allocation related to data structures around methods. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96703\">dotnet\/runtime#96703<\/a> reduced the algorithmic complexity of some key functions involved in building up method tables, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96466\">dotnet\/runtime#96466<\/a> streamlined access to those tables, minimizing the number of indirections involved.<\/p>\n<p>Another set of changes went into improving various calls from managed code into the VM. When managed code needs to call into the runtime, it has a couple of mechanisms it can employ. One is a \u201cQCALL,\u201d which is effectively just a P\/Invoke \/ DllImport into functions declared in the runtime. The other is an \u201cFCALL,\u201d a much more specialized and complicated mechanism for invoking runtime code that\u2019s capable of accessing managed objects. 
FCALL used to be the dominant mechanism, but with each release more and more such calls are transitioned over to being QCALLs, which helps with both correctness (FCALLs can be hard to \u201cget right\u201d) and in some cases performance (some FCALLs need helper method frames that in turn typically make them more expensive than QCALLs). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96860\">dotnet\/runtime#96860<\/a> converted over members of Marshal, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96916\">dotnet\/runtime#96916<\/a> did so for Interlocked, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96926\">dotnet\/runtime#96926<\/a> handled several more threading-related members, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97432\">dotnet\/runtime#97432<\/a> converted some of the built-in marshaling support, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97469\">dotnet\/runtime#97469<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100939\">dotnet\/runtime#100939<\/a> handled methods from GC and throughout reflection, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103211\">dotnet\/runtime#103211<\/a> from <a href=\"https:\/\/github.com\/AustinWise\">@AustinWise<\/a> converted GC.ReRegisterForFinalize, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105584\">dotnet\/runtime#105584<\/a> converted Delegate.GetMulticastInvoke (which is used in APIs like Delegate.Combine and Delegate.Remove). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97590\">dotnet\/runtime#97590<\/a> did the same for the slow path in ValueType.GetHashCode, while also converting the fast path to managed to avoid the transition entirely.<\/p>\n<p>But arguably the most impactful change in this area for .NET 9 is around exceptions. Exceptions are expensive and should be avoided where performance matters. But\u2026 just because they\u2019re expensive doesn\u2019t mean it\u2019s not valuable to make them less expensive. 
And in fact, there are cases where it\u2019s really worthwhile to make them less expensive. One of the things we sporadically observe in the wild is \u201cexception storms.\u201d Some failure happens, which causes another failure, which causes another. Each of those incurs exceptions. CPU consumption starts to spike as the overhead of those exceptions is incurred. Now other things start to time out because they\u2019re getting starved, and they throw exceptions, which in turn causes more failures. You get the idea.<\/p>\n<p>In <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">Performance Improvements in .NET 8<\/a>, I highlighted that in my opinion the single most important performance improvement in the release was a single character change, enabling dynamic PGO by default. Now in .NET 9, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98570\">dotnet\/runtime#98570<\/a> is similar, a super small and simple PR that belies all the work that came before it. Earlier on, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88034\">dotnet\/runtime#88034<\/a> had ported the Native AOT exception handling implementation over to coreclr, but it was disabled by default due to still needing bake time. It\u2019s now had that bake time, and the new implementation is now on by default in .NET 9. And it\u2019s much faster. 
Things get better still with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103076\">dotnet\/runtime#103076<\/a>, which removes a global spinlock involved in the handling of exceptions.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter \"*\" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(\"Job\", \"Error\", \"StdDev\", \"Median\", \"RatioSD\")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public async Task ExceptionThrowCatch()<br \/>\n    {<br \/>\n        for (int i = 0; i &lt; 1000; i++)<br \/>\n        {<br \/>\n            try { await Recur(10); } catch { }<br \/>\n        }<br \/>\n    }<\/p>\n<p>    private async Task Recur(int depth)<br \/>\n    {<br \/>\n        if (depth &lt;= 0)<br \/>\n        {<br \/>\n            await Task.Yield();<br \/>\n            throw new Exception();<br \/>\n        }<\/p>\n<p>        await Recur(depth - 1);<br \/>\n    }<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>ExceptionThrowCatch<\/td><td>.NET 8.0<\/td><td>123.03 ms<\/td><td>1.00<\/td><\/tr>\n<tr><td>ExceptionThrowCatch<\/td><td>.NET 9.0<\/td><td>54.68 ms<\/td><td>0.44<\/td><\/tr>\n<\/tbody>\n<\/table>\n<h2>Mono<\/h2>\n<p>We frequently say \u201cthe runtime,\u201d but in reality there are currently multiple runtime implementations in .NET. \u201ccoreclr\u201d is the runtime thus far referred to, which is the default runtime used on Windows, Linux, and macOS, and for services and desktop applications, but there\u2019s also \u201cmono,\u201d which is mainly used when the form factor of the target application requires a small runtime: by default, it\u2019s the runtime that\u2019s used when building mobile apps for Android and iOS today, as well as the runtime used for Blazor WASM apps. 
mono has also seen a multitude of performance improvements in .NET 9:<\/p>\n<p><strong>Saving\/restoring of profile data.<\/strong> One of the features provided by mono is an interpreter, which enables .NET code to execute in environments where JIT\u2019ing isn\u2019t permitted, as well as to enable faster startup. Specifically for when targeting WASM, the interpreter has a form of PGO where after methods have been invoked some number of times and are deemed important, it\u2019ll generate WASM on-the-fly to optimize those methods. This tiering gets better in .NET 9 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92981\">dotnet\/runtime#92981<\/a>, which enables keeping track of which methods tiered up, and if the code is running in a browser, storing that information in the browser\u2019s cache for subsequent runs. When the code then runs subsequently, it can incorporate the previous learnings to tier up better and more quickly.<br \/>\n<strong>SSA-based Optimization.<\/strong> The compiler that generates that WASM previously applied optimizations primarily at the basic block level. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96315\">dotnet\/runtime#96315<\/a> overhauls the implementation to employ Static Single Assignment (SSA) form, which is commonly used by optimizing compilers and which ensures that every variable is assigned in exactly one location. That form simplifies many resulting analyses and thus helps to better optimize the code.<br \/>\n<strong>Vector improvements.<\/strong> More and more vectorization is being done by the core libraries, utilizing hardware intrinsics and the various Vector types. To enable such library code to execute well on mono, the various mono backends need to also handle those operations efficiently. 
One of the most impactful changes here is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105299\">dotnet\/runtime#105299<\/a>, which updated mono to accelerate Shuffle for types other than byte and sbyte (which were already handled). This is very impactful to functionality in the core libraries, many of which use Shuffle as part of core algorithms, like throughout IndexOfAny, hex encoding and decoding, Base64 encoding and decoding, Guid, and more. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92714\">dotnet\/runtime#92714<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98037\">dotnet\/runtime#98037<\/a> also improved vector construction, such as by enabling the mono JIT to utilize the Arm64 ins (Insert) instruction when creating one float or double vector from the values in another.<br \/>\n<strong>More intrinsics.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98077\">dotnet\/runtime#98077<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98514\">dotnet\/runtime#98514<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98710\">dotnet\/runtime#98710<\/a> implemented various AdvSimd.Load* and AdvSimd.Store* APIs. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99115\">dotnet\/runtime#99115<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101622\">dotnet\/runtime#101622<\/a> intrinsified several clearing and filling methods that back Span&lt;T&gt;.Clear\/Fill. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105150\">dotnet\/runtime#105150<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104698\">dotnet\/runtime#104698<\/a> optimized various Unsafe methods, such as BitCast. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91813\">dotnet\/runtime#91813<\/a> also significantly improved unaligned access on a variety of CPUs by not forcing the implementation down a slow path if the CPU is able to handle such reads and writes.<br \/>\n<strong>Startup<\/strong>. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100146\">dotnet\/runtime#100146<\/a> is a fun one, as it had accidentally positive benefits for mono startup. The change updated dotnet\/runtime\u2019s configuration to enable more static analysis, and in particular enforcing <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1865-ca1867\">CA1865, CA1866, and CA1867<\/a>, which we hadn\u2019t yet gotten around to enabling for the repo. The change included fixing all of the violations of the rules, which mostly meant fixing call sites like IndexOf(&#8220;!&#8221;) (IndexOf taking a single-character string) and replacing it with IndexOf(&#8216;!&#8217;). The intent of the rule was that doing so is a little bit faster and the call site ends up being a little bit cleaner. But IndexOf(string) is cultural-aware, which means using it can force the globalization library ICU to be loaded and initialized. As it turns out, some of these uses were on mono\u2019s startup path and were forcing ICU to be loaded when it wasn\u2019t actually necessary. Fixing those meant the loading could be delayed, and startup performance improved as a result. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101312\">dotnet\/runtime#101312<\/a> also improved startup with the interpreter by adding a cache to the code that does vtable setups. 
This uses a custom hash table implementation added in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100386\">dotnet\/runtime#100386<\/a>, which is then also used elsewhere, such as in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101460\">dotnet\/runtime#101460<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102476\">dotnet\/runtime#102476<\/a>. That hash table is itself interesting, as its lookups are vectorized for x64, Arm, and WASM and it\u2019s generally optimized for cache locality.<br \/>\n<strong>Variance check removal.<\/strong> When storing objects into an array, that operation needs to be validated to ensure compatibility between the type being stored and the concrete type of the array. Given a base type B and two derived types D1 : B and D2 : B, you could have an array B[] array = new D1[42];, and then the code array[0] = new D2() would successfully compile, because D2 is a B, but at run-time this must fail, as D2 is not a D1, and so the runtime needs a check to ensure correctness. If the array\u2019s type is sealed, though, this check can be avoided, since then you can\u2019t end up with this discrepancy. coreclr already does that optimization; now as part of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99829\">dotnet\/runtime#99829<\/a>, the mono interpreter does so as well.<\/p>\n<h2>Native AOT<\/h2>\n<p>Native AOT is a solution for generating native executables directly from .NET applications. The resulting binary doesn\u2019t require .NET to be installed and does not require JIT\u2019ing; instead it contains all of the assembly code for the whole app, inclusive of the code for any core library functionality accessed, the assembly for the garbage collector, and so on. Native AOT first shipped in .NET 7 and was then significantly improved for .NET 8, in particular around reducing the size of the resulting applications. 
Now in .NET 9, investment continues in Native AOT, with some very nice fruits of the labor on it. (Note that the Native AOT tool chain uses the JIT to generate assembly code, so most of the code generation improvements discussed in the <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/#jit2\">JIT<\/a> section and elsewhere in this post accrue to Native AOT as well.)<\/p>\n<p>One of the biggest concerns for Native AOT is size and trimming. Native AOT-based applications and libraries compile everything, all user code, all the library code, the runtime, everything, into the single native binary. It\u2019s thus imperative that the tool chain goes to extra lengths to get rid of as much as possible in order to keep that size down. This can include being more clever about how the runtime stores the state necessary for execution. It can include being more thoughtful about generics in order to reduce the possible code size explosion that can result from lots of generic instantiations (effectively multiple copies of the exact same code all specialized for different type arguments). And it can include being very diligent about avoiding dependencies that can bring in lots of code unexpectedly and that the trimming tools are unable to reason about enough to remove. Here are some examples of all of these in .NET 9:<\/p>\n<p><strong>Refactoring choke points.<\/strong> Think through your code: how many times have you written a method that takes some input and then dispatches to one of many different kinds of things based on the input provided? That\u2019s reasonably common. Unfortunately, it can also be problematic for Native AOT code size. A good example is fixed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91185\">dotnet\/runtime#91185<\/a> in System.Security.Cryptography. There are a bunch of hashing related types, like SHA256 or SHA3_384, that all offer a HashData method. 
Then there are places where the exact hashing algorithm to be used is specified via a HashAlgorithmName. You can likely envision the large switch statement that results (or don\u2019t imagine and instead just look at <a href=\"https:\/\/github.com\/vcsjones\/runtime\/blob\/0dbb857f21f5177abe7dcd431b07f36272aa8e28\/src\/libraries\/Common\/src\/System\/Security\/Cryptography\/HashOneShotHelpers.cs#L15-L26\">the code<\/a>), where based on the exact HashAlgorithmName specified, the implementation selects the right type\u2019s HashData method to call. That is what\u2019s often referred to as a \u201cchoke point,\u201d where all callers end up coming through this one method, which then fans out to the relevant implementations, but that also then causes this size problem for Native AOT: if that choke point is referenced, it typically ends up needing to generate the code for all of the referenced methods, even if only a subset are actually used. Some of these cases are really challenging to solve. In this particular case, though, thankfully all of those HashData methods turned around and called to a parameterized, shared implementation. So the fix was to just skip the middle tier and have the HashAlgorithmName layer go directly to the workhorse implementation, without naming the intermediate layer methods.<br \/>\n<strong>Less LINQ.<\/strong> LINQ is an amazing productivity tool. We love LINQ and invest in it every release of .NET (see the later section in this post on <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/#linq32\">tons of performance improvements in LINQ in .NET 9<\/a>). With Native AOT, however, significant use of LINQ can also measurably increase code size, in particular when value types are involved. As will be discussed later when talking about LINQ optimizations, one of the optimizations LINQ employs is to special-case based on the inputs what kind of IEnumerable&lt;T&gt; its methods give back. 
So, for example, if you call Select with an array input, the IEnumerable&lt;T&gt; you get back might actually be an instance of the internal ArraySelectIterator&lt;T&gt;, and if you call Select with a List&lt;T&gt;, the IEnumerable&lt;T&gt; you get back might actually be an instance of the internal ListSelectIterator&lt;T&gt;. The Native AOT trimmer can\u2019t readily determine which of those paths might be used, so the Native AOT compiler needs to generate code for all such types when you call Select&lt;T&gt;. If the T is a reference type, there will just be a single copy of the generated code shared for all reference types. But if the T is a value type, there will be a custom stamp of the code generated for and optimized for each unique T. That means if such LINQ APIs (and other similar APIs) are used a lot, they can disproportionately increase the size of a Native AOT binary. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98109\">dotnet\/runtime#98109<\/a> is an example of a PR that replaced a bit of LINQ code in order to measurably reduce the size of ASP.NET applications compiled with Native AOT. But you can also see that PR being thoughtful about which LINQ usage was removed, citing these few specific instances making a measurable difference and leaving the rest of the LINQ usage in the library intact.<br \/>\n<strong>Avoiding unnecessary array types.<\/strong> The SharedArrayPool&lt;T&gt; that backs ArrayPool&lt;T&gt;.Shared was storing lots of state, including several fields with types along the lines of T[][]. This makes sense; it\u2019s pooling arrays, so it needs an array of arrays. From a Native AOT perspective, though, if T is a value type (as is very common with ArrayPool&lt;T&gt;), T[][] as its own unique array type needs its own code generated for it, distinct from the code for, for example, T[]. 
As it turns out, ArrayPool&lt;T&gt; doesn\u2019t actually need to work with the array instances in these cases, so it doesn\u2019t actually need the strongly-typed nature of the arrays; this could just as well be object[] or Array[]. And that\u2019s one of the main things that <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97058\">dotnet\/runtime#97058<\/a> does: with that, the compiled binary can carry the code generated for just Array[] rather than needing code for byte[][] and char[][] and object[][] and for whatever other type Ts are used with ArrayPool&lt;T&gt; in the application.<br \/>\n<strong>Avoiding unnecessarily generic code.<\/strong> The Native AOT compiler doesn\u2019t do any kind of \u201coutlining\u201d today (the opposite of inlining, where, rather than moving code from a called method into the caller, the compiler would extract code from a method out into a separate method). If you have a large method, the compiler will need to generate code for the whole method, and if that method is generic and multiple generic specializations are compiled, the whole method will be compiled and optimized for each. But, if you have any meaningful amounts of code in that method that don\u2019t actually depend on the generic types in question, you can avoid that duplication by refactoring the code into separate non-generic methods that are invoked by the generic. That\u2019s what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101474\">dotnet\/runtime#101474<\/a> does in some of the types in Microsoft.Extensions.Logging.Console, like SimpleConsoleFormatter and JsonConsoleFormatter. There\u2019s a generic Write&lt;TState&gt; method, but the TState is only used in the very first line of the method, which formats the arguments into a string. After that, there\u2019s a lot of logic about doing the actual writing, but all of it only needs the output of that formatting operation, not the input. 
So, this PR simply refactors that Write&lt;TState&gt; to do the formatting and then delegates to the bulk of the work in a separate non-generic method.<br \/>\n<strong>Cutting out unnecessary dependencies.<\/strong> There are many small but meaningful dependencies one doesn\u2019t think about until they start focusing on generated code size and zooming in on exactly where all that code size is coming from. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95710\">dotnet\/runtime#95710<\/a> is a good example. The AppContext.OnProcessExit method is rooted (never trimmed) by the runtime because it\u2019s invoked when the process is exiting. That OnProcessExit was accessing AppDomain.CurrentDomain, which returns an AppDomain. AppDomain\u2018s ToString override depends on a bunch of stuff. And ToString on a type that\u2019s not trimmed away is itself basically never trimmed because if anything anywhere calls to the base object.ToString, the system needs to know that any possible derived type that might find its way to that call site will be invokable. That all means that all of that stuff used by AppDomain.ToString was never being trimmed. This small refactoring made it so that all that stuff would only need to be kept if AppDomain.CurrentDomain is ever actually accessed by user code. Another example of this comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101858\">dotnet\/runtime#101858<\/a>, which removes a dependency on some of the Convert methods.<br \/>\n<strong>Using a better tool for the job.<\/strong> Sometimes there\u2019s just a better, simpler answer. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100916\">dotnet\/runtime#100916<\/a> highlights one such case. Some code in Microsoft.Extensions.DependencyInjection needed a MethodInfo for a particular method, and it was using System.Linq.Expressions to extract one, when it could instead just use a delegate. 
That\u2019s not only cheaper in terms of allocation and overhead, it removes a dependency on the Expressions library.<br \/>\n<strong>Compile time instead of run time.<\/strong> Source generators are a great boon for Native AOT, as they enable computing things at build time and baking the results into the assembly rather than computing those same things at run time (which, in the relevant situations, typically is done once and then cached). That\u2019s useful for startup performance, as you\u2019re not having to do that work just to get going. It\u2019s useful for steady-state throughput, as you can often take more time to do a better job when the work is being done at build time. But it\u2019s also useful for size, because it removes a dependency on anything that might have been used as part of the computation. And often that dependency is reflection, which brings with it a lot of size. As it turns out, System.Private.CoreLib has its own source generator that\u2019s used when building CoreLib, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102164\">dotnet\/runtime#102164<\/a> augmented that source generator to generate a dedicated implementation of Environment.Version and RuntimeInformation.FrameworkDescription. Previously, both of these methods that are implemented in CoreLib would use reflection to look up attributes also in CoreLib, but that\u2019s something the source generator can instead do at a build time, and just bake the answer into the implementation of these methods.<br \/>\n<strong>Avoiding duplication.<\/strong> It\u2019s not uncommon to have two methods somewhere in your app that have the same implementations, especially for small helper methods, like property accessors. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101969\">dotnet\/runtime#101969<\/a> teaches the Native AOT tool chain to deduplicate those, such that the code is only stored once.<br \/>\n<strong>Interfaces be gone.<\/strong> Previously, unused interface methods could be trimmed away (effectively removing them from the interface type and removing all implementations of that method), but the compiler wasn\u2019t able to fully remove the actual interface types themselves. Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100000\">dotnet\/runtime#100000<\/a>, it can.<br \/>\n<strong>Unnecessary static constructors.<\/strong> The trimmer was keeping the static constructor of a type if any field was accessed. This is unnecessarily broad: those static constructors only need to be kept if a <em>static<\/em> field was accessed. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96656\">dotnet\/runtime#96656<\/a> improves that.<\/p>\n<p>Previous releases saw a considerable amount of time spent on driving down binary sizes, but these kinds of changes chip away at them even further. Let\u2019s create a new ASP.NET minimal APIs application using Native AOT. 
This command uses the webapiaot template and creates the new project in a new myapp directory:<\/p>\n<p>dotnet new webapiaot -o myapp<\/p>\n<p>Replace the contents of the generated myapp.csproj with this:<\/p>\n<p>&lt;Project Sdk=\"Microsoft.NET.Sdk.Web\"&gt;<\/p>\n<p>  &lt;PropertyGroup&gt;<br \/>\n    &lt;TargetFrameworks&gt;net9.0;net8.0&lt;\/TargetFrameworks&gt;<br \/>\n    &lt;Nullable&gt;enable&lt;\/Nullable&gt;<br \/>\n    &lt;ImplicitUsings&gt;enable&lt;\/ImplicitUsings&gt;<br \/>\n    &lt;InvariantGlobalization&gt;true&lt;\/InvariantGlobalization&gt;<br \/>\n    &lt;PublishAot&gt;true&lt;\/PublishAot&gt;<\/p>\n<p>    &lt;OptimizationPreference&gt;Size&lt;\/OptimizationPreference&gt;<br \/>\n    &lt;StackTraceSupport&gt;false&lt;\/StackTraceSupport&gt;<br \/>\n  &lt;\/PropertyGroup&gt;<\/p>\n<p>&lt;\/Project&gt;<\/p>\n<p>All I\u2019ve done on top of the template\u2019s defaults is have both net9.0 and net8.0 as target frameworks and then add a couple of settings (at the bottom) focused on driving down the size of Native AOT apps. The app is a simple site that exposes a \/todos list as JSON.<\/p>\n<p>We can publish this app with Native AOT:<\/p>\n<p>dotnet publish -f net8.0 -r linux-x64 -c Release<br \/>\nls -hs bin\/Release\/net8.0\/linux-x64\/publish\/myapp<\/p>\n<p>which yields:<\/p>\n<p>9.4M bin\/Release\/net8.0\/linux-x64\/publish\/myapp<\/p>\n<p>We can see here that the whole site, web server, garbage collector, everything, is contained in the myapp app, which on .NET 8 is weighing in at 9.4 megabytes. 
Now, let\u2019s do the same thing for .NET 9:<\/p>\n<p>dotnet publish -f net9.0 -r linux-x64 -c Release<br \/>\nls -hs bin\/Release\/net9.0\/linux-x64\/publish\/myapp<\/p>\n<p>which results in:<\/p>\n<p>8.5M bin\/Release\/net9.0\/linux-x64\/publish\/myapp<\/p>\n<p>Now, just by moving to the new version, that same myapp has shrunk to 8.5 megabytes, an ~10% reduction in binary size.<\/p>\n<p>Beyond a focus on size, ahead-of-time compilation also differs from just-in-time compilation in that each has its own opportunities for unique optimizations. The JIT can see the exact details of the current machine and employ the best possible instructions based on what\u2019s available (e.g. using AVX512 instructions on hardware that supports it), and the JIT can use dynamic PGO to evolve the code based on execution characteristics. But, Native AOT is capable of doing whole program optimization, where it can look at everything in the program and optimize based on the totality of everything involved (in contrast, a JIT\u2019d .NET application may load additional .NET libraries at any point). For example, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92923\">dotnet\/runtime#92923<\/a> enables automatically making fields readonly based on looking at the whole program to see if anything could possibly write to the field from outside of the constructor; this can in turn help things like improving pre-initialization.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99761\">dotnet\/runtime#99761<\/a> provides a nice example where, based on whole program analysis, the compiler can see that a particular type is never instantiated. If it\u2019s never instantiated, then type checks for that type will never succeed. And thus if a program has a check like if (variable is SomethingNeverInstantiated), that can be turned into a constant false, and all of the code associated with that if block then eliminated. 
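<\/p>
<p>To make that type-check folding concrete, here\u2019s a minimal, hypothetical sketch (the Animal\/Unicorn types are invented for illustration and aren\u2019t from the PR):<\/p>

```csharp
class Animal { }
sealed class Unicorn : Animal { } // referenced below, but never instantiated

class Program
{
    static string Describe(Animal animal)
    {
        // Under Native AOT, whole-program analysis can prove that no Unicorn
        // is ever constructed, so this check folds to a constant 'false' and
        // the branch (plus anything only it referenced) is eliminated from
        // the binary. Under a JIT, it remains an ordinary run-time type check
        // with the same observable result.
        if (animal is Unicorn)
        {
            return "a unicorn";
        }

        return "just an animal";
    }

    static void Main()
    {
        System.Console.WriteLine(Describe(new Animal())); // prints "just an animal"
    }
}
```

<p>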
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102248\">dotnet\/runtime#102248<\/a> is similar, but for types; if code is doing if (someType == typeof(X)) and the compiler never had to construct a method table for X, it can similarly turn this into a constant result.<\/p>\n<p>Whole program analysis is also applicable to devirtualization in really cool ways. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92440\">dotnet\/runtime#92440<\/a>, the compiler can now devirtualize all calls to a virtual method C.M if the compiler doesn\u2019t see any instantiations of types that derive from C. And with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97812\">dotnet\/runtime#97812<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97867\">dotnet\/runtime#97867<\/a>, the compiler can now treat virtual methods as instead being non-virtual and sealed when there are no overrides of those methods anywhere in the program.<\/p>\n<p>NativeAOT also has a super power in its ability to do pre-initialization. The compiler contains an interpreter that\u2019s able to evaluate code at build time and replace that code with just the result; for some objects, it\u2019s then also able to blit the object\u2019s data into the binary in a way that it can be cheaply dehydrated at execution time. The interpreter is limited in what it\u2019s able and allowed to do, but over time its capabilities are improving. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92470\">dotnet\/runtime#92470<\/a> extends it to support more type checks, static interface method calls, constrained method calls, and various operations on spans, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92666\">dotnet\/runtime#92666<\/a> expands it to have some support for hardware intrinsics and the various IsSupported methods. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92739\">dotnet\/runtime#92739<\/a> further rounds it out with support for stackalloc\u2018ing spans, IntPtr\/nint math, and Unsafe.Add.<\/p>\n<h2>Threading<\/h2>\n<p>Since the beginning of .NET, general wisdom has been that the vast majority of code that needs to synchronize access to shared state should just use Monitor, either directly or more likely via the the C# language syntax for it with lock(&#8230;). There are a plethora of other synchronization primitives available, at various levels of complexity and with varying purposes, but lock(&#8230;) is the workhorse and the thing that everyone should reach for by default.<\/p>\n<p>Over 20 years after the introduction of .NET, that\u2019s evolving, just a bit. lock(&#8230;) is still the go-to syntax, but in .NET 9 as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87672\">dotnet\/runtime#87672<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102222\">dotnet\/runtime#102222<\/a>, there is now a dedicated System.Threading.Lock type. Anywhere you were previously allocating an object just to use that object with lock(&#8230;), you should consider using the new Lock type. You can absolutely still just use object, and you\u2019ll still need to do so in certain situations, like if you\u2019re using the \u201ccondition variable\u201d aspects of Monitor (such as Signal and Wait), and you\u2019ll still want to in others (such as if you\u2019re trying to reduce managed allocation and you have another existing object that can serve double-duty as the monitor). But locking a Lock can be a more efficient answer. 
It can also help to be self-documenting, making the code cleaner and more maintainable.<\/p>\n<p>As is evident from this benchmark, the syntax for using both can be identical.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private readonly object _monitor = new();<br \/>\n    private readonly Lock _lock = new();<br \/>\n    private int _value;<\/p>\n<p>    [Benchmark]<br \/>\n    public void WithMonitor()<br \/>\n    {<br \/>\n        lock (_monitor)<br \/>\n        {<br \/>\n            _value++;<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public void WithLock()<br \/>\n    {<br \/>\n        lock (_lock)<br \/>\n        {<br \/>\n            _value++;<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Lock, however, will generally be a tad cheaper (and in the future, as most locking shifts to use the new type, we may be able to make most objects lighter weight by not optimizing for direct locking on arbitrary objects):<\/p>\n<p>Method<br \/>\nMean<\/p>\n<p>WithMonitor<br \/>\n14.30 ns<\/p>\n<p>WithLock<br \/>\n13.86 ns<\/p>\n<p>Note that C# 13 has special recognition of System.Threading.Lock.
If you look at the code that\u2019s generated for WithMonitor above, it\u2019s equivalent to this:<\/p>\n<p>public void WithMonitor()<br \/>\n{<br \/>\n    object monitor = _monitor;<br \/>\n    bool lockTaken = false;<br \/>\n    try<br \/>\n    {<br \/>\n        Monitor.Enter(monitor, ref lockTaken);<br \/>\n        _value++;<br \/>\n    }<br \/>\n    finally<br \/>\n    {<br \/>\n        if (lockTaken)<br \/>\n        {<br \/>\n            Monitor.Exit(monitor);<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>but even though the syntax is identical, here\u2019s an equivalent of what\u2019s generated for WithLock:<\/p>\n<p>Lock.Scope scope = _lock.EnterScope();<br \/>\ntry<br \/>\n{<br \/>\n    _value++;<br \/>\n}<br \/>\nfinally<br \/>\n{<br \/>\n    scope.Dispose();<br \/>\n}<\/p>\n<p>We\u2019ve also started using Lock internally. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103085\">dotnet\/runtime#103085<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103104\">dotnet\/runtime#103104<\/a> used it instead of object locks in Timer, ThreadLocal, and RegisteredWaitHandle. In time, I expect to see more and more use switched over.<\/p>\n<p>Of course, while locks are the default recommendation for synchronization, there\u2019s still a lot of code that demands the higher throughput and scalability that comes from more lock-free programming, and the workhorse for such implementations is Interlocked. In .NET 9, Interlocked.Exchange and Interlocked.CompareExchange gain some very welcome capabilities. 
First, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92974\">dotnet\/runtime#92974<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97588\">dotnet\/runtime#97588<\/a> from <a href=\"https:\/\/github.com\/filipnavara\">@filipnavara<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106660\">dotnet\/runtime#106660<\/a> grant Interlocked some new powers: the ability to operate over types smaller than int. They introduce new overloads of Exchange and CompareExchange that can work on byte, sbyte, ushort, and short. These overloads are public and available for anyone to call, but they\u2019re also then consumed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97528\">dotnet\/runtime#97528<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a> to improve Parallel.ForAsync&lt;T&gt;. ForAsync is given a range of T to be processed, and schedules multiple workers that all need to repeatedly get the next item from the range, until the range is exhausted. For arbitrary types, that means ForAsync needs to lock to protect the increment while iterating through the range.
But for types where an Interlocked operation is available, we can use that with low-lock techniques to avoid the lock entirely (both the need to access it and the need to allocate it).<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public async Task ParallelForAsync()<br \/>\n    {<br \/>\n        await Parallel.ForAsync(&#039;\\0&#039;, &#039;\\uFFFF&#039;, async (c, _) =&gt;<br \/>\n        {<br \/>\n        });<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ParallelForAsync<br \/>\n.NET 8.0<br \/>\n42.807 ms<br \/>\n1.00<\/p>\n<p>ParallelForAsync<br \/>\n.NET 9.0<br \/>\n7.184 ms<br \/>\n0.17<\/p>\n<p>Even with those new overloads, though, there are still places it\u2019s desirable to use Interlocked.Exchange or Interlocked.CompareExchange where they can\u2019t be used easily. Consider the aforementioned Parallel.ForAsync. It\u2019d be really nice if we could just call Interlocked.CompareExchange&lt;T&gt;, but CompareExchange&lt;T&gt; only works with reference types. So we\u2019re instead left with unsafe code:<\/p>\n<p>static unsafe bool CompareExchange&lt;T&gt;(ref T location, T value, T comparand) where T : unmanaged =&gt;<br \/>\n    sizeof(T) == sizeof(byte) ? Interlocked.CompareExchange(ref Unsafe.As&lt;T, byte&gt;(ref location), Unsafe.As&lt;T, byte&gt;(ref value), Unsafe.As&lt;T, byte&gt;(ref comparand)) == Unsafe.As&lt;T, byte&gt;(ref comparand) :<br \/>\n    sizeof(T) == sizeof(ushort) ?
Interlocked.CompareExchange(ref Unsafe.As&lt;T, ushort&gt;(ref location), Unsafe.As&lt;T, ushort&gt;(ref value), Unsafe.As&lt;T, ushort&gt;(ref comparand)) == Unsafe.As&lt;T, ushort&gt;(ref comparand) :<br \/>\n    sizeof(T) == sizeof(uint) ? Interlocked.CompareExchange(ref Unsafe.As&lt;T, uint&gt;(ref location), Unsafe.As&lt;T, uint&gt;(ref value), Unsafe.As&lt;T, uint&gt;(ref comparand)) == Unsafe.As&lt;T, uint&gt;(ref comparand) :<br \/>\n    sizeof(T) == sizeof(ulong) ? Interlocked.CompareExchange(ref Unsafe.As&lt;T, ulong&gt;(ref location), Unsafe.As&lt;T, ulong&gt;(ref value), Unsafe.As&lt;T, ulong&gt;(ref comparand)) == Unsafe.As&lt;T, ulong&gt;(ref comparand) :<br \/>\n    throw new UnreachableException();<\/p>\n<p>Another place it\u2019d be really nice to use Interlocked.Exchange and Interlocked.CompareExchange is with enums. It\u2019s reasonably common to use these APIs to transition between states in some algorithm, and often the ideal is for those states to be represented as an enum. However, there are no overloads of {Compare}Exchange that have worked with enums, so developers have been forced to use integers instead, often with comments stating something along the lines of \u201cThis should be an enum, but enums can\u2019t work with CompareExchange.\u201d Or, at least, they couldn\u2019t, until .NET 9.<\/p>\n<p>Now in .NET 9, as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104558\">dotnet\/runtime#104558<\/a> the generic Exchange and CompareExchange have had their class constraint removed. This means use of Exchange&lt;T&gt; and CompareExchange&lt;T&gt; will compile for any T. Then at runtime, the T is checked to ensure it\u2019s a reference type, a primitive type, or an enum type; anything else, and it\u2019ll throw. When it is one of those, it delegates to the correspondingly-sized overload.
For example, this now compiles and runs successfully:<\/p>\n<p>static DayOfWeek UpdateIfEqual(ref DayOfWeek location, DayOfWeek newValue, DayOfWeek expectedValue) =&gt;<br \/>\n    Interlocked.CompareExchange(ref location, newValue, expectedValue);<\/p>\n<p>This is not only good for usability, it\u2019s good for performance in a few ways. First, it enables performance improvements like the Parallel.ForAsync one described without needing to resort to Unsafe tricks. But second, it enables smaller objects. The previously listed change not only updated CompareExchange to remove the constraint but also then employed the overload in dozens of places. In Http3Connection, for example, the object previously had these three fields which were updated with Interlocked.Exchange:<\/p>\n<p>private int _haveServerControlStream;<br \/>\nprivate int _haveServerQpackDecodeStream;<br \/>\nprivate int _haveServerQpackEncodeStream;<\/p>\n<p>but these are really just bools masquerading as ints, exactly because they needed to be updated atomically. Now with Interlocked.Exchange&lt;T&gt; and Interlocked.CompareExchange&lt;T&gt; supporting bool, these have been updated to just be:<\/p>\n<p>private bool _haveServerControlStream;<br \/>\nprivate bool _haveServerQpackDecodeStream;<br \/>\nprivate bool _haveServerQpackEncodeStream;<\/p>\n<p>Any additional padding aside, that reduces 12 bytes down to 3 bytes on the object.<\/p>\n<p>Also related to Interlocked, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96258\">dotnet\/runtime#96258<\/a> intrinsifies the Interlocked.And and Interlocked.Or methods for additional platforms; previously they were specially handled on Arm, but now they\u2019re also specially handled on x86\/64. 
As an example, the implementation in the And method is a fairly typical CompareExchange loop:<\/p>\n<p>public static int And(ref int location1, int value)<br \/>\n{<br \/>\n    int current = location1;<br \/>\n    while (true)<br \/>\n    {<br \/>\n        int newValue = current &amp; value;<br \/>\n        int oldValue = CompareExchange(ref location1, newValue, current);<br \/>\n        if (oldValue == current)<br \/>\n        {<br \/>\n            return oldValue;<br \/>\n        }<br \/>\n        current = oldValue;<br \/>\n    }<br \/>\n}<\/p>\n<p>You\u2019ll see a very similar loop any time you want to use optimistic concurrency to create a new value and substitute it for the original in an atomic manner. The actual &amp; operation is just one line here, and to highlight that this is broadly applicable, you could create a generalized version of this method for any operation using a delegate, like this:<\/p>\n<p>public static int CompareExchange(ref int location1, int value, Func&lt;int, int, int&gt; update)<br \/>\n{<br \/>\n    int current = location1;<br \/>\n    while (true)<br \/>\n    {<br \/>\n        int newValue = update(current, value);<br \/>\n        int oldValue = CompareExchange(ref location1, newValue, current);<br \/>\n        if (oldValue == current)<br \/>\n        {<br \/>\n            return oldValue;<br \/>\n        }<br \/>\n        current = oldValue;<br \/>\n    }<br \/>\n}<\/p>\n<p>such that And could then be implemented like:<\/p>\n<p>public static int And(ref int location1, int value) =&gt;<br \/>\n    CompareExchange(ref location1, value, static (current, value) =&gt; current &amp; value);<\/p>\n<p>The approach employed by And is reasonable when there\u2019s nothing better you can do, but as it turns out, modern hardware platforms have single instructions capable of performing such an interlocked \u201cand\u201d and \u201cor\u201d in a much more efficient manner.
The JIT already handled this for Arm because the instructions on Arm have semantics that very closely align with the semantics of Interlocked.And and Interlocked.Or. On x86\/64, however, the relevant instruction sequence (lock and \/ lock or) doesn\u2019t enable accessing the original value atomically replaced, whereas And\/Or require that as part of their definition. Luckily, most uses of Interlocked.And\/Interlocked.Or don\u2019t actually need the return value. For example, SafeHandle.SetHandleAsInvalid simply wants to atomically OR an additional flag into some bit flags, ignoring the result of Or:<\/p>\n<p>public void SetHandleAsInvalid()<br \/>\n{<br \/>\n    Interlocked.Or(ref _state, StateBits.Closed);<br \/>\n    GC.SuppressFinalize(this);<br \/>\n}<\/p>\n<p>And luckily, the JIT can see that it\u2019s ignoring the result. As such, on x86\/64, the JIT can use the optimal sequence when it can see that the result isn\u2019t being used, and even if it is being used, it can still emit a slightly more concise instruction sequence than would have naturally resulted from our open-coded implementation:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int _location;<\/p>\n<p>    [Benchmark] public void Test_ResultNotUsed() =&gt; Interlocked.And(ref _location, 42);<br \/>\n    [Benchmark] public int  Test_ResultUsed() =&gt;    Interlocked.And(ref _location, 42);<br \/>\n}<br \/>\n\/\/ .NET 8<br \/>\n; Tests.Test_ResultNotUsed()<br \/>\n       push      rbp<br \/>\n       sub       rsp,10<br \/>\n       lea       rbp,[rsp+10]<br \/>\n       add   
    rdi,8<br \/>\n       mov       eax,[rdi]<br \/>\nM00_L00:<br \/>\n       mov       ecx,eax<br \/>\n       and       ecx,2A<br \/>\n       mov       [rbp-4],eax<br \/>\n       lock cmpxchg [rdi],ecx<br \/>\n       mov       ecx,[rbp-4]<br \/>\n       cmp       eax,ecx<br \/>\n       je        short M00_L01<br \/>\n       mov       ecx,eax<br \/>\n       mov       eax,ecx<br \/>\n       jmp       short M00_L00<br \/>\nM00_L01:<br \/>\n       add       rsp,10<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 47<\/p>\n<p>; Tests.Test_ResultUsed()<br \/>\n       push      rbp<br \/>\n       sub       rsp,10<br \/>\n       lea       rbp,[rsp+10]<br \/>\n       add       rdi,8<br \/>\n       mov       eax,[rdi]<br \/>\nM00_L00:<br \/>\n       mov       ecx,eax<br \/>\n       and       ecx,2A<br \/>\n       mov       [rbp-4],eax<br \/>\n       lock cmpxchg [rdi],ecx<br \/>\n       mov       ecx,[rbp-4]<br \/>\n       cmp       eax,ecx<br \/>\n       je        short M00_L01<br \/>\n       mov       ecx,eax<br \/>\n       mov       eax,ecx<br \/>\n       jmp       short M00_L00<br \/>\nM00_L01:<br \/>\n       add       rsp,10<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 47<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.Test_ResultNotUsed()<br \/>\n       add       rdi,8<br \/>\n       mov       eax,2A<br \/>\n       lock and  [rdi],eax<br \/>\n       ret<br \/>\n; Total bytes of code 13<\/p>\n<p>; Tests.Test_ResultUsed()<br \/>\n       add       rdi,8<br \/>\n       mov       ecx,2A<br \/>\n       mov       eax,[rdi]<br \/>\nM00_L00:<br \/>\n       mov       edx,eax<br \/>\n       and       edx,ecx<br \/>\n       lock cmpxchg [rdi],edx<br \/>\n       jne       short M00_L00<br \/>\n       ret<br \/>\n; Total bytes of code 22<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<\/p>\n<p>Test_ResultNotUsed<br \/>\n.NET 8.0<br \/>\n6.630 ns<br \/>\n1.00<br \/>\n47 B<\/p>\n<p>Test_ResultNotUsed<br 
\/>\n.NET 9.0<br \/>\n3.132 ns<br \/>\n0.47<br \/>\n13 B<\/p>\n<p>Test_ResultUsed<br \/>\n.NET 8.0<br \/>\n6.853 ns<br \/>\n1.00<br \/>\n47 B<\/p>\n<p>Test_ResultUsed<br \/>\n.NET 9.0<br \/>\n6.435 ns<br \/>\n0.94<br \/>\n22 B<\/p>\n<p>Locks and interlocked operations are about coordinating between operations, at a relatively low level. There are higher level coordination constructs as well; that\u2019s effectively what Task is, providing a representation for a piece of work with which you can later join. Such joining can be accomplished with await along with a myriad of APIs that facilitate joining with tasks in various ways. In that regard, one of my favorite new APIs in .NET 9 is on Task: Task.WhenEach. I like it because it utilizes newer language features to cleanly solve a problem that we wanted to solve over a decade ago when Task was originally introduced, and the lack of it has led to folks writing code with known pits of failure.<\/p>\n<p>Task.WhenAll is fairly easy to understand: you give it a collection of tasks, and the task it returns will complete only when all of the constituent tasks have completed:<\/p>\n<p>await Task.WhenAll([t1, t2, t3]);<br \/>\n... \/\/ only get here when t1, t2, and t3 have all completed successfully<\/p>\n<p>Task.WhenAny is a bit more complex, in that it returns when any of the constituent tasks has completed, and it gives you back a reference to that task:<\/p>\n<p>Task tCompleted = await Task.WhenAny([t1, t2, t3]);<br \/>\n... \/\/ tCompleted is either t1, t2, or t3, and will be completed here<\/p>\n<p>and you can then explicitly join with that returned task to observe any exceptions it may have incurred or consume its result value if it has one. But what do you then do to join with the remaining two tasks?
You might end up writing code something like this:<\/p>\n<p>List&lt;Task&gt; tasks = new() { t1, t2, t3 };<br \/>\nwhile (tasks.Count &gt; 0)<br \/>\n{<br \/>\n    Task completed = await Task.WhenAny(tasks);<br \/>\n    Handle(completed);<br \/>\n    tasks.Remove(completed);<br \/>\n}<\/p>\n<p>That\u2019s not terribly hard, but it\u2019s also not terribly efficient. Or, rather, for larger numbers of tasks, it\u2019s terribly inefficient, as it\u2019s an O(N^2) algorithm. Some of the complexity is likely obvious: you\u2019ve got a loop and inside that loop you\u2019ve got a List&lt;T&gt;.Remove call, which will do an O(N) walk of the list looking for the target element to remove: there\u2019s the O(N^2). But, there\u2019s actually another less obvious O(N) operation in the loop: the WhenAny itself. Every call to WhenAny needs to hook a continuation up to each of the constituent Task objects. (There are of course cheaper ways to implement this functionality than using such a WhenAny, but they\u2019re all more complicated and thus not the answers towards which folks have gravitated.)<\/p>\n<p>Enter Task.WhenEach. WhenEach\u2019s purpose is to make consuming tasks as they complete both simple and efficient.
To do so, rather than returning a Task&lt;Task&gt; as does WhenAny, it returns an IAsyncEnumerable&lt;Task&gt;, so one can simply iterate through the completing tasks as they complete.<\/p>\n<p>await foreach (Task completed in Task.WhenEach([t1, t2, t3]))<br \/>\n{<br \/>\n    Debug.Assert(completed.IsCompleted);<br \/>\n    Handle(completed);<br \/>\n}<\/p>\n<p>It\u2019s a little hard to get a good apples-to-apples comparison of the overhead here, but this benchmark is a reasonable approximation:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter &quot;*&quot;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Params(10, 1_000)]<br \/>\n    public int Count { get; set; }<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task WithWhenAny()<br \/>\n    {<br \/>\n        var tcs = Enumerable.Range(0, Count).Select(_ =&gt; new TaskCompletionSource()).ToList();<\/p>\n<p>        List&lt;Task&gt; tasks = tcs.Select(t =&gt; t.Task).ToList();<br \/>\n        tcs[^1].SetResult();<br \/>\n        while (tasks.Count &gt; 0)<br \/>\n        {<br \/>\n            Task completed = await Task.WhenAny(tasks);<br \/>\n            tasks.Remove(completed);<br \/>\n            tcs.RemoveAt(tcs.Count - 1);<\/p>\n<p>            if (tasks.Count == 0) break;<br \/>\n            tcs[^1].SetResult();<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task WithWhenEach()<br \/>\n    {<br \/>\n        var tcs = Enumerable.Range(0, Count).Select(_ =&gt; new TaskCompletionSource()).ToList();<\/p>\n<p>        int remaining = tcs.Count - 1;<br \/>\n        tcs[remaining].SetResult();<br \/>\n        await foreach (Task completed
in Task.WhenEach(tcs.Select(t =&gt; t.Task)))<br \/>\n        {<br \/>\n            if (remaining == 0) break;<br \/>\n            tcs[--remaining].SetResult();<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nCount<br \/>\nMean<br \/>\nAllocated<\/p>\n<p>WithWhenAny<br \/>\n10<br \/>\n3.232 us<br \/>\n3.47 KB<\/p>\n<p>WithWhenEach<br \/>\n10<br \/>\n1.223 us<br \/>\n1.43 KB<\/p>\n<p>WithWhenAny<br \/>\n1000<br \/>\n20,082.683 us<br \/>\n4207.12 KB<\/p>\n<p>WithWhenEach<br \/>\n1000<br \/>\n102.759 us<br \/>\n94.24 KB<\/p>\n<p>WhenAll also gets a bit cheaper, in a couple of ways. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93953\">dotnet\/runtime#93953<\/a> utilizes a trick employed elsewhere in Task in .NET 8, which is to use its otherwise-unused m_stateObject field (unused because there\u2019s no way to set it with WhenAll) to store some of the state that previously had a dedicated field (a field for storing information about constituent tasks that failed or were canceled). That means the Task object WhenAll returns gets 8 bytes smaller (on 64-bit). On top of that, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101308\">dotnet\/runtime#101308<\/a> adds new ReadOnlySpan&lt;T&gt;-based overloads to a bunch of methods, including Task.WhenAll.
This enables passing in any number of tasks without needing to allocate.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public async Task WhenAll()<br \/>\n    {<br \/>\n        var atmb1 = new AsyncTaskMethodBuilder();<br \/>\n        var atmb2 = new AsyncTaskMethodBuilder();<br \/>\n        Task whenAll = Task.WhenAll([atmb1.Task, atmb2.Task]);<br \/>\n        atmb1.SetResult();<br \/>\n        atmb2.SetResult();<br \/>\n        await whenAll;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>WhenAll<br \/>\n.NET 8.0<br \/>\n123.8 ns<br \/>\n1.00<br \/>\n264 B<br \/>\n1.00<\/p>\n<p>WhenAll<br \/>\n.NET 9.0<br \/>\n103.8 ns<br \/>\n0.86<br \/>\n216 B<br \/>\n0.82<\/p>\n<p>There are some other interesting performance improvements in threading in .NET 9 as well.<\/p>\n<p><strong>Debugger.NotifyOfCrossThreadDependency.<\/strong> This is a big deal. When you\u2019re debugging a .NET process and you break in the debugger, it pauses all threads in the debuggee process so that nothing is making forward progress while you examine state. However, .NET debuggers, like the one in Visual Studio, support invoking properties and methods in the debuggee while debugging. That can be a big problem if the functionality being invoked relies on one of those paused threads to do something, e.g. if the property you access tries to take a lock that\u2019s held by another thread or tries to Wait on a Task.
To mitigate problems here, the Debugger.NotifyOfCrossThreadDependency method exists. Functionality that relies on another thread to do something can call NotifyOfCrossThreadDependency; if there\u2019s no debugger attached, it\u2019s a nop, but if there is a debugger attached, this signals the problem to the debugger, which can then react accordingly. The Visual Studio debugger reacts by stopping the evaluation and then offering an opt-in option of \u201cslipping\u201d all threads: unpausing them until the evaluated operation completes, at which point all threads are paused again, thereby mitigating any problems that might occur from the cross-thread dependency. NotifyOfCrossThreadDependency is generally not used by application code, but it\u2019s used in a few critical choke points in the core libraries, in particular throughout System.Threading and the infrastructure for async\/await. That means, for example, that this method is being called any time you await a Task that\u2019s not yet completed. And, unfortunately, while the method is a cheap nop when the debugger isn\u2019t attached, historically it\u2019s been fairly expensive when the debugger is attached, to the point where it can meaningfully impact a developer\u2019s experience in the tool. Thankfully, .NET 9 addresses this with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101864\">dotnet\/runtime#101864<\/a>, which significantly improves the performance of NotifyOfCrossThreadDependency when a debugger is attached. We can see this with a low-tech benchmark.
Replace the contents of your Program.cs with this:<\/p>\n<p>using System.Diagnostics;<\/p>\n<p>const int Iters = 100_000;<br \/>\nStopwatch sw = new();<br \/>\nwhile (true)<br \/>\n{<br \/>\n    sw.Restart();<br \/>\n    for (int i = 0; i &lt; Iters; i++)<br \/>\n    {<br \/>\n        Debugger.NotifyOfCrossThreadDependency();<br \/>\n    }<br \/>\n    sw.Stop();<br \/>\n    Console.WriteLine($&quot;{sw.Elapsed.TotalMicroseconds \/ Iters:N3}us&quot;);<br \/>\n}<\/p>\n<p>open the project in Visual Studio, ensure that .NET 8 is selected as the target framework and that you\u2019re targeting Release,<\/p>\n<p>and run with the debugger attached (F5, not ctrl-F5). When I do that, I see numbers like this (on Windows):<\/p>\n<p>48.360us<br \/>\n45.281us<br \/>\n46.714us<br \/>\n46.945us<br \/>\n46.525us<\/p>\n<p>Then change the target framework to be .NET 9, and run with the debugger attached again. I then see numbers like this:<\/p>\n<p>1.973us<br \/>\n1.713us<br \/>\n1.714us<br \/>\n1.871us<br \/>\n1.963us<\/p>\n<p>While such an improvement shouldn\u2019t impact your production workloads, it can make an impactful difference to your productivity as a developer.<\/p>\n<p><strong>Volatile.<\/strong> A \u201cmemory model\u201d is a description of how threads interact with memory and what guarantees are made about how different threads produce and consume changes in shared memory. Reads and writes to memory from a single thread are guaranteed to be observed by that thread in the order they occurred, but once multiple threads enter the picture, it\u2019s up to the memory model to define what behaviors can be relied on and which can\u2019t. For example, if there are two fields, _a and _b, both of which start as 0, and if one thread does:<br \/>\n_a = 1;<br \/>\n_b = 2;<\/p>\n<p>and then another does:<\/p>\n<p>while (_b != 2);<br \/>\nAssert(_a == 1);<\/p>\n<p>is that assert guaranteed to always pass?
It depends on the memory model, and whether the writes from thread 1 might get reordered (by any of the involved compilers or even by the hardware) such that the write to _b became visible to thread 2 before the write to _a. For the longest time, the only official memory model for .NET was the one defined by the <a href=\"https:\/\/ecma-international.org\/publications-and-standards\/standards\/ecma-335\/\">ECMA 335<\/a> specification, but real implementations, including coreclr, generally had stronger guarantees than what ECMA detailed. Thankfully, the official <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/main\/docs\/design\/specs\/Memory-model.md\">.NET memory model<\/a> has now been documented. However, some of the practices that were being employed in the core libraries (due to defensive coding or uncertainty of the memory model or out-of-date requirements) are no longer necessary. One of the main tools available for folks coding at a level where memory model is relevant is the volatile keyword \/ the Volatile class. Marking a field as volatile causes any reads or writes of that field to be considered \u201cvolatile,\u201d just as does using Volatile.Read\/Volatile.Write to perform that read or write. Making the read or write volatile means it prevents certain kinds of \u201cmovement,\u201d e.g. if both _a and _b in the previous example were marked as volatile, the assert would always pass. Marking fields or operations as volatile can come with an expense, depending on the circumstance and the target platform. For example, it can restrict the C# compiler and the JIT compiler from performing certain optimizations. Let\u2019s take a simple example. 
This code:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private volatile int _volatile;<br \/>\n    private int _nonVolatile;<\/p>\n<p>    [Benchmark]<br \/>\n    public int UsingVolatile() =&gt; _volatile + _volatile;<\/p>\n<p>    [Benchmark]<br \/>\n    public int UsingNonVolatile() =&gt; _nonVolatile + _nonVolatile;<br \/>\n}<\/p>\n<p>results in this assembly on .NET 9:<\/p>\n<p>; Tests.UsingVolatile()<br \/>\n       mov       eax,[rdi+8]<br \/>\n       add       eax,[rdi+8]<br \/>\n       ret<br \/>\n; Total bytes of code 7<\/p>\n<p>; Tests.UsingNonVolatile()<br \/>\n       mov       eax,[rdi+0C]<br \/>\n       add       eax,eax<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>The important difference between the two assembly blocks is in the add instruction. In the UsingVolatile method, the first instruction is loading the value from memory stored at address rdi+8, and then re-reading that same rdi+8 memory location again to add whatever is there with what it just read. In UsingNonVolatile, it starts the same way, reading the value stored at rdi+0xc, but then the add isn\u2019t doing another memory read and is instead just doubling the value stored in the register. One of the effects of volatile requiring that reads can\u2019t be moved is also that they can\u2019t be <em>removed<\/em>, which means both reads in the code are required to stay.
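<p>The same guarantee applies when using the Volatile class explicitly rather than the volatile keyword; a minimal sketch (the type and member names here are mine, not from the runtime):<\/p>

```csharp
using System.Threading;

public class Flags
{
    private int _value;

    // Each Volatile.Read is a real load from memory; the compilers may not
    // coalesce the two reads into one, matching the volatile-field codegen above.
    public int ReadTwice() => Volatile.Read(ref _value) + Volatile.Read(ref _value);

    // Volatile.Write similarly may not be reordered past earlier operations or removed.
    public void Set(int value) => Volatile.Write(ref _value, value);
}
```

<p>This is handy when only a few accesses to a field need volatile semantics and you would rather not pay for them on every access via a volatile field declaration.<\/p>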
Here\u2019s another example:<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private volatile bool _volatile;<br \/>\n    private bool _nonVolatile;<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountVolatile()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        while (_volatile) count++;<br \/>\n        return count;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountNonVolatile()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        while (_nonVolatile) count++;<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>which on .NET 9 produces this assembly:<\/p>\n<p>; Tests.CountVolatile()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\n       cmp       byte ptr [rdi+8],0<br \/>\n       jne       short M00_L01<br \/>\nM00_L00:<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       inc       eax<br \/>\n       cmp       byte ptr [rdi+8],0<br \/>\n       jne       short M00_L01<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 24<\/p>\n<p>; Tests.CountNonVolatile()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       xor       eax,eax<br \/>\n       cmp       byte ptr [rdi+9],0<br \/>\n       jne       short M00_L00<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 16<\/p>\n<p>They look somewhat similar, in fact the first five instructions are almost identical, but there\u2019s a critical difference. 
In both cases, the bool value is being loaded and checked to see if it\u2019s false (the cmp against 0 followed by a conditional jump), in which case the implementations both fall through to the ending ret to exit out of the method. The compiler is rewriting the while (cond) { &#8230; } loop to instead be more like an if (cond) { do { &#8230; } while(cond); }, so this initial test is the one for that if (cond). But then things diverge meaningfully. CountVolatile then proceeds to do the do while equivalent, incrementing the count (stored in eax), reading _volatile and comparing it to 0 (false), and if it\u2019s still true, jumping back up to loop again. So basically what you\u2019d expect. But now look at CountNonVolatile. The loop is just this:<\/p>\n<p>M00_L00:<br \/>\n       jmp       short M00_L00<\/p>\n<p>It\u2019s now sitting in an infinite loop, with an unconditional jump back to the same jmp instruction, looping forever. That\u2019s because the JIT was able to hoist the read of _nonVolatile out of the loop. It then also sees that no one will ever observe count\u2018s changed value, so it also elides the increment. At which point it\u2019s more like if I\u2019d written this C#:<\/p>\n<p>public int CountNonVolatile()<br \/>\n{<br \/>\n    int count = 0;<br \/>\n    if (_nonVolatile)<br \/>\n    {<br \/>\n        while (true);<br \/>\n    }<\/p>\n<p>    return count;<br \/>\n}<\/p>\n<p>That hoisting can\u2019t be done when the field is volatile, because it can\u2019t reorder or remove reads associated with the field. But with _nonVolatile, nothing prevents that. 
On multiple occasions I\u2019ve seen folks trying to engage in low-lock programming experience the ramifications of this latter example: they\u2019ll be using some bool to signal to a consumer that it should break out of the loop, but the bool isn\u2019t volatile, and the consumer then never notices when the producer eventually sets it.<\/p>\n<p>Those are examples of the ramifications of volatile in terms of what the C# and JIT compilers are constrained from doing. But there are also things the JIT <em>needs<\/em> to do (rather than avoid) in order to ensure the hardware can respect the requirements put in place by the developer. On some hardware, like x64, the memory model of the hardware is relatively \u201cstrong,\u201d meaning it doesn\u2019t do most of the kinds of reorderings that volatile inhibits, and therefore you won\u2019t see anything emitted into the assembly code by the JIT to help the hardware enforce the constraints. On other hardware, like Arm64, though, the hardware has a relatively \u201cweaker\u201d model, meaning it allows more of these kinds of reorderings, and as a result, the JIT needs to actively inhibit such reorderings by inserting appropriate \u201cmemory barriers\u201d into the code. On Arm, this shows up with instructions like dmb (\u201cdata memory barrier\u201d). Such barriers have some overhead associated with them.<\/p>\n<p>For all of these reasons, the fewer volatiles, the better for performance, but of course you need to ensure you have enough volatiles to actually achieve a correct application (with the best answer being to avoid writing low-lock code in the first place, and then you never need to know or think about volatile). It\u2019s a balance. Luckily, and bringing us full circle to why we\u2019re talking about this, there are a set of common cases where volatile used to be recommended but now that we have a well-defined memory model, those uses are obsolete. Removing them can help to avoid a thin layer of cost across the code. 
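<\/p>\n<p>As an aside, the safe version of the signaling pattern described a moment ago is worth sketching. This is my illustration rather than code from the post, with a hypothetical Worker type; marking the flag volatile, or reading it via Volatile.Read as here, keeps the compilers from hoisting the read out of the loop:<\/p>

```csharp
using System.Threading;

class Worker
{
    // Written by one thread (Stop), read by another (Consume).
    private bool _stop;

    public void Consume()
    {
        // Volatile.Read forces a fresh read on every iteration, so the
        // consuming thread is guaranteed to eventually observe Stop's write
        // rather than spinning forever on a hoisted, stale value.
        while (!Volatile.Read(ref _stop))
        {
            // ... do work ...
        }
    }

    public void Stop() => Volatile.Write(ref _stop, true);
}
```

<p>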
So <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100969\">dotnet\/runtime#100969<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101346\">dotnet\/runtime#101346<\/a> removed a bunch of volatile usage where it was no longer necessary. Almost all of these uses were as part of lazy initialization of reference types, e.g.<\/p>\n<p>private volatile MyReferenceType? _instance;<\/p>\n<p>public MyReferenceType Instance =&gt; _instance ??= new MyReferenceType();<\/p>\n<p>which if we expand that out to not use ??= looks something like this:<\/p>\n<p>private MyReferenceType? _instance;<\/p>\n<p>public MyReferenceType Instance<br \/>\n{<br \/>\n    get<br \/>\n    {<br \/>\n        MyReferenceType? instance = _instance;<br \/>\n        if (instance is null)<br \/>\n        {<br \/>\n            _instance = instance = new MyReferenceType();<br \/>\n        }<\/p>\n<p>        return instance;<br \/>\n    }<br \/>\n}<\/p>\n<p>The reason for the volatile here would be two-fold, one for the part of the operation that reads and one for the part of the operation that writes. Without the volatile, the concern would be that one of the compilers or the hardware could actually \u201cintroduce a read,\u201d effectively making the code equivalent to this:<\/p>\n<p>private MyReferenceType? _instance;<\/p>\n<p>public MyReferenceType Instance<br \/>\n{<br \/>\n    get<br \/>\n    {<br \/>\n        MyReferenceType? instance = _instance;<br \/>\n        if (_instance is null)<br \/>\n        {<br \/>\n            _instance = instance = new MyReferenceType();<br \/>\n        }<\/p>\n<p>        return instance;<br \/>\n    }<br \/>\n}<\/p>\n<p>If that were to happen, there\u2019s a problem that between the two reads, the value of _instance could go from null to non-null, in which case instance could be assigned null, _instance is null might then be false, and return instance would return null. 
However, the .NET memory model explicitly states \u201cReads cannot be introduced.\u201d Then there\u2019s the concern about the write. The concern there that leads to volatile being used is the initialization that happens inside of MyReferenceType. Imagine if MyReferenceType were defined like this:<\/p>\n<p>class MyReferenceType<br \/>\n{<br \/>\n    internal int _value;<br \/>\n    public MyReferenceType() =&gt; _value = 42;<br \/>\n}<\/p>\n<p>The question then becomes \u201cis it possible for the write to _value inside of the constructor to be viewed by another thread <em>after<\/em> the write of the instance to _instance\u201d? In other words, could the code logically become the equivalent of this:<\/p>\n<p>private MyReferenceType? _instance;<\/p>\n<p>public MyReferenceType Instance<br \/>\n{<br \/>\n    get<br \/>\n    {<br \/>\n        MyReferenceType? instance = _instance;<br \/>\n        if (_instance is null)<br \/>\n        {<br \/>\n            _instance = instance = (MyReferenceType)RuntimeHelpers.GetUninitializedObject(typeof(MyReferenceType));<br \/>\n            instance._value = 42;<br \/>\n        }<\/p>\n<p>        return instance;<br \/>\n    }<br \/>\n}<\/p>\n<p>If that could happen, then two threads could be racing to access Instance, one of them could get as far as setting _instance (but _value hasn\u2019t been set yet), then another thread could access Instance, see _instance as non-null, and start using it, even though _value hasn\u2019t yet been initialized. Thankfully here as well, the .NET memory model doesn\u2019t allow such transformations, explicitly covering this point:<\/p>\n<p>\u201cObject assignment to a location potentially accessible by other threads is a release with respect to accesses to the instance\u2019s fields\/elements and metadata. An optimizing compiler must preserve the order of object assignment and data-dependent memory accesses. 
The motivation is to ensure that storing an object reference to shared memory acts as a \u201ccommitting point\u201d to all modifications that are reachable through the instance reference.\u201d<\/p>\n<p>Phew!<\/p>\n<p><strong>ManagedThreadId.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91232\">dotnet\/runtime#91232<\/a> is fun, in a \u201cwhy didn\u2019t we already do this\u201d sort of way. Thread.ManagedThreadId is implemented as an internal call (an FCALL) into the runtime, resulting in a call to ThreadNative::GetManagedThreadId, which in turn reads the thread object\u2019s m_ManagedThreadId field. At least, that\u2019s the field in the native C++ definition of the object. The managed Thread object has corresponding fields at the same locations, available for the C# code to use, in this case _managedThreadId. So what did this PR do? It removed those complicated gymnastics and just made the whole implementation be public int ManagedThreadId =&gt; _managedThreadId. (It\u2019s worth noting, though, that Thread.CurrentThread.ManagedThreadId was already previously recognized specially by the JIT, so this is only relevant when accessing the ManagedThreadId from some other Thread instance.) 
The main benefit of this is avoiding the extra function call, as the FCALL can\u2019t be inlined.<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Thread _thread = Thread.CurrentThread;<\/p>\n<p>    [Benchmark]<br \/>\n    public int GetID() =&gt; _thread.ManagedThreadId;<br \/>\n}<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.GetID()<br \/>\n       mov       rdi,[rdi+8]<br \/>\n       cmp       [rdi],edi<br \/>\n       jmp       near ptr System.Threading.Thread.get_ManagedThreadId()<br \/>\n; Total bytes of code 11<br \/>\n**Extern method**<br \/>\nSystem.Threading.Thread.get_ManagedThreadId()<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.GetID()<br \/>\n       mov       rax,[rdi+8]<br \/>\n       mov       eax,[rax+34]<br \/>\n       ret<br \/>\n; Total bytes of code 8<\/p>\n<p><strong>Ports to NativeAOT.<\/strong> Previous releases of .NET enabled inlining the fast path of thread-local state (TLS) access on coreclr. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104282\">dotnet\/runtime#104282<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/89472\">dotnet\/runtime#89472<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97910\">dotnet\/runtime#97910<\/a>, this improvement comes to NativeAOT as well. 
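<\/p>\n<p>For context, the TLS access in question is ordinary thread-static field access. A minimal sketch (my example, not from the post):<\/p>

```csharp
class Telemetry
{
    // Each thread sees its own copy of this field. Reading or writing it
    // goes through thread-local storage; it's the fast path of that TLS
    // lookup that the compiler can now inline on NativeAOT instead of
    // always calling into a runtime helper.
    [ThreadStatic]
    private static int t_operationCount;

    public static void Record() => t_operationCount++;

    public static int OperationCount => t_operationCount;
}
```

<p>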
Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103675\">dotnet\/runtime#103675<\/a> ports coreclr\u2019s \u201cyield normalization\u201d implementation to NativeAOT; this is in support of enabling the runtime to measure the cost of various pause instructions, which can then be used as part of tuning spinning and spin waiting.<\/p>\n<p><strong>Startup time.<\/strong> Performance improvements related to threading are generally about steady-state throughput improvements, e.g. reducing synchronization costs while processing requests. That\u2019s what makes <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106724\">dotnet\/runtime#106724<\/a> from <a href=\"https:\/\/github.com\/harisokanovic\">@harisokanovic<\/a> so interesting, in that it\u2019s instead about reducing startup overheads of a process using .NET on Linux. The GC uses the equivalent of a process-wide memory barrier (also exposed publicly as Interlocked.MemoryBarrierProcessWide) to ensure that all threads involved in a collection see a consistent state. On Linux, implementing this method efficiently involves using the membarrier system call with MEMBARRIER_CMD_PRIVATE_EXPEDITED, and using that requires the same syscall to have been made earlier on with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, which really means doing so at startup. However, the Linux kernel has some optimizations that make this registration a lot cheaper when there\u2019s only one thread in the process. The way it was being used in .NET previously guaranteed there would be multiple. This PR changed where this initialization was performed in order to maximize the possibility of there only being the single thread in the process, which in turn makes startup faster. 
The improvement was upwards of 10ms on various systems on which it was measured, which is a large percentage of a .NET process\u2019 startup overhead on Linux.<\/p>\n<h2>Reflection<\/h2>\n<p>Reflection is a very powerful (though sometimes overused) capability of .NET that enables code to load and introspect .NET assemblies and invoke their functionality. It is used in all manner of library and application, including by the core .NET libraries themselves, and it\u2019s important that we continue to find ways to reduce the overheads associated with reflection.<\/p>\n<p>Several PRs in .NET 9 whittle away at some of the allocation overheads in reflection. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92310\">dotnet\/runtime#92310<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93115\">dotnet\/runtime#93115<\/a> avoid some defensive array copies by instead handing around ReadOnlySpan&lt;T&gt; instances, while <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95952\">dotnet\/runtime#95952<\/a> removed a use of string.Split that turned out to only be used with constants and thus could be replaced by just manually splitting those constants. But a more interesting and impactful addition comes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97683\">dotnet\/runtime#97683<\/a>, which added an allocation-free way to get the invocation list from a delegate. Delegates in .NET are \u201cmulticast,\u201d meaning a single delegate instance might actually represent multiple methods to be invoked; this is how .NET events are implemented. If I invoke a delegate, the delegate implementation handles invoking each constituent method, sequentially, in turn. But what if I want to customize the invocation logic? Maybe I want to wrap each individual method in a try\/catch, or maybe I want to track the return values from all of the methods rather than just the last, or some such behavior. 
To achieve that, delegates expose a way to get an array of delegates, one for each method that\u2019s part of the original. So, if I have:<\/p>\n<p>Action action = () =&gt; Console.Write(&#8220;A &#8221;);<br \/>\naction += () =&gt; Console.Write(&#8220;B &#8221;);<br \/>\naction += () =&gt; Console.Write(&#8220;C &#8221;);<br \/>\naction();<\/p>\n<p>that\u2019ll print out &#8220;A B C &#8221;, and if I have:<\/p>\n<p>Action action = () =&gt; Console.Write(&#8220;A &#8221;);<br \/>\naction += () =&gt; Console.Write(&#8220;B &#8221;);<br \/>\naction += () =&gt; Console.Write(&#8220;C &#8221;);<\/p>\n<p>Delegate[] actions = action.GetInvocationList();<br \/>\nfor (int i = 0; i &lt; actions.Length; ++i)<br \/>\n{<br \/>\n    Console.Write($&#8220;{i}: &#8221;);<br \/>\n    ((Action)actions[i])();<br \/>\n    Console.WriteLine();<br \/>\n}<\/p>\n<p>that\u2019ll print out:<\/p>\n<p>0: A<br \/>\n1: B<br \/>\n2: C<\/p>\n<p>However, that GetInvocationList needs to allocate. Now in .NET 9, there\u2019s the new Delegate.EnumerateInvocationList&lt;TDelegate&gt; method, which returns a struct-based enumerable for iterating through the delegates rather than needing to allocate a new array to store them all.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Action _action;<br \/>\n    private int _count;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _action = () =&gt; _count++;<br \/>\n        _action += () =&gt; _count += 2;<br \/>\n        _action += () =&gt; _count += 3;<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    
public void GetInvocationList()<br \/>\n    {<br \/>\n        foreach (Action action in _action.GetInvocationList())<br \/>\n        {<br \/>\n            action();<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public void EnumerateInvocationList()<br \/>\n    {<br \/>\n        foreach (Action action in Delegate.EnumerateInvocationList(_action))<br \/>\n        {<br \/>\n            action();<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Mean<\/th><th>Ratio<\/th><th>Allocated<\/th><th>Alloc Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>GetInvocationList<\/td><td>32.11 ns<\/td><td>1.00<\/td><td>48 B<\/td><td>1.00<\/td><\/tr>\n<tr><td>EnumerateInvocationList<\/td><td>11.07 ns<\/td><td>0.34<\/td><td>\u2013<\/td><td>0.00<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Reflection is particularly important with libraries involved in dependency injection, where object construction is frequently done in a more dynamic fashion. ActivatorUtilities.CreateInstance plays a key role there, and has also seen improvements from allocation reduction. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99383\">dotnet\/runtime#99383<\/a>, in particular, helped to significantly reduce allocation by employing the ConstructorInvoker type introduced in .NET 8, and by piggybacking on the changes from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99175\">dotnet\/runtime#99175<\/a> to cut down on the number of constructors it needs to examine.<\/p>\n<p>\/\/ Add a &lt;PackageReference Include=&#8220;Microsoft.Extensions.DependencyInjection&#8221; Version=&#8220;8.0.0&#8221; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing Microsoft.Extensions.DependencyInjection;<\/p>\n<p>var config = DefaultConfig.Instance<br \/>\n    
.AddJob(Job.Default.WithRuntime(CoreRuntime.Core80)<br \/>\n        .WithNuGet(&#8220;Microsoft.Extensions.DependencyInjection&#8221;, &#8220;8.0.0&#8221;)<br \/>\n        .WithNuGet(&#8220;Microsoft.Extensions.DependencyInjection.Abstractions&#8221;, &#8220;8.0.1&#8221;).AsBaseline())<br \/>\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90)<br \/>\n        .WithNuGet(&#8220;Microsoft.Extensions.DependencyInjection&#8221;, &#8220;9.0.0-rc.1.24431.7&#8221;)<br \/>\n        .WithNuGet(&#8220;Microsoft.Extensions.DependencyInjection.Abstractions&#8221;, &#8220;9.0.0-rc.1.24431.7&#8221;));<br \/>\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IServiceProvider _serviceProvider = new ServiceCollection().BuildServiceProvider();<\/p>\n<p>    [Benchmark]<br \/>\n    public MyClass Create() =&gt; ActivatorUtilities.CreateInstance&lt;MyClass&gt;(_serviceProvider, 1, 2, 3);<\/p>\n<p>    public class MyClass<br \/>\n    {<br \/>\n        public MyClass() { }<br \/>\n        public MyClass(int a) { }<br \/>\n        public MyClass(int a, int b) { }<br \/>\n        [ActivatorUtilitiesConstructor]<br \/>\n        public MyClass(int a, int b, int c) { }<br \/>\n    }<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><th>Allocated<\/th><th>Alloc Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Create<\/td><td>.NET 8.0<\/td><td>163.60 ns<\/td><td>1.00<\/td><td>288 B<\/td><td>1.00<\/td><\/tr>\n<tr><td>Create<\/td><td>.NET 9.0<\/td><td>83.46 ns<\/td><td>0.51<\/td><td>144 B<\/td><td>0.50<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>The aforementioned ConstructorInvoker, along with a MethodInvoker, was introduced in .NET 8 as a way to cache first-use information to enable all subsequent operations to be much faster. 
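<\/p>\n<p>To make the invoker pattern concrete, here\u2019s a hedged sketch of typical ConstructorInvoker usage (my example, not from the post; Uri is just a stand-in target type):<\/p>

```csharp
using System;
using System.Reflection;

// Pay the reflection setup cost once...
ConstructorInfo ctor = typeof(Uri).GetConstructor(new[] { typeof(string) })!;
ConstructorInvoker invoker = ConstructorInvoker.Create(ctor);

// ...then each invocation reuses the cached binding and validation work,
// making it much cheaper than calling ConstructorInfo.Invoke every time.
var uri = (Uri)invoker.Invoke("https://example.com/");
Console.WriteLine(uri.Host);
```

<p>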
Without introducing a new public FieldInvoker, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98199\">dotnet\/runtime#98199<\/a> is able to achieve similar levels of speedup for field access via a FieldInfo by employing an internal FieldAccessor that\u2019s cached onto the FieldInfo object (<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92512\">dotnet\/runtime#92512<\/a> also helped here by moving some native runtime implementations back up into C#). Varying levels of large speedups are achieved depending on the exact nature of the field being accessed.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Reflection;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static object s_staticReferenceField = new object();<br \/>\n    private object _instanceReferenceField = new object();<br \/>\n    private static int s_staticValueField = 1;<br \/>\n    private int _instanceValueField = 2;<br \/>\n    private object _obj = new();<\/p>\n<p>    private FieldInfo _staticReferenceFieldInfo = typeof(Tests).GetField(nameof(s_staticReferenceField), BindingFlags.NonPublic | BindingFlags.Static)!;<br \/>\n    private FieldInfo _instanceReferenceFieldInfo = typeof(Tests).GetField(nameof(_instanceReferenceField), BindingFlags.NonPublic | BindingFlags.Instance)!;<br \/>\n    private FieldInfo _staticValueFieldInfo = typeof(Tests).GetField(nameof(s_staticValueField), BindingFlags.NonPublic | BindingFlags.Static)!;<br \/>\n    private FieldInfo _instanceValueFieldInfo = typeof(Tests).GetField(nameof(_instanceValueField), BindingFlags.NonPublic | 
BindingFlags.Instance)!;<\/p>\n<p>    [Benchmark] public object? GetStaticReferenceField() =&gt; _staticReferenceFieldInfo.GetValue(null);<br \/>\n    [Benchmark] public void SetStaticReferenceField() =&gt; _staticReferenceFieldInfo.SetValue(null, _obj);<\/p>\n<p>    [Benchmark] public object? GetInstanceReferenceField() =&gt; _instanceReferenceFieldInfo.GetValue(this);<br \/>\n    [Benchmark] public void SetInstanceReferenceField() =&gt; _instanceReferenceFieldInfo.SetValue(this, _obj);<\/p>\n<p>    [Benchmark] public int GetStaticValueField() =&gt; (int)_staticValueFieldInfo.GetValue(null)!;<br \/>\n    [Benchmark] public void SetStaticValueField() =&gt; _staticValueFieldInfo.SetValue(null, 3);<\/p>\n<p>    [Benchmark] public int GetInstanceValueField() =&gt; (int)_instanceValueFieldInfo.GetValue(this)!;<br \/>\n    [Benchmark] public void SetInstanceValueField() =&gt; _instanceValueFieldInfo.SetValue(this, 4);<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>GetStaticReferenceField<\/td><td>.NET 8.0<\/td><td>24.839 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>GetStaticReferenceField<\/td><td>.NET 9.0<\/td><td>1.720 ns<\/td><td>0.07<\/td><\/tr>\n<tr><td>SetStaticReferenceField<\/td><td>.NET 8.0<\/td><td>41.025 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>SetStaticReferenceField<\/td><td>.NET 9.0<\/td><td>6.964 ns<\/td><td>0.17<\/td><\/tr>\n<tr><td>GetInstanceReferenceField<\/td><td>.NET 8.0<\/td><td>29.595 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>GetInstanceReferenceField<\/td><td>.NET 9.0<\/td><td>5.960 ns<\/td><td>0.20<\/td><\/tr>\n<tr><td>SetInstanceReferenceField<\/td><td>.NET 8.0<\/td><td>31.753 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>SetInstanceReferenceField<\/td><td>.NET 9.0<\/td><td>9.577 ns<\/td><td>0.30<\/td><\/tr>\n<tr><td>GetStaticValueField<\/td><td>.NET 8.0<\/td><td>43.847 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>GetStaticValueField<\/td><td>.NET 9.0<\/td><td>36.011 ns<\/td><td>0.82<\/td><\/tr>\n<tr><td>SetStaticValueField<\/td><td>.NET 8.0<\/td><td>39.462 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>SetStaticValueField<\/td><td>.NET 9.0<\/td><td>10.396 ns<\/td><td>0.26<\/td><\/tr>\n<tr><td>GetInstanceValueField<\/td><td>.NET 8.0<\/td><td>45.125 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>GetInstanceValueField<\/td><td>.NET 9.0<\/td><td>39.104 ns<\/td><td>0.87<\/td><\/tr>\n<tr><td>SetInstanceValueField<\/td><td>.NET 8.0<\/td><td>36.664 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>SetInstanceValueField<\/td><td>.NET 9.0<\/td><td>13.571 ns<\/td><td>0.37<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Of course, if you can avoid using these more expensive reflection approaches in the first place, that\u2019s very desirable. One reason for using reflection is to access private members of other types, and while that\u2019s a scary thing to do and generally something to be avoided, there are valid cases for it where having an efficient solution is highly desirable. .NET 8 added such a mechanism in [UnsafeAccessor], which enables a type to declare a method that effectively serves as direct access to a member of another type. So, for example, with this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Reflection;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private MyClass _myClass = new MyClass(new List&lt;int&gt;() { 1, 2, 3 });<br \/>\n    private FieldInfo _fieldInfo = typeof(MyClass).GetField(&#8220;_list&#8221;, BindingFlags.NonPublic | BindingFlags.Instance)!;<\/p>\n<p>    private static class Accessors<br \/>\n    {<br \/>\n        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = &#8220;_list&#8221;)]<br \/>\n        public static extern ref object GetList(MyClass myClass);<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public object WithFieldInfo() =&gt; _fieldInfo.GetValue(_myClass)!;<\/p>\n<p>    [Benchmark]<br \/>\n    
public object WithUnsafeAccessor() =&gt; Accessors.GetList(_myClass);<br \/>\n}<\/p>\n<p>public class MyClass(object list)<br \/>\n{<br \/>\n    private object _list = list;<br \/>\n}<\/p>\n<p>I get this:<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>WithFieldInfo<\/td><td>.NET 8.0<\/td><td>27.5299 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>WithFieldInfo<\/td><td>.NET 9.0<\/td><td>4.0789 ns<\/td><td>0.15<\/td><\/tr>\n<tr><td>WithUnsafeAccessor<\/td><td>.NET 8.0<\/td><td>0.5005 ns<\/td><td>0.02<\/td><\/tr>\n<tr><td>WithUnsafeAccessor<\/td><td>.NET 9.0<\/td><td>0.5499 ns<\/td><td>0.02<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>However, in .NET 8, this mechanism could only be used with non-generic members. Now in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99468\">dotnet\/runtime#99468<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/issues\/99830\">dotnet\/runtime#99830<\/a>, this capability now extends to generics, as well.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Reflection;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private MyClass&lt;int&gt; _myClass = new MyClass&lt;int&gt;(new List&lt;int&gt;() { 1, 2, 3 });<br \/>\n    private FieldInfo _fieldInfo = typeof(MyClass&lt;int&gt;).GetField(&#8220;_list&#8221;, BindingFlags.NonPublic | BindingFlags.Instance)!;<br \/>\n    private static class Accessors&lt;T&gt;<br \/>\n    {<br \/>\n        [UnsafeAccessor(UnsafeAccessorKind.Field, Name = &#8220;_list&#8221;)]<br \/>\n        public static extern ref List&lt;T&gt; GetList(MyClass&lt;T&gt; myClass);<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public 
List&lt;int&gt; WithFieldInfo() =&gt; (List&lt;int&gt;)_fieldInfo.GetValue(_myClass)!;<\/p>\n<p>    [Benchmark]<br \/>\n    public List&lt;int&gt; WithUnsafeAccessor() =&gt; Accessors&lt;int&gt;.GetList(_myClass);<br \/>\n}<\/p>\n<p>public class MyClass&lt;T&gt;(List&lt;T&gt; list)<br \/>\n{<br \/>\n    private List&lt;T&gt; _list = list;<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Mean<\/th><th>Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>WithFieldInfo<\/td><td>4.4251 ns<\/td><td>1.00<\/td><\/tr>\n<tr><td>WithUnsafeAccessor<\/td><td>0.5147 ns<\/td><td>0.12<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>Parsing that occurs as part of reflection, in particular the parsing of type names, was also improved as part of work to consolidate type name parsing into a reusable component. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100094\">dotnet\/runtime#100094<\/a>\u2019s primary purpose wasn\u2019t to improve performance, but it ended up doing so, anyway.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public Type? 
Parse() =&gt;<br \/>\n        Type.GetType(&#8220;System.Collections.Generic.Dictionary`2[&#8221; +<br \/>\n                         &#8220;[System.Collections.Generic.List`1[&#8221; +<br \/>\n                            &#8220;[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], &#8221; +<br \/>\n                            &#8220;System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],&#8221; +<br \/>\n                         &#8220;[System.Collections.Generic.List`1[&#8221; +<br \/>\n                            &#8220;[System.String, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], &#8221; +<br \/>\n                            &#8220;System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], &#8221; +<br \/>\n                         &#8220;System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e&#8221;);<br \/>\n}<\/p>\n<table>\n<thead>\n<tr><th>Method<\/th><th>Runtime<\/th><th>Mean<\/th><th>Ratio<\/th><th>Allocated<\/th><th>Alloc Ratio<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Parse<\/td><td>.NET 8.0<\/td><td>7.590 us<\/td><td>1.00<\/td><td>5.03 KB<\/td><td>1.00<\/td><\/tr>\n<tr><td>Parse<\/td><td>.NET 9.0<\/td><td>6.361 us<\/td><td>0.84<\/td><td>4.73 KB<\/td><td>0.94<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>And then there are more intrinsics. In compiler speak, an \u201cintrinsic\u201d is something the compiler has \u201cintrinsic\u201d knowledge of, a fancy way of saying it\u2019s something the compiler implicitly knows about. This typically manifests as a method whose implementation is provided by the compiler, sometimes unconditionally, and sometimes only under certain conditions. 
For example, string.Equals is attributed as [Intrinsic]: it has its own fully-functional implementation, but if the JIT sees that at least one of the inputs is a constant string, the JIT may emit its own optimized implementation for Equals that unrolls and vectorizes the comparison based on the exact value being compared.<\/p>\n<p>Several new members became intrinsics in .NET 9. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96226\">dotnet\/runtime#96226<\/a> turns typeof(T).IsPrimitive into an intrinsic, allowing the JIT to substitute a constant for the expression, which in turn allows branches to be eliminated, along with whole swaths of then-dead code. For example, as part of moving to the next value, Parallel.ForAsync has a code path that looks like this:<\/p>\n<p>if (typeof(T).IsPrimitive)<br \/>\n{<br \/>\n    UseInterlockedCompareExchangeToAdvance();<br \/>\n}<br \/>\nelse<br \/>\n{<br \/>\n    UseALockAroundAReadIncrementStoreToAdvance();<br \/>\n}<\/p>\n<p>With IsPrimitive as an intrinsic, that if\/else will reduce entirely to either:<\/p>\n<p>UseInterlockedCompareExchangeToAdvance();<\/p>\n<p>or<\/p>\n<p>UseALockAroundAReadIncrementStoreToAdvance();<\/p>\n<p>based on the nature of T.<\/p>\n<p>typeof(T).IsGenericType and typeof(T).GetGenericTypeDefinition were also made into intrinsics, by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99555\">dotnet\/runtime#99555<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103528\">dotnet\/runtime#103528<\/a>, respectively. Imagine code like that in ASP.NET where it wants to special-case APIs that return Task&lt;T&gt; vs ValueTask&lt;T&gt; vs IAsyncEnumerable&lt;T&gt; vs T vs other types; it\u2019ll often use members like IsGenericType and GetGenericTypeDefinition (which will throw an exception if IsGenericType is false) to determine whether a concrete instantiation of a generic type is the one in question. 
With this benchmark:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public bool Test() =&gt; IsTaskT&lt;Task&lt;string&gt;&gt;();<\/p>\n<p>    private static bool IsTaskT&lt;T&gt;() =&gt;<br \/>\n        typeof(T).IsGenericType &amp;&amp;<br \/>\n        typeof(T).GetGenericTypeDefinition() == typeof(Task&lt;&gt;);<br \/>\n}<\/p>\n<p>on .NET 8 we end up with over 250 bytes of assembly code for implementing this operation. On .NET 9, we get just this:<\/p>\n<p>; Tests.Test()<br \/>\n       mov       eax,1<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>The magic of intrinsics.<\/p>\n<h2>Numerics<\/h2>\n<h3>Primitive Types<\/h3>\n<p>The core data types in .NET sit at the very bottom of the stack and are used everywhere. It\u2019s thus a desire every release to whittle away at any overheads we can avoid. .NET 9 is no exception, where a multitude of PRs have gone into reducing overheads of various operations on these core types.<\/p>\n<p>Consider DateTime. When it comes to performance optimization, we typically focus on the happy path, on the \u201chot path,\u201d on the successful path. Exceptions already add significant expense to error paths, and are intended to be \u201cexceptional\u201d and relatively rare, and so we generally don\u2019t worry about an extra operation here or an extra allocation there. But, sometimes, one type\u2019s error path is another type\u2019s success path. This is especially true with Try methods, where failure is conveyed via a bool rather than with an expensive exception. 
As part of profiling a commonly-used .NET library, the profiler highlighted some unexpected allocations coming from DateTime handling, unexpected because we\u2019ve spent a lot of time over the years trying to eliminate allocations in this area of the code. The allocation, it turned out, was occurring on an error path, both with DateTime.Parse when an exception would be thrown, and <em>also<\/em> with DateTime.TryParse when false would be returned. As it happened, deep in the call tree where parsing work is going on, if an error is encountered, the code stores information about the failure (e.g. a ParseFailureKind enum value); after unwinding the call stack back to the public method, Parse uses that to throw an appropriately-detailed exception, while TryParse just ignores it and returns false. But the way the code was written, that enum value would end up getting boxed when it was stored, resulting in an allocation as part of TryParse returning false. The consuming library was using TryParse on a bunch of different primitive types as part of interpreting the data, e.g.<\/p>\n<p>if (int.TryParse(value, out int parsedInt32)) { &#8230; }<br \/>\nelse if (DateTime.TryParse(value, out DateTime parsedDateTime)) { &#8230; }<br \/>\nelse if (double.TryParse(value, out double parsedDouble)) { &#8230; }<br \/>\nelse if &#8230;<\/p>\n<p>such that <em>its<\/em> success path might include the failure path from some number of these primitives\u2019 TryParse methods. 
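The boxing itself is easy to reproduce in isolation: storing an enum into an object-typed location allocates a box, while storing it into a field of the enum's own type does not. The following is a hedged sketch of the general shape, not the actual CoreLib code; ParseFailure, ParseResultOld, and ParseResultNew are hypothetical names introduced here for illustration.

```csharp
using System;

// Hypothetical stand-in for an internal failure-kind enum.
enum ParseFailure { None, Format, Overflow }

// "Before" shape: the failure kind lands in an object-typed slot,
// so every assignment boxes the enum (one allocation per failure).
struct ParseResultOld
{
    private object? _failure;
    public void SetFailure(ParseFailure kind) => _failure = kind; // boxes
    public ParseFailure Failure => _failure is ParseFailure k ? k : ParseFailure.None;
}

// "After" shape: a field of the enum's own type, so storing the
// failure kind allocates nothing, and a TryParse built on top of it
// can return false without touching the GC heap.
struct ParseResultNew
{
    private ParseFailure _failure;
    public void SetFailure(ParseFailure kind) => _failure = kind; // no allocation
    public ParseFailure Failure => _failure;
}
```

Both shapes behave identically to the caller; only the allocation profile of the failure path differs.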
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91303\">dotnet\/runtime#91303<\/a> tweaked how the information is stored to avoid that boxing, while also reducing a bit of additional overhead along the way.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;input&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&#8220;hello&#8221;)]<br \/>\n    public bool TryParse(string input) =&gt; DateTime.TryParse(input, out _);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>TryParse<br \/>\n.NET 8.0<br \/>\n31.95 ns<br \/>\n1.00<br \/>\n24 B<br \/>\n1.00<\/p>\n<p>TryParse<br \/>\n.NET 9.0<br \/>\n25.96 ns<br \/>\n0.81<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Both DateTime and TimeSpan also saw parsing and formatting gains from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101640\">dotnet\/runtime#101640<\/a> from <a href=\"https:\/\/github.com\/lilinus\">@lilinus<\/a>. The PR takes advantage of an existing internal CountDigits helper that was optimized in .NET 8 as part of integer parsing; it employs a lookup table to compute the number of digits that will be required for a number, doing so in just a few instructions. 
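The digit-counting trick can be sketched as follows. This is a simplified reconstruction rather than the exact CoreLib code (the real helper hardcodes its table), but it shows the idea: a table lookup indexed by Log2, plus an add and a shift, yields the decimal digit count without any loops or divisions.

```csharp
using System;
using System.Numerics;

static class DigitCount
{
    // For each Log2 bucket j (values in [2^j, 2^(j+1))), let d be the
    // digit count of 2^j. If the next power of ten falls inside the
    // bucket, store ((d + 1) << 32) - 10^d, so adding the value and
    // shifting right by 32 selects d or d + 1; otherwise store d << 32.
    private static readonly ulong[] s_table = CreateTable();

    private static ulong[] CreateTable()
    {
        var table = new ulong[32];
        for (int j = 0; j < 32; j++)
        {
            int d = (1UL << j).ToString().Length; // digits of 2^j
            ulong pow10 = 1;
            for (int i = 0; i < d; i++) pow10 *= 10; // 10^d
            ulong bucketEnd = j == 31 ? 1UL << 32 : 1UL << (j + 1);
            table[j] = pow10 < bucketEnd
                ? (((ulong)d + 1) << 32) - pow10
                : (ulong)d << 32;
        }
        return table;
    }

    // One Log2, one table load, one add, one shift.
    public static int CountDigits(uint value) =>
        (int)((value + s_table[BitOperations.Log2(value)]) >> 32);
}
```

For example, CountDigits(999999) takes the Log2-19 bucket, where the stored correction makes values below 1,000,000 resolve to 6 digits and values at or above it to 7.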
And it replaces a switch with a lookup table as part of computing powers of ten, replacing a method like Pow10_Old in this benchmark with one more like Pow10_New:<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;input&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int _pow = 3;<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public long Pow10_Old() =&gt;<br \/>\n        _pow switch<br \/>\n        {<br \/>\n            0 =&gt; 1,<br \/>\n            1 =&gt; 10,<br \/>\n            2 =&gt; 100,<br \/>\n            3 =&gt; 1000,<br \/>\n            4 =&gt; 10000,<br \/>\n            5 =&gt; 100000,<br \/>\n            6 =&gt; 1000000,<br \/>\n            _ =&gt; 10000000, \/\/ _pow will never be greater than 7<br \/>\n        };<\/p>\n<p>    [Benchmark]<br \/>\n    public long Pow10_New()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;int&gt; powersOfTen =<br \/>\n        [<br \/>\n            1,<br \/>\n            10,<br \/>\n            100,<br \/>\n            1000,<br \/>\n            10000,<br \/>\n            100000,<br \/>\n            1000000,<br \/>\n            10000000, \/\/ _pow will never be greater than 7<br \/>\n        ];<br \/>\n        return powersOfTen[_pow];<br \/>\n    }<br \/>\n}<\/p>\n<p>The JIT is able to do a bit better job with the latter, for the former producing:<\/p>\n<p>; Tests.Pow10_Old()<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\nM00_L00:<br \/>\n       mov       ecx,[rdi+8]<br \/>\n       cmp       ecx,3<br \/>\n       jne       short M00_L02<br \/>\n       mov       edx,3E8<br \/>\nM00_L01:<br \/>\n       movsxd    rax,edx<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L02:<br \/>\n       cmp       
ecx,6<br \/>\n       ja        short M00_L03<br \/>\n       mov       edx,ecx<br \/>\n       lea       rax,[7F3D29A690E8]<br \/>\n       mov       eax,[rax+rdx*4]<br \/>\n       lea       rcx,[M00_L00]<br \/>\n       add       rax,rcx<br \/>\n       jmp       rax<br \/>\nM00_L03:<br \/>\n       mov       edx,989680<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,0F4240<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,186A0<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,2710<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,64<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,0A<br \/>\n       jmp       short M00_L01<br \/>\n       mov       edx,1<br \/>\n       jmp       short M00_L01<br \/>\n; Total bytes of code 100<\/p>\n<p>but for the latter producing:<\/p>\n<p>; Tests.Pow10_New()<br \/>\n       push      rax<br \/>\n       mov       eax,[rdi+8]<br \/>\n       cmp       eax,8<br \/>\n       jae       short M00_L00<br \/>\n       mov       rcx,7F3CC0AE6018<br \/>\n       movsxd    rax,dword ptr [rcx+rax*4]<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       call      CORINFO_HELP_RNGCHKFAIL<br \/>\n       int       3<br \/>\n; Total bytes of code 34<\/p>\n<p>The net result is a nice improvement to these operations, e.g.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _input = TimeSpan.FromMilliseconds(12345.6789).ToString();<\/p>\n<p>    [Benchmark]<br \/>\n    public TimeSpan Parse() =&gt; TimeSpan.Parse(_input);<br 
\/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Parse<br \/>\n.NET 8.0<br \/>\n137.55 ns<br \/>\n1.00<\/p>\n<p>Parse<br \/>\n.NET 9.0<br \/>\n117.78 ns<br \/>\n0.86<\/p>\n<p>Various operations on the primitive types were also improved across a plethora of PRs:<\/p>\n<p><strong>Round.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98186\">dotnet\/runtime#98186<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a> optimized the various Math.Round and MathF.Round overloads (which are the same implementations as double.Round and float.Round).<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private double _value = 12345.6789;<\/p>\n<p>    [Benchmark]<br \/>\n    public double RoundDigits() =&gt; Math.Round(_value, 2);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>RoundDigits<br \/>\n.NET 8.0<br \/>\n1.6930 ns<br \/>\n1.00<\/p>\n<p>RoundDigits<br \/>\n.NET 9.0<br \/>\n0.3496 ns<br \/>\n0.21<\/p>\n<p><strong>SinCos.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103724\">dotnet\/runtime#103724<\/a> updated Math.SinCos and MathF.SinCos to use the internal RuntimeHelpers.IsKnownConstant intrinsic. This method enables code in CoreLib to check whether the argument to the method is coming in as a constant the JIT can see, at which point the implementation might choose to do something special. 
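The shape of that pattern looks something like the following sketch. RuntimeHelpers.IsKnownConstant is internal to CoreLib, so the stand-in below just illustrates the idea: in the real implementation the JIT replaces the call with true when the argument is a JIT-time constant, and false otherwise; here it is always false.

```csharp
using System;

static class SinCosSketch
{
    // Stand-in for the internal RuntimeHelpers.IsKnownConstant. In
    // CoreLib, the JIT substitutes a constant true/false here; this
    // illustrative version always returns false.
    private static bool IsKnownConstant(float value) => false;

    public static (float Sin, float Cos) SinCos(float x)
    {
        if (IsKnownConstant(x))
        {
            // Constant input: each call constant-folds independently,
            // so the whole tuple becomes a JIT-time constant.
            return (MathF.Sin(x), MathF.Cos(x));
        }

        // Non-constant input: the real implementation shares work
        // between the sine and cosine computations; this sketch just
        // calls both.
        return (MathF.Sin(x), MathF.Cos(x));
    }
}
```

The point of the branch is purely for the JIT: when the argument is constant, the first path lets the constant-folding of Sin and Cos kick in; otherwise the shared-computation path runs as usual.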
In this case, the Sin and Cos functions are already capable of producing constant results for constant input, so rather than doing the normal implementation, which tries to reuse most of the computation that\u2019s shared between Sin and Cos, it instead just calls to each, knowing that the constant input for each will result in a constant output overall.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public float Sum() =&gt; SumSinCos(123.456f);<\/p>\n<p>    private float SumSinCos(float f)<br \/>\n    {<br \/>\n        (float sin, float cos) = MathF.SinCos(f);<br \/>\n        return sin + cos;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<\/p>\n<p>Sum<br \/>\n.NET 8.0<br \/>\n5.2719 ns<br \/>\n1.000<br \/>\n46 B<\/p>\n<p>Sum<br \/>\n.NET 9.0<br \/>\n0.0177 ns<br \/>\n0.003<br \/>\n9 B<\/p>\n<p>In cases like this, it\u2019s helpful to pay attention to the warnings BenchmarkDotNet issues:<\/p>\n<p>\/\/ * Warnings *<br \/>\nZeroMeasurement<br \/>\n  Tests.Sum: Runtime=.NET 9.0, Toolchain=net9.0 -&gt; The method duration is indistinguishable from the empty method duration<\/p>\n<p>The .NET 9 run is indistinguishable from an empty method because it <em>is<\/em> an empty method, or at least a method that just returns a constant. We can see that by looking at the disassembly. 
The .NET 8 code has a few moves and loads and then calls to SinCos:<\/p>\n<p>; Tests.Sum()<br \/>\n       push      rax<br \/>\n       vzeroupper<br \/>\n       vmovss    xmm0,dword ptr [7F7686979610]<br \/>\n       lea       rdi,[rsp+4]<br \/>\n       lea       rsi,[rsp]<br \/>\n       call      System.MathF.SinCos(Single, Single*, Single*)<br \/>\n       vmovss    xmm0,dword ptr [rsp+4]<br \/>\n       vmovss    xmm1,dword ptr [rsp]<br \/>\n       vaddss    xmm0,xmm0,xmm1<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 46<\/p>\n<p>In contrast, here\u2019s .NET 9:<\/p>\n<p>; Tests.Sum()<br \/>\n       vmovss    xmm0,dword ptr [7F4D10FD9080]<br \/>\n       ret<br \/>\n; Total bytes of code 9<\/p>\n<p>It\u2019s simply loading a value and returning it, since the whole operation compiled down to a constant.<\/p>\n<p><strong>Enum.{Try}Parse.<\/strong> Interop scenarios drove the introduction of two new RuntimeHelpers APIs, SizeOf in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100618\">dotnet\/runtime#100618<\/a> and Box in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100561\">dotnet\/runtime#100561<\/a>. But, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100846\">dotnet\/runtime#100846<\/a> was then able to utilize these APIs to optimize the implementation of the non-generic Enum.Parse and Enum.TryParse overloads, which give back the parsed enum value as object. 
This is a special kind of boxing, because the parse methods internally extract a numerical value but then need the boxed object to be of the enum type (rather than the numerical type) specified via the Type argument.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _input = &#8220;Monday&#8221;;<\/p>\n<p>    [Benchmark]<br \/>\n    public object Parse() =&gt; Enum.Parse(typeof(DayOfWeek), _input);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Parse<br \/>\n.NET 8.0<br \/>\n62.01 ns<br \/>\n1.00<\/p>\n<p>Parse<br \/>\n.NET 9.0<br \/>\n28.13 ns<br \/>\n0.45<\/p>\n<p><strong>Integer Division.<\/strong> Consider this benchmark:<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(5)]<br \/>\n    public uint DivideBy4_UInt32(uint value) =&gt; value \/ 4;<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(5)]<br \/>\n    public int DivideBy4_Int32(int value) =&gt; value \/ 4;<br \/>\n}<\/p>\n<p>With the uint-based example, dividing by 4 is already optimized into a simple right shift, since for a uint, value \/ 4 and value &gt;&gt; 2 are functionally equivalent. 
However, that\u2019s not the case for an int, or at least, not always. For a non-negative int, the same optimization could be employed, but if the int is negative, for some values switching from value \/ 4 to value &gt;&gt; 2 would be functionally incorrect. Consider -5 \/ 4\u2026 the answer is -1. But -5 &gt;&gt; 2 is -2. Oops. So when you look at the assembly code for the int case (here on .NET 8), it\u2019s more complex:<\/p>\n<p>; Tests.DivideBy4_UInt32(UInt32)<br \/>\n       mov       eax,esi<br \/>\n       shr       eax,2<br \/>\n       ret<br \/>\n; Total bytes of code 6<\/p>\n<p>; Tests.DivideBy4_Int32(Int32)<br \/>\n       mov       eax,esi<br \/>\n       sar       eax,1F<br \/>\n       and       eax,3<br \/>\n       add       eax,esi<br \/>\n       sar       eax,2<br \/>\n       ret<br \/>\n; Total bytes of code 14<\/p>\n<p>Given that, you might hope that if the compiler could prove that the int was non-negative, it could still employ the simpler shifting:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(5)]<br \/>\n    public int DivideBy4_Int32(int value) =&gt; value &lt; 4 ? 
0 : value \/ 4;<br \/>\n}<\/p>\n<p>But alas, on .NET 8, we still get:<\/p>\n<p>; Tests.DivideBy4_Int32(Int32)<br \/>\n       cmp       esi,4<br \/>\n       jl        short M00_L00<br \/>\n       mov       eax,esi<br \/>\n       sar       eax,1F<br \/>\n       and       eax,3<br \/>\n       add       eax,esi<br \/>\n       sar       eax,2<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       xor       eax,eax<br \/>\n       ret<br \/>\n; Total bytes of code 22<\/p>\n<p>On .NET 9, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/94347\">dotnet\/runtime#94347<\/a> updates the JIT for exactly that, replacing signed division with unsigned division if it can prove that both the numerator and denominator are non-negative.<\/p>\n<p>; Tests.DivideBy4_Int32(Int32)<br \/>\n       cmp       esi,4<br \/>\n       jl        short M00_L00<br \/>\n       mov       eax,esi<br \/>\n       shr       eax,2<br \/>\n       ret<br \/>\nM00_L00:<br \/>\n       xor       eax,eax<br \/>\n       ret<br \/>\n; Total bytes of code 14<\/p>\n<p><strong>Nullable.<\/strong> Several optimizations went into making Nullable&lt;T&gt; cheaper, in particular when used with generics. 
Consider this benchmark:<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public void TestStruct() =&gt; Dispose&lt;List&lt;int&gt;.Enumerator&gt;(default);<\/p>\n<p>    [Benchmark]<br \/>\n    public void TestNullableStruct() =&gt; Dispose&lt;List&lt;int&gt;.Enumerator?&gt;(default);<\/p>\n<p>    [MethodImpl(MethodImplOptions.AggressiveInlining)]<br \/>\n    private static void Dispose&lt;T&gt;(T t)<br \/>\n    {<br \/>\n        if (t is IDisposable)<br \/>\n        {<br \/>\n            ((IDisposable)t).Dispose();<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>We have an unconstrained generic method Dispose whose job it is to cast the argument to IDisposable and invoke its Dispose. While such an operation would seemingly box if T were a value type, for a long time now the JIT has had optimizations that end up eliminating that boxing. In the case of the List&lt;T&gt;.Enumerator, its Dispose implementation is a nop, so with Dispose&lt;T&gt; getting inlined, no boxing, and the IDisposable.Dispose implementation nop\u2019ing, this whole method is a nop (on both .NET 8 and .NET 9):<\/p>\n<p>; Tests.TestStruct()<br \/>\n       ret<br \/>\n; Total bytes of code 1<\/p>\n<p>That\u2019s unfortunately not the case for TestNullableStruct. The <em>only<\/em> difference between TestStruct and TestNullableStruct is that pesky ? in the generic type argument, which means T will be a Nullable&lt;List&lt;int&gt;.Enumerator&gt; rather than List&lt;int&gt;.Enumerator. That complicates things. 
Nullable&lt;T&gt; is very special, with a boxed nullable implementing the same interfaces as does the underlying struct, but it ends up being very hard for the JIT to deal with. On .NET 8, we end up with this assembly:<\/p>\n<p>; Tests.TestNullableStruct()<br \/>\n       push      rbp<br \/>\n       sub       rsp,20<br \/>\n       lea       rbp,[rsp+20]<br \/>\n       vxorps    xmm8,xmm8,xmm8<br \/>\n       vmovdqa   xmmword ptr [rbp-20],xmm8<br \/>\n       vmovdqa   xmmword ptr [rbp-10],xmm8<br \/>\n       cmp       byte ptr [rbp-20],0<br \/>\n       jne       short M00_L01<br \/>\nM00_L00:<br \/>\n       add       rsp,20<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       lea       rsi,[rbp-20]<br \/>\n       mov       rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]<br \/>\n       call      CORINFO_HELP_BOX_NULLABLE<br \/>\n       mov       rsi,rax<br \/>\n       mov       rdi,offset MT_System.IDisposable<br \/>\n       call      qword ptr [7F178F2543C0]; System.Runtime.CompilerServices.CastHelpers.ChkCastInterface(Void*, System.Object)<br \/>\n       mov       rdi,rax<br \/>\n       mov       r11,7F178E5904C0<br \/>\n       call      qword ptr [r11]<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 93<\/p>\n<p>That\u2019s a whole lot more than a ret. 
Thankfully, for .NET 9, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95764\">dotnet\/runtime#95764<\/a> makes this better by optimizing castclass for Nullable&lt;T&gt;:<\/p>\n<p>; Tests.TestNullableStruct()<br \/>\n       sub       rsp,28<br \/>\n       xor       eax,eax<br \/>\n       mov       [rsp+8],rax<br \/>\n       vxorps    xmm8,xmm8,xmm8<br \/>\n       vmovdqa   xmmword ptr [rsp+10],xmm8<br \/>\n       mov       [rsp+20],rax<br \/>\n       cmp       byte ptr [rsp+8],0<br \/>\n       jne       short M00_L01<br \/>\nM00_L00:<br \/>\n       add       rsp,28<br \/>\n       ret<br \/>\nM00_L01:<br \/>\n       lea       rsi,[rsp+8]<br \/>\n       mov       rdi,offset MT_System.Nullable`1[[System.Collections.Generic.List`1+Enumerator[[System.Int32, System.Private.CoreLib]], System.Private.CoreLib]]<br \/>\n       call      CORINFO_HELP_BOX_NULLABLE<br \/>\n       movsx     rax,byte ptr [rax+8]<br \/>\n       jmp       short M00_L00<br \/>\n; Total bytes of code 66<\/p>\n<p>We still have the call to CORINFO_HELP_BOX_NULLABLE, but the relatively expensive call to ChkCastInterface is now gone. While this may seem a little corner case, it actually shows up in well-known places. For example:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int? _value = 42;<\/p>\n<p>    [Benchmark]<br \/>\n    public string Interpolate() =&gt; $&#8221;{_value}&#8221;;<br \/>\n}<\/p>\n<p>Here we\u2019re just doing string interpolation, using a nullable value type as one of the arguments. 
The DefaultInterpolatedStringHandler has a generic AppendFormatted method which this will end up using, passing the Nullable&lt;int&gt; as its argument, and that method employs similar patterns of type testing for an interface and using it if it\u2019s available. And as a result, this optimization can have a measurable impact on such interpolated string use:<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Interpolate<br \/>\n.NET 8.0<br \/>\n78.12 ns<br \/>\n1.00<\/p>\n<p>Interpolate<br \/>\n.NET 9.0<br \/>\n62.95 ns<br \/>\n0.81<\/p>\n<p>Another Nullable&lt;T&gt;-related optimization is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95711\">dotnet\/runtime#95711<\/a>, which ends up avoiding boxing for some forms of type testing. Consider this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int? _value = 42;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Test() =&gt; IsInt(_value);<\/p>\n<p>    private static bool IsInt&lt;T&gt;(T value) =&gt; value is int;<br \/>\n}<\/p>\n<p>This should be relatively straightforward: the JIT can see that T is a Nullable&lt;int&gt;, and then whether it satisfies the type test is a question of whether the value is null or not, since if it\u2019s null, it\u2019s not an int, and if it\u2019s not null, it is an int. 
Unfortunately, on .NET 8, not so much:<\/p>\n<p>; Tests.Test()<br \/>\n       push      rbp<br \/>\n       sub       rsp,10<br \/>\n       lea       rbp,[rsp+10]<br \/>\n       mov       rsi,[rdi+8]<br \/>\n       mov       [rbp-8],rsi<br \/>\n       lea       rsi,[rbp-8]<br \/>\n       mov       rdi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]<br \/>\n       call      CORINFO_HELP_BOX_NULLABLE<br \/>\n       test      rax,rax<br \/>\n       je        short M00_L00<br \/>\n       mov       rcx,offset MT_System.Int32<br \/>\n       xor       edx,edx<br \/>\n       cmp       [rax],rcx<br \/>\n       cmovne    rax,rdx<br \/>\nM00_L00:<br \/>\n       test      rax,rax<br \/>\n       setne     al<br \/>\n       movzx     eax,al<br \/>\n       add       rsp,10<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 76<\/p>\n<p>In fact, we can see it\u2019s using CORINFO_HELP_BOX_NULLABLE to box the Nullable&lt;int&gt;, which means we actually end up with an allocation as part of this type test. 
And that\u2019s visible in the benchmark results:<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Test<br \/>\n.NET 8.0<br \/>\n39.1567 ns<br \/>\n1.000<br \/>\n76 B<br \/>\n24 B<br \/>\n1.00<\/p>\n<p>Test<br \/>\n.NET 9.0<br \/>\n0.0006 ns<br \/>\n0.000<br \/>\n5 B<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>On .NET 9, it ends up being what we thought it should be, a simple null check:<\/p>\n<p>; Tests.Test()<br \/>\n       movzx     eax,byte ptr [rdi+8]<br \/>\n       ret<br \/>\n; Total bytes of code 5<\/p>\n<p>where the result of the method is simply Nullable&lt;T&gt;.HasValue.<\/p>\n<p>As a small tangent since we\u2019re talking about optimizing casting, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98284\">dotnet\/runtime#98284<\/a> improves code generation for casts where the JIT can end up seeing that the object being cast is null (while you\u2019d probably never explicitly write if (null is SomeClass), you might very well write if (GetObject() is SomeClass) where GetObject() might get inlined and return null, especially if GetObject() is virtual and, due to dynamic PGO, a null-returning override gets inlined).<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public Tests? NullCast() =&gt; GetObj() as Tests;<\/p>\n<p>    private object? 
GetObj() =&gt; null;<br \/>\n}<\/p>\n<p>On .NET 8, it doesn\u2019t pay attention to whether it knows that the source will be null, but now in .NET 9, it does:<\/p>\n<p>\/\/ .NET 8<br \/>\n; Tests.NullCast()<br \/>\n       push      rax<br \/>\n       mov       rdi,offset MT_Tests<br \/>\n       xor       esi,esi<br \/>\n       call      qword ptr [7F0457E24360]; System.Runtime.CompilerServices.CastHelpers.IsInstanceOfClass(Void*, System.Object)<br \/>\n       nop<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 25<\/p>\n<p>\/\/ .NET 9<br \/>\n; Tests.NullCast()<br \/>\n       push      rax<br \/>\n       xor       eax,eax<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 8<\/p>\n<p>Back to Nullable&lt;T&gt;, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105073\">dotnet\/runtime#105073<\/a> enables the JIT to inline the fast path of the unboxing helper that\u2019s used when extracting a Nullable&lt;T&gt; from an object. There\u2019s a CORINFO_HELP_UNBOX_NULLABLE helper function that\u2019s called to perform the unboxing (e.g. (int?)o for some object o), but the success path (where the object is either null or the boxed target type) is small and it\u2019s worth inlining that.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private object _o = 42;<\/p>\n<p>    [Benchmark]<br \/>\n    public int? 
Unbox() =&gt; (int?)_o;<br \/>\n}<\/p>\n<p>On .NET 8, we get the following, effectively just a call to CORINFO_HELP_UNBOX_NULLABLE:<\/p>\n<p>; Tests.Unbox()<br \/>\n       push      rax<br \/>\n       mov       rdx,[rdi+8]<br \/>\n       lea       rdi,[rsp]<br \/>\n       mov       rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]<br \/>\n       call      CORINFO_HELP_UNBOX_NULLABLE<br \/>\n       mov       rax,[rsp]<br \/>\n       add       rsp,8<br \/>\n       ret<br \/>\n; Total bytes of code 33<\/p>\n<p>whereas on .NET 9, we get the following, which is creating a default Nullable&lt;int&gt; if the object is null, or a Nullable&lt;int&gt; with the value from the object if it\u2019s a boxed int, or calling CORINFO_HELP_UNBOX_NULLABLE if it\u2019s something else (in which case we\u2019ll be throwing an exception shortly):<\/p>\n<p>; Tests.Unbox()<br \/>\n       push      rbp<br \/>\n       sub       rsp,10<br \/>\n       lea       rbp,[rsp+10]<br \/>\n       mov       rdx,[rdi+8]<br \/>\n       test      rdx,rdx<br \/>\n       jne       short M00_L00<br \/>\n       xor       edx,edx<br \/>\n       mov       [rbp-8],rdx<br \/>\n       jmp       short M00_L01<br \/>\nM00_L00:<br \/>\n       mov       rax,offset MT_System.Int32<br \/>\n       cmp       [rdx],rax<br \/>\n       jne       short M00_L02<br \/>\n       mov       byte ptr [rbp-8],1<br \/>\n       mov       eax,[rdx+8]<br \/>\n       mov       [rbp-4],eax<br \/>\nM00_L01:<br \/>\n       mov       rax,[rbp-8]<br \/>\n       add       rsp,10<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\nM00_L02:<br \/>\n       lea       rdi,[rbp-8]<br \/>\n       mov       rsi,offset MT_System.Nullable`1[[System.Int32, System.Private.CoreLib]]<br \/>\n       call      CORINFO_HELP_UNBOX_NULLABLE<br \/>\n       jmp       short M00_L01<br \/>\n; Total bytes of code 83<\/p>\n<p>This is one of those cases where you actually want the code to be larger, at least for the micro-benchmark, because the 
inlining is the purpose and is bringing in more code.<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<\/p>\n<p>Unbox<br \/>\n.NET 8.0<br \/>\n6.014 ns<br \/>\n1.00<br \/>\n33 B<\/p>\n<p>Unbox<br \/>\n.NET 9.0<br \/>\n2.854 ns<br \/>\n0.47<br \/>\n83 B<\/p>\n<h3>BigInteger<\/h3>\n<p>Not exactly a \u201cprimitive\u201d type, but in the same ballpark, is BigInteger. As with sbyte, short, int, and long, System.Numerics.BigInteger is an IBinaryInteger&lt;&gt; and ISignedNumber&lt;&gt;. Unlike those types, which are all of a fixed bit size (8, 16, 32, and 64 bits, respectively), BigInteger can represent signed integers with any number of bits (within reason\u2026 the current representation allows up to Array.MaxLength \/ 64 bits, which means representing numbers up to 2^33,554,432\u2026 that\u2019s\u2026 big). Such large sizes bring with them performance complexities, and historically BigInteger hasn\u2019t been a beacon of high throughput. While there\u2019s still more that can be done (and in fact there are several pending PRs even as I write this), a bunch of nice changes have landed for .NET 9.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91176\">dotnet\/runtime#91176<\/a> from <a href=\"https:\/\/github.com\/Rob-Hague\">@Rob-Hague<\/a> improved BigInteger\u2019s byte-based constructors (e.g. public BigInteger(byte[] value)) by utilizing vectorized operations from MemoryMarshal and BinaryPrimitives. In particular, a lot of the time spent in these BigInteger constructors is in walking the list of bytes, building up integers out of each grouping of four, and storing those into a destination uint[].
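To make the before and after concrete, here is a simplified sketch of the two approaches, the per-group accumulation and the bulk reinterpret-and-copy that replaces it. This is illustrative only, not the actual BigInteger code, and it assumes a little-endian machine and an input length that is a multiple of four:

```csharp
using System;
using System.Runtime.InteropServices;

public class ByteToUIntDemo
{
    // Old-style approach: build each uint from four little-endian bytes,
    // one group at a time, byte by byte.
    public static uint[] ManualGroups(byte[] bytes)
    {
        var result = new uint[bytes.Length / 4]; // assumes length % 4 == 0 for simplicity
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (uint)(bytes[i * 4]
                | (bytes[i * 4 + 1] << 8)
                | (bytes[i * 4 + 2] << 16)
                | (bytes[i * 4 + 3] << 24));
        }
        return result;
    }

    // Span-based approach: view the uint[] as a span of bytes and do one bulk copy.
    // On little-endian hardware this produces the same values as the loop above.
    public static uint[] BulkCopy(byte[] bytes)
    {
        var result = new uint[bytes.Length / 4];
        bytes.AsSpan().CopyTo(MemoryMarshal.AsBytes(result.AsSpan()));
        return result;
    }
}
```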
With spans, however, that whole operation is unnecessary and can be achieved with an optimized CopyTo operation (effectively a memcpy) with the destination just being that uint[] reinterpreted as a span of bytes.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<br \/>\nusing System.Security.Cryptography;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _bytes;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _bytes = new byte[10_000];<br \/>\n        new Random(42).NextBytes(_bytes);<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public BigInteger NewBigInteger() =&gt; new BigInteger(_bytes);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>NewBigInteger<br \/>\n.NET 8.0<br \/>\n5.886 us<br \/>\n1.00<\/p>\n<p>NewBigInteger<br \/>\n.NET 9.0<br \/>\n1.434 us<br \/>\n0.24<\/p>\n<p>Parsing is another common way of creating BigIntegers. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95543\">dotnet\/runtime#95543<\/a> improved the performance of parsing hex and binary-formatted values (this is on top of the .NET 9 addition in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85392\">dotnet\/runtime#85392<\/a> from <a href=\"https:\/\/github.com\/lateapexearlyspeed\">@lateapexearlyspeed<\/a> that added support for the &#8220;b&#8221; format specifier for formatting and parsing BigInteger as binary). 
Previously, parsing would go digit-by-digit, but with the new algorithm, it parses multiple chars at the same time, using a vectorized implementation for larger inputs.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Globalization;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _hex = string.Create(1024, 0, (dest, _) =&gt; new Random(42).GetItems&lt;char&gt;(&#8220;0123456789abcdef&#8221;, dest));<\/p>\n<p>    [Benchmark]<br \/>\n    public BigInteger ParseHex() =&gt; BigInteger.Parse(_hex, NumberStyles.HexNumber);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>ParseHex<br \/>\n.NET 8.0<br \/>\n5,155.5 ns<br \/>\n1.00<br \/>\n5208 B<br \/>\n1.00<\/p>\n<p>ParseHex<br \/>\n.NET 9.0<br \/>\n236.8 ns<br \/>\n0.05<br \/>\n536 B<br \/>\n0.10<\/p>\n<p>This isn\u2019t the first time efforts have been made to improve BigInteger parsing. .NET 7, for example, included a change that introduced a new parsing algorithm. The previous algorithm was O(N^2) in the number of digits, and the new algorithm had a lower algorithmic complexity, but due to the constants involved was only worthwhile with a larger number of digits. Both algorithms were included, switching between them based on a cut-off of 20,000 digits. 
As it turns out, with more analysis, that threshold was significantly higher than it needed to be, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97101\">dotnet\/runtime#97101<\/a> from <a href=\"https:\/\/github.com\/kzrnm\">@kzrnm<\/a> lowered the threshold to a much smaller value (1233). On top of this, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97589\">dotnet\/runtime#97589<\/a> from <a href=\"https:\/\/github.com\/kzrnm\">@kzrnm<\/a> improves parsing further by a) recognizing that the multiplier being used during parsing (shifting down digits to make room for adding in the next set) includes many leading zeros that can be ignored during the operation, and b) trailing zeros when parsing powers of 10 could be calculated more efficiently.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _digits = string.Create(2000, 0, (dest, _) =&gt; new Random(42).GetItems&lt;char&gt;(&#8220;0123456789&#8221;, dest));<\/p>\n<p>    [Benchmark]<br \/>\n    public BigInteger ParseDecimal() =&gt; BigInteger.Parse(_digits);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>ParseDecimal<br \/>\n.NET 8.0<br \/>\n24.60 us<br \/>\n1.00<br \/>\n5528 B<br \/>\n1.00<\/p>\n<p>ParseDecimal<br \/>\n.NET 9.0<br \/>\n18.95 us<br \/>\n0.77<br \/>\n856 B<br \/>\n0.15<\/p>\n<p>Once you have a BigInteger, there are of course various operations you can do with it. 
BigInteger.Equals was improved by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91416\">dotnet\/runtime#91416<\/a> from <a href=\"https:\/\/github.com\/Rob-Hague\">@Rob-Hague<\/a>, which changed the implementation to use the optimized MemoryExtensions.SequenceEqual rather than walking the arrays backing each BigInteger element-by-element. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104513\">dotnet\/runtime#104513<\/a> from <a href=\"https:\/\/github.com\/Rob-Hague\">@Rob-Hague<\/a> improved BigInteger.IsPowerOfTwo by similarly replacing a manual walk of the elements with a call to ContainsAnyExcept, looking to see whether all elements after a certain point were 0.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private BigInteger _value1, _value2;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        var value1 = new byte[10_000];<br \/>\n        new Random(42).NextBytes(value1);<\/p>\n<p>        _value1 = new BigInteger(value1);<br \/>\n        _value2 = new BigInteger(value1.AsSpan().ToArray());<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Equals() =&gt; _value1 == _value2;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Equals<br \/>\n.NET 8.0<br \/>\n1,110.38 ns<br \/>\n1.00<\/p>\n<p>Equals<br \/>\n.NET 9.0<br \/>\n79.80 ns<br \/>\n0.07<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92208\">dotnet\/runtime#92208<\/a> from <a href=\"https:\/\/github.com\/kzrnm\">@kzrnm<\/a> also improved BigInteger.Multiply, and in particular 
when multiplying a first value that\u2019s much larger than a second value.<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private BigInteger _value1 = BigInteger.Parse(string.Concat(Enumerable.Repeat(&#8220;1234567890&#8221;, 1000)));<br \/>\n    private BigInteger _value2 = BigInteger.Parse(string.Concat(Enumerable.Repeat(&#8220;1234567890&#8221;, 300)));<\/p>\n<p>    [Benchmark]<br \/>\n    public BigInteger MultiplyLargeSmall() =&gt; _value1 * _value2;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>MultiplyLargeSmall<br \/>\n.NET 8.0<br \/>\n231.0 us<br \/>\n1.00<\/p>\n<p>MultiplyLargeSmall<br \/>\n.NET 9.0<br \/>\n118.8 us<br \/>\n0.51<\/p>\n<p>Lastly, in addition to parsing, BigInteger formatting also saw some improvements. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100181\">dotnet\/runtime#100181<\/a> removed various temporary buffer allocations that were occurring as part of formatting and optimized various calculations in order to reduce overheads while formatting these values.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private BigInteger _value = BigInteger.Parse(string.Concat(Enumerable.Repeat(&#8220;1234567890&#8221;, 300)));<br \/>\n    private char[] _dest = new char[10_000];<\/p>\n<p>    [Benchmark]<br \/>\n    public bool TryFormat() =&gt; _value.TryFormat(_dest, out _);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>TryFormat<br \/>\n.NET 8.0<br \/>\n102.49 us<br \/>\n1.00<br \/>\n7456 B<br \/>\n1.00<\/p>\n<p>TryFormat<br \/>\n.NET 9.0<br \/>\n94.52 us<br \/>\n0.92<br \/>\n\u2013<br \/>\n0.00<\/p>\n<h3>TensorPrimitives<\/h3>\n<p>Numerics has been a big focus for .NET over the last several releases. A large stable of numerical operations is now exposed on every numerical type as well as on a set of generic interfaces those types implement. But sometimes you want to perform the same operation on a set of values rather than on an individual value, and for that, we have TensorPrimitives. .NET 8 introduced the TensorPrimitive type, which provides a plethora of numerical APIs, but for spans of them rather than for individual values. 
For example, float has a Cosh method:<\/p>\n<p>public static float Cosh(float x);<\/p>\n<p>which provides the <a href=\"https:\/\/mathworld.wolfram.com\/HyperbolicCosine.html\">hyperbolic cosine<\/a> of one float, and a corresponding method shows up on the IHyperbolicFunctions&lt;TSelf&gt; interface:<\/p>\n<p>static abstract TSelf Cosh(TSelf x);<\/p>\n<p>TensorPrimitives then has a corresponding method, but rather than accepting one float, it accepts a span of them, and rather than returning the results, it writes the results into a provided destination span:<\/p>\n<p>public static void Cosh(ReadOnlySpan&lt;float&gt; x, Span&lt;float&gt; destination);<\/p>\n<p>In .NET 8, TensorPrimitives provided approximately 40 such methods, and only did so for float. Now in .NET 9, this has been significantly expanded. There are now over 200 overloads on TensorPrimitives, covering most of the numerical operations that are also exposed on the generic math interfaces (and some that aren\u2019t), <em>and<\/em> they\u2019re exposed using generics, such that they can work with many more data types than just float. For example, while it maintains its float-specific overload of Cosh for backwards binary compatibility, TensorPrimitives now also sports this overload:<\/p>\n<p>public static void Cosh&lt;T&gt;(ReadOnlySpan&lt;T&gt; x, Span&lt;T&gt; destination)<br \/>\n    where T : IHyperbolicFunctions&lt;T&gt;<\/p>\n<p>such that it can be used with Half, float, double, NFloat, or any custom floating-point type you might have, as long as it implements the relevant interface.
Most of these operations are also vectorized, such that it\u2019s more than just a simple loop around the corresponding scalar function.<\/p>\n<p>\/\/ Add a &lt;PackageReference Include=&#8221;System.Numerics.Tensors&#8221; Version=&#8221;9.0.0&#8243; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics.Tensors;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private float[] _source, _destination;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        var r = new Random(42);<br \/>\n        _source = Enumerable.Range(0, 1024).Select(_ =&gt; (float)r.NextSingle()).ToArray();<br \/>\n        _destination = new float[1024];<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public void ManualLoop()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;float&gt; source = _source;<br \/>\n        Span&lt;float&gt; destination = _destination;<br \/>\n        for (int i = 0; i &lt; source.Length; i++)<br \/>\n        {<br \/>\n            destination[i] = float.Cosh(source[i]);<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public void BuiltIn()<br \/>\n    {<br \/>\n        TensorPrimitives.Cosh&lt;float&gt;(_source, _destination);<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ManualLoop<br \/>\n7,804.4 ns<br \/>\n1.00<\/p>\n<p>BuiltIn<br \/>\n621.6 ns<br \/>\n0.08<\/p>\n<p>A huge number of APIs is available, most of which see similar or better gains over the simple loop. 
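Because the new overloads are generic, the exact same call shape extends beyond float. Here is a small sketch using double (the CoshAll helper is hypothetical, and a reference to the System.Numerics.Tensors 9.0 package is assumed):

```csharp
using System;
using System.Numerics.Tensors;

public class GenericTensorDemo
{
    // The generic Cosh<T> overload shown earlier works for any T implementing
    // IHyperbolicFunctions<T>, so double (or Half, NFloat, ...) gets the same
    // treatment as float.
    public static double[] CoshAll(ReadOnlySpan<double> x)
    {
        var dest = new double[x.Length];
        TensorPrimitives.Cosh<double>(x, dest);
        return dest;
    }

    public static void Main()
    {
        double[] results = CoshAll([0.0, 1.0]);
        // cosh(0) = 1 exactly; cosh(1) ≈ 1.5430806
        Console.WriteLine(string.Join(", ", results));
    }
}
```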
Here\u2019s what\u2019s currently available in .NET 9, all as generic methods, and with multiple overloads available for most:<\/p>\n<p>Abs, Acosh, AcosPi, Acos, AddMultiply, Add, Asinh, AsinPi, Asin, Atan2Pi, Atan2, Atanh, AtanPi, Atan, BitwiseAnd, BitwiseOr, Cbrt, Ceiling, ConvertChecked, ConvertSaturating, ConvertTruncating, ConvertToHalf, ConvertToSingle, CopySign, CosPi, Cos, Cosh, CosineSimilarity, DegreesToRadians, Distance, Divide, Dot, Exp, Exp10M1, Exp10, Exp2M1, Exp2, ExpM1, Floor, FusedMultiplyAdd, HammingDistance, HammingBitDistance, Hypot, Ieee754Remainder, ILogB, IndexOfMaxMagnitude, IndexOfMax, IndexOfMinMagnitude, IndexOfMin, LeadingZeroCount, Lerp, Log2, Log2P1, LogP1, Log, Log10P1, Log10, MaxMagnitude, MaxMagnitudeNumber, Max, MaxNumber, MinMagnitude, MinMagnitudeNumber, Min, MinNumber, MultiplyAdd, MultiplyAddEstimate, Multiply, Negate, Norm, OnesComplement, PopCount, Pow, ProductOfDifferences, ProductOfSums, Product, RadiansToDegrees, ReciprocalEstimate, ReciprocalSqrtEstimate, ReciprocalSqrt, Reciprocal, RootN, RotateLeft, RotateRight, Round, ScaleB, ShiftLeft, ShiftRightArithmetic, ShiftRightLogical, Sigmoid, SinCosPi, SinCos, Sinh, SinPi, Sin, SoftMax, Sqrt, Subtract, SumOfMagnitudes, SumOfSquares, Sum, Tanh, TanPi, Tan, TrailingZeroCount, Truncate, Xor<\/p>\n<p>The possible speedups are even more pronounced on other operations and data types; for example, here is a manual implementation of hamming distance on two input byte arrays (hamming distance is the number of elements that differ between the two inputs), and an implementation using TensorPrimitives.HammingDistance&lt;byte&gt;:<\/p>\n<p>\/\/ Add a &lt;PackageReference Include=&#8221;System.Numerics.Tensors&#8221; Version=&#8221;9.0.0&#8243; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing 
System.Numerics.Tensors;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _x, _y;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        var r = new Random(42);<br \/>\n        _x = Enumerable.Range(0, 1024).Select(_ =&gt; (byte)r.Next(0, 256)).ToArray();<br \/>\n        _y = Enumerable.Range(0, 1024).Select(_ =&gt; (byte)r.Next(0, 256)).ToArray();<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public int ManualLoop()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;byte&gt; source = _x;<br \/>\n        Span&lt;byte&gt; destination = _y;<br \/>\n        int count = 0;<br \/>\n        for (int i = 0; i &lt; source.Length; i++)<br \/>\n        {<br \/>\n            if (source[i] != destination[i])<br \/>\n            {<br \/>\n                count++;<br \/>\n            }<br \/>\n        }<br \/>\n        return count;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public int BuiltIn() =&gt; TensorPrimitives.HammingDistance&lt;byte&gt;(_x, _y);<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ManualLoop<br \/>\n484.61 ns<br \/>\n1.00<\/p>\n<p>BuiltIn<br \/>\n15.76 ns<br \/>\n0.03<\/p>\n<p>A slew of PRs went into making this happen. 
The generic method surface area was added via <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/94555\">dotnet\/runtime#94555<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97192\">dotnet\/runtime#97192<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97572\">dotnet\/runtime#97572<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101435\">dotnet\/runtime#101435<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103305\">dotnet\/runtime#103305<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104651\">dotnet\/runtime#104651<\/a>. And then many more PRs added or improved vectorization, including <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97361\">dotnet\/runtime#97361<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97623\">dotnet\/runtime#97623<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97682\">dotnet\/runtime#97682<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98281\">dotnet\/runtime#98281<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97835\">dotnet\/runtime#97835<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97846\">dotnet\/runtime#97846<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97874\">dotnet\/runtime#97874<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97999\">dotnet\/runtime#97999<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98877\">dotnet\/runtime#98877<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103214\">dotnet\/runtime#103214<\/a> from <a href=\"https:\/\/github.com\/neon-sunset\">@neon-sunset<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103820\">dotnet\/runtime#103820<\/a>.<\/p>\n<p>As part of all of this work, there was also a recognition that we had the scalar operations and we had the operations on an unbounded number of elements as part of spans, but doing the latter efficiently required effectively having the same set 
of operations on the various Vector128&lt;T&gt;, Vector256&lt;T&gt;, and Vector512&lt;T&gt; types, since the typical structure of one of these operations will process vectors of elements at a time. As such, progress has been made towards also exposing the same set of operations on these vector types. That\u2019s been done in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104848\">dotnet\/runtime#104848<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102181\">dotnet\/runtime#102181<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103837\">dotnet\/runtime#103837<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97114\">dotnet\/runtime#97114<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96455\">dotnet\/runtime#96455<\/a>. More to come.<\/p>\n<p>Other related numerical types have also seen improvements. Quaternion multiplication was vectorized in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96624\">dotnet\/runtime#96624<\/a> by <a href=\"https:\/\/github.com\/TJHeuvel\">@TJHeuvel<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103527\">dotnet\/runtime#103527<\/a> accelerated a variety of operations on Quaternion, Plane, Vector2, Vector3, Vector4, Matrix4x4, and Matrix3x2.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Quaternion _value1 = Quaternion.CreateFromYawPitchRoll(0.5f, 0.3f, 0.2f);<br \/>\n    private Quaternion _value2 = Quaternion.CreateFromYawPitchRoll(0.1f, 0.2f, 0.3f);<\/p>\n<p>    [Benchmark]<br \/>\n    public Quaternion Multiply() =&gt;
_value1 * _value2;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Multiply<br \/>\n.NET 8.0<br \/>\n3.064 ns<br \/>\n1.00<\/p>\n<p>Multiply<br \/>\n.NET 9.0<br \/>\n1.086 ns<br \/>\n0.35<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102301\">dotnet\/runtime#102301<\/a> also moves a lot of the implementation for types like Quaternion out of the JIT \/ native code into C#, something that\u2019s only possible now because of many of the other improvements discussed elsewhere in this post.<\/p>\n<h2>Strings, Arrays, Spans<\/h2>\n<h3>IndexOf<\/h3>\n<p>As previously noted in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/\">Performance Improvements in .NET 8<\/a> and earlier in this post, my single favorite performance improvement in .NET 8 came from enabling dynamic PGO. But my second favorite improvement came from the introduction of SearchValues&lt;T&gt;. SearchValues&lt;T&gt; enables optimizing searches, by pre-computing an algorithm to use when searching for a specific set of values (or for anything other than those specific values) and storing that information for later repeated use. Internally, .NET 8 included upwards of 15 different implementations that might be chosen based on the nature of the supplied data. The type was so good at what it did that it was used in over 60 places as part of the .NET 8 release. In .NET 9, it\u2019s used even more, and it gets even better, in a multitude of ways.<\/p>\n<p>The SearchValues&lt;T&gt; type is generic, so in theory it can be used for any T, but in practice, the algorithms involved need to special-case the nature of the data, and so the SearchValues.Create factory methods only enabled creating SearchValues&lt;byte&gt; and SearchValues&lt;char&gt; instances for which dedicated implementations were provided. 
For example, many of the previously noted uses of SearchValues&lt;T&gt; are searching for a subset of ASCII, such as this use from Regex.Escape, which enables quickly searching for all characters that require escaping:<\/p>\n<p>private static readonly SearchValues&lt;char&gt; s_metachars = SearchValues.Create(&#8220;tnfr #$()*+.?[\\^{|&#8221;);<\/p>\n<p>If you print out the name of the type of the instance returned by that Create call, as an implementation detail today you\u2019ll see something like this:<\/p>\n<p>System.Buffers.AsciiCharSearchValues`1[System.Buffers.IndexOfAnyAsciiSearcher+Default]<\/p>\n<p>That type provides a specialization of SearchValues&lt;char&gt; optimized for searching for any ASCII subset, doing so with an implementation based on the \u201cUniversal algorithm\u201d described at <a href=\"http:\/\/0x80.pl\/articles\/simd-byte-lookup.html#universal-algorithm\">http:\/\/0x80.pl\/articles\/simd-byte-lookup.html<\/a>. Essentially, the algorithm maintains an 8 by 16 bitmap, which not coincidentally is the size of ASCII (0 through 127). Each of the 128 bits in the bitmap represents whether the corresponding ASCII value is in the set. The input chars are mapped down to bytes in a way where chars greater than 127 are mapped to a value meaning no match. The lower nibble (4 bits) of the ASCII value is used to select one of the 16 bitmap rows, and the upper nibble is used to select one of the 8 bitmap columns. And the beauty of this algorithm is, on most supported platforms, there exist SIMD instructions that enable the processing of many characters concurrently as part of just a few instructions.<\/p>\n<p>So, in .NET 8, SearchValues&lt;T&gt; was only for byte and char. 
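To make the bitmap scheme just described concrete, here is a scalar sketch of it. This is illustrative only, with hypothetical helper names, and the real implementation performs the equivalent lookup on whole vectors of characters at once using SIMD shuffle instructions:

```csharp
using System;

public class AsciiBitmapDemo
{
    // 128 bits arranged as 16 rows x 8 columns: the low nibble of the ASCII value
    // selects the row (one of 16 bytes), and the high nibble (0..7 for ASCII)
    // selects the column bit within that byte.
    public static byte[] BuildBitmap(ReadOnlySpan<char> set)
    {
        var bitmap = new byte[16];
        foreach (char c in set)
        {
            if (c < 128)
            {
                bitmap[c & 0xF] |= (byte)(1 << (c >> 4));
            }
        }
        return bitmap;
    }

    // Scalar membership test; chars >= 128 can never match, mirroring how the
    // vectorized version maps non-ASCII input to "no match".
    public static bool Contains(byte[] bitmap, char c) =>
        c < 128 && (bitmap[c & 0xF] & (1 << (c >> 4))) != 0;
}
```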
But, now in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88394\">dotnet\/runtime#88394<\/a>,<br \/>\n<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96429\">dotnet\/runtime#96429<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96928\">dotnet\/runtime#96928<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98901\">dotnet\/runtime#98901<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98902\">dotnet\/runtime#98902<\/a>, you can also create SearchValues&lt;string&gt; instances. The string handling is different from byte and char, however. With byte, you\u2019re searching for one of a set of bytes within a span of bytes. With char, you\u2019re searching for one of a set of chars within a span of chars. But with string, SearchValues&lt;string&gt; doesn\u2019t search for one of a set of strings within a span of strings, but rather it enables searching for one of a set of strings within a span of chars. In other words, it\u2019s a multi-string search. For example, let\u2019s say you want to search some text for the ISO 8601 days of the week, and to do so in an ordinal case-insensitive manner (such that, for example, both \u201cMonday\u201d and \u201cMONDAY\u201d would match). That can now be expressed like this:<\/p>\n<p>private static readonly SearchValues&lt;string&gt; s_daysOfWeek = SearchValues.Create(<br \/>\n    [&#8220;Monday&#8221;, &#8220;Tuesday&#8221;, &#8220;Wednesday&#8221;, &#8220;Thursday&#8221;, &#8220;Friday&#8221;, &#8220;Saturday&#8221;, &#8220;Sunday&#8221;],<br \/>\n    StringComparison.OrdinalIgnoreCase);<br \/>\n&#8230;<br \/>\nReadOnlySpan&lt;char&gt; textToSearch = &#8230;;<br \/>\nint i = textToSearch.IndexOfAny(s_daysOfWeek);<\/p>\n<p>This also highlights another interesting difference from the existing byte and char support. 
For those types, SearchValues is purely an optimization: IndexOfAny overloads have long existed for searching for sets of T values within a larger collection of Ts (e.g. string.IndexOfAny(char[] anyOf) was introduced over two decades ago), and the SearchValues support simply makes those use cases faster (often <em>much<\/em> faster). In contrast, until .NET 9 there have not been any built-in methods for doing multi-string search, so this new support both adds such support and adds it in a way that is highly-efficient.<\/p>\n<p>But, let\u2019s say we did want to perform such a search, without that functionality existing in the core libraries. One approach is to simply walk through the input, position by position, comparing each of the target values at that location:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<br \/>\n    private static readonly string[] s_daysOfWeek = [&#8220;Monday&#8221;, &#8220;Tuesday&#8221;, &#8220;Wednesday&#8221;, &#8220;Thursday&#8221;, &#8220;Friday&#8221;, &#8220;Saturday&#8221;, &#8220;Sunday&#8221;];<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public bool Contains_Iterate()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        for (int i = 0; i &lt; input.Length; i++)<br \/>\n        {<br \/>\n            foreach (string dow in s_daysOfWeek)<br \/>\n            {<br \/>\n                if (input.Slice(i).StartsWith(dow, StringComparison.OrdinalIgnoreCase))<br \/>\n                
{<br \/>\n                    return true;<br \/>\n                }<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains_Iterate<br \/>\n227.526 us<br \/>\n1.000<\/p>\n<p>Classic. Functional. And slow. This is doing a fair amount of work for every single character in the input, for each looping over every day name and doing a comparison. How can we do better? First, we could try making the inner loop more efficient. Rather than iterating through the strings, we could hardcode our own switch:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Contains_Iterate_Switch()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        for (int i = 0; i &lt; input.Length; i++)<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; slice = input.Slice(i);<br \/>\n            switch ((char)(input[i] | 0x20))<br \/>\n            {<br \/>\n                case &#8216;s&#8217; when slice.StartsWith(&#8220;Sunday&#8221;, StringComparison.OrdinalIgnoreCase) || slice.StartsWith(&#8220;Saturday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;m&#8217; when slice.StartsWith(&#8220;Monday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;t&#8217; when slice.StartsWith(&#8220;Tuesday&#8221;, StringComparison.OrdinalIgnoreCase) || 
slice.StartsWith(&#8220;Thursday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;w&#8217; when slice.StartsWith(&#8220;Wednesday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;f&#8217; when slice.StartsWith(&#8220;Friday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                    return true;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>The main benefit of this is it makes the StartsWith calls much more efficient. Because each call is dedicated to a specific needle that the JIT can see, it can emit customized code to optimize that comparison (for context on my choice of language, \u201cneedle\u201d is often used when describing a thing being searched for, a reference to the proverbial \u201cneedle in a haystack,\u201d and thus \u201chaystack\u201d is used to describe the thing being searched). We\u2019re also reducing the number of cases in the switch by employing an ASCII casing trick; the upper case ASCII letters differ in numerical value from the lower case ASCII letters by a single bit, so we simply ensure that bit is set and then compare against only the lower case letters.<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains_Iterate<br \/>\n227.526 us<br \/>\n1.000<\/p>\n<p>Contains_Iterate_Switch<br \/>\n13.885 us<br \/>\n0.061<\/p>\n<p>Much better, more than a 16x improvement. 
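<p>That ASCII casing trick is worth seeing in isolation: upper- and lower-case ASCII letters differ only in the 0x20 bit (\u2018M\u2019 is 0x4D, \u2018m\u2019 is 0x6D), so OR\u2019ing that bit in folds both cases to lower case with a single operation. A minimal sketch (in Python purely for illustration; the C# above does the same thing with (char)(input[i] | 0x20)):<\/p>

```python
# Upper- and lower-case ASCII letters differ only in the 0x20 bit,
# so OR'ing that bit in folds both cases to lower case in one operation.
def fold_case(c: str) -> str:
    return chr(ord(c) | 0x20)

assert fold_case('M') == 'm' and fold_case('m') == 'm'
assert all(fold_case(u) == l for u, l in zip("MONDAY", "monday"))

# Caveat: this is only meaningful for ASCII letters. For other characters it
# simply conflates any two values that differ in that single bit (e.g.
# '@' | 0x20 == '`'), which is why the switch above still validates each
# candidate position with a real StartsWith comparison.
assert fold_case('@') == '`'
```

<p>Note that the trick is a cheap filter, not a correctness mechanism: any position it flags still gets verified with a full case-insensitive comparison.<\/p>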
What if we instead just kept things simple and searched for each individual string using the already-optimized IndexOf?<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<br \/>\n    private static readonly string[] s_daysOfWeek = [&#8220;Monday&#8221;, &#8220;Tuesday&#8221;, &#8220;Wednesday&#8221;, &#8220;Thursday&#8221;, &#8220;Friday&#8221;, &#8220;Saturday&#8221;, &#8220;Sunday&#8221;];<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Contains_ContainsEachNeedle()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        foreach (string dow in s_daysOfWeek)<br \/>\n        {<br \/>\n            if (input.Contains(dow, StringComparison.OrdinalIgnoreCase))<br \/>\n            {<br \/>\n                return true;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>Nice and simple, but\u2026<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains_Iterate<br \/>\n227.526 us<br \/>\n1.000<\/p>\n<p>Contains_Iterate_Switch<br \/>\n13.885 us<br \/>\n0.061<\/p>\n<p>Contains_ContainsEachNeedle<br \/>\n302.330 us<br \/>\n1.329<\/p>\n<p>Ouch. On the positive side, this approach benefits from vectorization, as the Contains operation itself is vectorized to efficiently check multiple locations at once using SIMD. Unfortunately, this case is heavily impacted by the order in which we perform the search. 
As it turns out, most of the days of the week show up in the input text (in this case, \u201cWar and Peace\u201d), but at very different positions, and Monday doesn\u2019t show up at all. This:<\/p>\n<p>using var hc = new HttpClient();<br \/>\nvar s = await hc.GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;);<br \/>\nConsole.WriteLine($&#8221;Length:    {s.Length}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Monday:    {s.IndexOf(&#8220;Monday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Tuesday:   {s.IndexOf(&#8220;Tuesday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Wednesday: {s.IndexOf(&#8220;Wednesday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Thursday:  {s.IndexOf(&#8220;Thursday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Friday:    {s.IndexOf(&#8220;Friday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Saturday:  {s.IndexOf(&#8220;Saturday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<br \/>\nConsole.WriteLine($&#8221;Sunday:    {s.IndexOf(&#8220;Sunday&#8221;, StringComparison.OrdinalIgnoreCase)}&#8221;);<\/p>\n<p>yields this:<\/p>\n<p>Length:    3293614<br \/>\nMonday:    -1<br \/>\nTuesday:   971396<br \/>\nWednesday: 10652<br \/>\nThursday:  107470<br \/>\nFriday:    640801<br \/>\nSaturday:  1529549<br \/>\nSunday:    891753<\/p>\n<p>That means that whereas Contains_Iterate_Switch only needs to examine 10,652 positions (the position of the first \u201cWednesday\u201d) before it finds a match, Contains_ContainsEachNeedle needs to examine 3,293,614 (no match found for \u201cMonday\u201d so it\u2019ll look at everything) + 971,396 (the index of \u201cTuesday\u201d) == 4,265,010 positions before it finds a match. That\u2019s 400x as many positions to be examined as the iterative approach. 
Even the SIMD vectorization gains can\u2019t match that gap in amount of work to be performed.<\/p>\n<p>Ok, so what if we changed approach, and instead searched for the first letter in each word in order to quickly skip past the locations that couldn\u2019t possibly match. We could even use SearchValues&lt;char&gt; to perform that search.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<br \/>\n    private static readonly SearchValues&lt;char&gt; s_daysOfWeekFCSV = SearchValues.Create([&#8216;S&#8217;, &#8216;s&#8217;, &#8216;M&#8217;, &#8216;m&#8217;, &#8216;T&#8217;, &#8216;t&#8217;, &#8216;W&#8217;, &#8216;w&#8217;, &#8216;F&#8217;, &#8216;f&#8217;]);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Contains_IndexOfAnyFirstChars_SearchValues()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        int i;<br \/>\n        while ((i = input.IndexOfAny(s_daysOfWeekFCSV)) &gt;= 0)<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; slice = input.Slice(i);<br \/>\n            switch ((char)(input[i] | 0x20))<br \/>\n            {<br \/>\n                case &#8216;s&#8217; when slice.StartsWith(&#8220;Sunday&#8221;, StringComparison.OrdinalIgnoreCase) || slice.StartsWith(&#8220;Saturday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;m&#8217; when slice.StartsWith(&#8220;Monday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case 
&#8216;t&#8217; when slice.StartsWith(&#8220;Tuesday&#8221;, StringComparison.OrdinalIgnoreCase) || slice.StartsWith(&#8220;Thursday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;w&#8217; when slice.StartsWith(&#8220;Wednesday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                case &#8216;f&#8217; when slice.StartsWith(&#8220;Friday&#8221;, StringComparison.OrdinalIgnoreCase):<br \/>\n                    return true;<br \/>\n            }<\/p>\n<p>            input = input.Slice(i + 1);<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>In some situations, this is a very viable strategy; in fact, it\u2019s a technique often employed by Regex. In other situations, it\u2019s less appropriate. The potential problem is that letters like \u2018s\u2019 and \u2018t\u2019 are incredibly common. The characters here (\u2018s\u2019, \u2018m\u2019, \u2018t\u2019, \u2018w\u2019, and \u2018f\u2019), both upper- and lower-case variants, make up ~17% of the input text (in contrast to just the capital subset, which makes up only ~0.54%). That means that, on average, this IndexOfAny call needs to break out of its inner vectorized processing loop every six characters, which decreases the possible efficiency gains from said vectorization. 
Even so, this is still our best so far:<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains_Iterate<br \/>\n227.526 us<br \/>\n1.000<\/p>\n<p>Contains_Iterate_Switch<br \/>\n13.885 us<br \/>\n0.061<\/p>\n<p>Contains_ContainsEachNeedle<br \/>\n302.330 us<br \/>\n1.329<\/p>\n<p>Contains_IndexOfAnyFirstChars_SearchValues<br \/>\n7.151 us<br \/>\n0.031<\/p>\n<p>Now, let\u2019s try with SearchValues&lt;string&gt;:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<br \/>\n    private static readonly SearchValues&lt;string&gt; s_daysOfWeekSV = SearchValues.Create(<br \/>\n        [&#8220;Monday&#8221;, &#8220;Tuesday&#8221;, &#8220;Wednesday&#8221;, &#8220;Thursday&#8221;, &#8220;Friday&#8221;, &#8220;Saturday&#8221;, &#8220;Sunday&#8221;],<br \/>\n        StringComparison.OrdinalIgnoreCase);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Contains_StringSearchValues() =&gt;<br \/>\n        s_input.AsSpan().ContainsAny(s_daysOfWeekSV);<br \/>\n}<\/p>\n<p>The functionality is built-in, so we haven\u2019t had to write any custom logic other than the call to ContainsAny. 
And the results:<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains_Iterate<br \/>\n227.526 us<br \/>\n1.000<\/p>\n<p>Contains_Iterate_Switch<br \/>\n13.885 us<br \/>\n0.061<\/p>\n<p>Contains_ContainsEachNeedle<br \/>\n302.330 us<br \/>\n1.329<\/p>\n<p>Contains_IndexOfAnyFirstChars_SearchValues<br \/>\n7.151 us<br \/>\n0.031<\/p>\n<p>Contains_StringSearchValues<br \/>\n2.153 us<br \/>\n0.009<\/p>\n<p>Not only simpler, then, but also several times faster than the fastest result we\u2019d previously managed, and ~105x faster than our original attempt. Sweet!<\/p>\n<p>How does this all work? The algorithms behind it are quite fascinating. As with byte and char, there are multiple concrete implementations that might be employed, selected based on the exact needle values passed to Create. The simplest implementations are those for handling degenerate cases, like zero inputs (in which case all of the methods can just return hard-coded \u201cnot found\u201d results). There\u2019s also a dedicated implementation for a single input, in which case it can perform the same search as IndexOf(needle) would have done, but lifting out the choice of characters within the needle for which to perform a vectorized search. IndexOf(string) chooses a couple of characters from the needle (typically just the first and last character in the needle), creates a vector for each of those, and then with appropriate offsets based on the distance between the chosen characters, iterates through the input, comparing against those vectors, and doing a full string comparison only if both vectors match at a particular location. SearchValues&lt;string&gt; does the same thing (in an internal implementation today called SingleStringSearchValuesThreeChars), except it uses three instead of two characters, and it employs frequency analysis to choose those characters rather than simply picking the first and last, trying to use characters that are less likely to appear in general (e.g. 
given the string \u201camazing\u201d, it\u2019d likely pick the \u2018m\u2019, \u2018z\u2019, and \u2018g\u2019, as it deems those statistically less likely in average inputs than \u2018a\u2019, \u2018i\u2019, or \u2018n\u2019). It can afford to spend more time choosing these characters, given that the computation is performed only once and then cached for all subsequent searches. We\u2019ll refer back to this in a bit.<\/p>\n<p>Beyond those special cases, it starts to get really interesting. There\u2019s been a lot of research done over the last 50 years into the most efficient ways to perform a multi-string search. One popular algorithm is Rabin-Karp, which was created by Richard Karp and Michael Rabin in the 1980s, and which works via a \u201crolling hash.\u201d Imagine creating a hash of the first N characters in the haystack (input) text, where N is the length of the needle (the substring) for which you\u2019re searching, and comparing the haystack hash against the needle hash; if they match, do the actual full comparison at that location, otherwise continue. Then update the hash by removing the first character and adding the next character, and repeat the check. And then repeat, and repeat, and so on. Each time you move forward, you\u2019re just updating the hash via a fixed number of operations, meaning that all of the updates to the hash function for the whole operation are only O(Haystack). Best case, you only find a single location that the substring could match, and you\u2019ve got O(Haystack + Needle) algorithmic complexity. Worst case (but generally unlikely), every location is a possible match, and you\u2019ve got O(Haystack * Needle) algorithmic complexity.
A simple implementation might look like this (for pedagogical purposes, this uses a terrible hash function that just sums the character\u2019s numerical values; the real algorithm recommends a better one):<\/p>\n<p>private static bool RabinKarpContains(ReadOnlySpan&lt;char&gt; haystack, ReadOnlySpan&lt;char&gt; needle)<br \/>\n{<br \/>\n    if (haystack.Length &gt;= needle.Length)<br \/>\n    {<br \/>\n        \/\/ Hash the needle and the first needle.Length chars of the haystack.<br \/>\n        \/\/ Super simple hash for pedagogical purposes: just sum the chars.<br \/>\n        int i, rollingHash = 0, needleHash = 0;<br \/>\n        for (i = 0; i &lt; needle.Length; i++)<br \/>\n        {<br \/>\n            rollingHash += haystack[i];<br \/>\n            needleHash += needle[i];<br \/>\n        }<\/p>\n<p>        while (true)<br \/>\n        {<br \/>\n            \/\/ If the hashes match, compare the strings.<br \/>\n            if (needleHash == rollingHash &amp;&amp; haystack.Slice(i &#8211; needle.Length).StartsWith(needle))<br \/>\n            {<br \/>\n                return true;<br \/>\n            }<\/p>\n<p>            \/\/ If we&#8217;ve reached the end of the haystack, break.<br \/>\n            if (i == haystack.Length)<br \/>\n            {<br \/>\n                break;<br \/>\n            }<\/p>\n<p>            \/\/ Update the rolling hash.<br \/>\n            rollingHash += haystack[i] &#8211; haystack[i &#8211; needle.Length];<br \/>\n            i++;<br \/>\n        }<br \/>\n    }<\/p>\n<p>    return needle.IsEmpty;<br \/>\n}<\/p>\n<p>This supports one needle, but extending to support multiple needles can be accomplished in a variety of ways, such as by bucketing needles by their hash codes (ala what a hash map does), and then either checking all needles in the corresponding bucket when there\u2019s a hit, or further reduction in what needs to be checked based on using a Bloom filter or similar technique. 
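<p>To make that bucketing idea concrete, here\u2019s an illustrative multi-needle variant (sketched in Python rather than C#, and again using a pedagogically terrible sum-of-chars hash): every needle is hashed over a window equal to the shortest needle\u2019s length, needles sharing a window hash land in the same bucket, and only a hash hit triggers real string comparisons.<\/p>

```python
def rabin_karp_multi_contains(haystack: str, needles: list[str]) -> bool:
    """Illustrative multi-needle Rabin-Karp with hash buckets.

    All needles are hashed over a window equal to the shortest needle's
    length; the deliberately weak hash just sums character values.
    """
    if not needles:
        return False
    n = min(len(nd) for nd in needles)
    if n == 0 or len(haystack) < n:
        return n == 0

    # Bucket needles by the hash of their first n characters (a hash map of
    # buckets; collisions simply mean a bucket holds multiple needles).
    buckets: dict[int, list[str]] = {}
    for nd in needles:
        buckets.setdefault(sum(map(ord, nd[:n])), []).append(nd)

    rolling = sum(map(ord, haystack[:n]))
    for i in range(len(haystack) - n + 1):
        # Only on a hash hit do we pay for full string comparisons.
        for nd in buckets.get(rolling, ()):
            if haystack.startswith(nd, i):
                return True
        if i + n < len(haystack):
            # Slide the window: drop the outgoing char, add the incoming one.
            rolling += ord(haystack[i + n]) - ord(haystack[i])
    return False
```

<p>With a decent hash, most positions miss every bucket and cost only the constant-time hash update; the verification step keeps hash collisions (and needles longer than the window) correct.<\/p>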
SearchValues&lt;string&gt; will utilize Rabin-Karp, but only for very short inputs, as for longer inputs there are more efficient approaches.<\/p>\n<p>Another popular algorithm is Aho-Corasick, which was designed by Alfred Aho and Margaret Corasick even earlier, in the 1970s. Its primary purpose is multi-string search, enabling a match to be found in time linear in the length of the input, assuming a fixed set of needles. It works by building up a form of a trie, a finite automaton where you start at the root of the graph and transition to children based on matching the character associated with the edge to that child. But, it extends a typical trie with additional edges between nodes that can be used as fallbacks. For example, consider the automaton for the days of the week discussed earlier.<\/p>\n<p>Given the input text \u201cwednesunday\u201d, this will start at the root, progress through the \u201cw\u201d, \u201cwe\u201d, \u201cwed\u201d, \u201cwedn\u201d, \u201cwedne\u201d, and \u201cwednes\u201d nodes, but then upon encountering the subsequent \u2018u\u2019 and not being able to progress down that path, it\u2019ll employ the fallback link over to the \u201cs\u201d node, at which point it\u2019ll be able to traverse down through \u201cs\u201d, \u201csu\u201d, and so on, until it hits the leaf \u201csunday\u201d node and can declare success. Aho-Corasick efficiently supports larger strings, and is the go-to implementation SearchValues&lt;string&gt; uses as a general fallback. However, in many situations, it can do even better\u2026<\/p>\n<p>The real workhorse of SearchValues&lt;string&gt; that\u2019s chosen whenever possible is a vectorized implementation of the \u201cTeddy\u201d algorithm. This algorithm originated in Intel\u2019s <a href=\"https:\/\/github.com\/intel\/hyperscan\">Hyperscan<\/a> library, was later adopted by the Rust aho_corasick crate, and is now employed as part of SearchValues&lt;string&gt; in .NET 9.
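<p>To make the Aho-Corasick traversal described above concrete, here\u2019s a compact illustrative implementation (in Python, case-sensitive for brevity; case-insensitivity could be layered on by case-folding, and the real SearchValues machinery is of course far more elaborate). It builds the trie, computes the fallback links with a breadth-first pass, and then scans the haystack in a single pass, following fallback links on mismatches instead of restarting.<\/p>

```python
from collections import deque

def build_automaton(needles: list[str]):
    """Build a trie over the needles plus Aho-Corasick fallback links."""
    goto = [{}]        # goto[node][char] -> child node index
    fail = [0]         # fail[node] -> node to fall back to on a mismatch
    out = [False]      # out[node] -> True if a needle ends at (or via fallback of) node

    # Build the plain trie.
    for needle in needles:
        node = 0
        for ch in needle:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append(False)
            node = goto[node][ch]
        out[node] = True

    # Breadth-first pass: a node's fallback is the longest proper suffix of
    # its path that is also a path in the trie (e.g. "wednes" -> "s").
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A needle ending at the fallback also ends here.
            out[child] = out[child] or out[fail[child]]
    return goto, fail, out

def contains_any(haystack: str, needles: list[str]) -> bool:
    """Single pass over the haystack; fallbacks avoid ever rescanning input."""
    goto, fail, out = build_automaton(needles)
    node = 0
    for ch in haystack:
        while node and ch not in goto[node]:
            node = fail[node]        # e.g. "wednes" + 'u' falls back to "s"
        node = goto[node].get(ch, 0)
        if out[node]:
            return True
    return False
```

<p>Running contains_any("wednesunday", [...the lower-cased day names...]) follows exactly the \u201cwednes\u201d-to-\u201cs\u201d fallback hop described above. Teddy, just mentioned, replaces this character-at-a-time state stepping with vectorized candidate detection.<\/p>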
It is super cool, and super efficient.<\/p>\n<p>Earlier, I gave a rough summary of how the SingleStringSearchValuesThreeChars and IndexOfAnyAsciiSearcher implementations work. SingleStringSearchValuesThreeChars optimizes finding likely positions where a substring might start, reducing the number of false positives by checking for multiple contained characters, and then likely positions are validated by doing the full string comparison at that location. And IndexOfAnyAsciiSearcher optimizes finding the next position of any character in a large-ish set. You can think of Teddy as a combination of those. There\u2019s a really nice description of the algorithm <a href=\"https:\/\/github.com\/dotnet\/runtime\/blob\/122d97b4674681745a0c335ace2c5231b1da7a96\/src\/libraries\/System.Private.CoreLib\/src\/System\/SearchValues\/Strings\/AsciiStringSearchValuesTeddyBase.cs#L17-L94\">in the source<\/a>, so I won\u2019t go into much detail here. In summary, though, it maintains a similar bitmap as with IndexOfAnyAsciiSearcher, but instead of a single bit per ASCII character, it maintains an 8-bit bitmap for each nibble, and instead of just one bitmap, it maintains two or three, each of which corresponds to a location in the substrings (e.g. one bitmap for the 0th character and one bitmap for the 1st character). Those 8 bits in the bitmap are used to indicate which of up to 8 needles contain that nibble at that location. If there are fewer than 8 needles being searched for, then each of these bits individually identifies one of them, and if there are more than 8 needles, just as with Rabin-Karp, we can create buckets of the needle substrings, with a bit in the bitmap referring to one of the buckets. If the comparisons against the bitmaps indicates a likely match, the full match is performed against the relevant needle (or needles, in the case of matching a bucket). 
And as with IndexOfAnyAsciiSearcher, all of this support employs SIMD instructions to perform the lookups on chunks of input text from between 16 and 64 characters at a time, yielding significant speedups.<\/p>\n<p>SearchValues&lt;string&gt; is great for larger numbers of strings, but it\u2019s relevant even for just a few. Consider, for example, this code from MSBuild that\u2019s part of parsing build output looking for warnings and errors:<\/p>\n<p>if (message.IndexOf(&#8220;warning&#8221;, StringComparison.OrdinalIgnoreCase) == -1 &amp;&amp;<br \/>\n    message.IndexOf(&#8220;error&#8221;, StringComparison.OrdinalIgnoreCase) == -1)<br \/>\n{<br \/>\n    return null;<br \/>\n}<\/p>\n<p>Rather than doing two individual searches, we can perform a single search:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<br \/>\n    private static readonly SearchValues&lt;string&gt; s_warningError = SearchValues.Create([&#8220;warning&#8221;, &#8220;error&#8221;], StringComparison.OrdinalIgnoreCase);<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public bool TwoContains() =&gt;<br \/>\n        s_input.Contains(&#8220;warning&#8221;, StringComparison.OrdinalIgnoreCase) ||<br \/>\n        s_input.Contains(&#8220;error&#8221;, StringComparison.OrdinalIgnoreCase);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool ContainsAny() =&gt;<br \/>\n        s_input.AsSpan().ContainsAny(s_warningError);<br \/>\n}<\/p>\n<p>This is searching \u201cWar 
and Peace\u201d for both \u201cwarning\u201d or \u201cerror\u201d, but even though both appear in the text, such that the second search for \u201cerror\u201d in the original code will never happen, the SearchValues&lt;string&gt; search ends up being faster because \u201cerror\u201d appears much earlier in the text than does \u201cwarning\u201d.<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>TwoContains<br \/>\n70.03 us<br \/>\n1.00<\/p>\n<p>ContainsAny<br \/>\n14.05 us<br \/>\n0.20<\/p>\n<p>Beyond SearchValues&lt;string&gt;, the existing SearchValues&lt;byte&gt; and SearchValues&lt;char&gt; support also gets a variety of boosts in .NET 9. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96588\">dotnet\/runtime#96588<\/a>, for example, makes some common SearchValues&lt;char&gt; searches faster, specifically when there are 2 or 4 characters being searched for that represent 1 or 2 ASCII case-insensitive characters, such as [&#8216;A&#8217;, &#8216;a&#8217;] or [&#8216;A&#8217;, &#8216;a&#8217;, &#8216;B&#8217;, &#8216;b&#8217;]. In .NET 8, for [&#8216;A&#8217;, &#8216;a&#8217;], for example, SearchValues.Create will end up picking an implementation that will create a vector for each of &#8216;A&#8217; and &#8216;a&#8217;, and then in the inner loop of the search, it\u2019ll compare each vector against the input haystack text. This PR teaches it to do a similar ASCII trick we discussed earlier: rather than having two separate vectors, it can have a single vector for &#8216;a&#8217;, and then do a single comparison against the input vector that\u2019s been OR\u2019d with 0x20, such that any &#8216;A&#8217;s become &#8216;a&#8217;s. The OR plus a single comparison is cheaper than the two comparisons plus the OR of the resulting comparisons. 
Funnily enough, this needn\u2019t even be about casing: since all we\u2019re doing is OR\u2019ing in 0x20, it applies to any two characters that differ by that same single bit.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/100\/pg100.txt&#8221;).Result;<br \/>\n    private static readonly SearchValues&lt;char&gt; s_symbols = SearchValues.Create(&#8220;@`&#8221;);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool ContainsAny() =&gt; s_input.AsSpan().ContainsAny(s_symbols);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ContainsAny<br \/>\n.NET 8.0<br \/>\n262.7 us<br \/>\n1.02<\/p>\n<p>ContainsAny<br \/>\n.NET 9.0<br \/>\n232.3 us<br \/>\n0.90<\/p>\n<p>The same thing applies with four characters: instead of doing four vector comparisons and three OR operations to combine them, we can do a single OR on the input to mix in 0x20, two vector comparisons, and a single OR to combine those results. In fact, the four-vector approach was already more expensive than the IndexOfAnyAsciiSearcher implementation previously described, and since that supports any number of ASCII characters, when applicable even for just four-character needles, SearchValues.Create would have preferred that. 
But now in .NET 9 with this optimization, SearchValues.Create will prefer to use this specialized two-comparison path.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/100\/pg100.txt&#8221;).Result;<br \/>\n    private static readonly SearchValues&lt;char&gt; s_symbols = SearchValues.Create(&#8220;@`^~&#8221;);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool ContainsAny() =&gt; s_input.AsSpan().ContainsAny(s_symbols);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ContainsAny<br \/>\n.NET 8.0<br \/>\n247.5 us<br \/>\n1.01<\/p>\n<p>ContainsAny<br \/>\n.NET 9.0<br \/>\n196.2 us<br \/>\n0.80<\/p>\n<p>Other SearchValues implementations also improve in .NET 9, notably the \u201cProbabilisticMap\u201d implementations. These implementations are preferred by SearchValues&lt;char&gt; as a fallback when the faster vectorized implementations aren\u2019t applicable but when the number of characters in the needle isn\u2019t exorbitant (the current limit is 256). It works via a form of Bloom filter. Effectively, it maintains a 256-bit bitmap, with needle characters mapping to one or two bits, depending on the char. If any of the bits for a given char isn\u2019t 1, then the char is definitively not in the set. If all of the bits for a given char are 1, then the char <em>may<\/em> be in the set, and a more expensive check needs to be performed to determine inclusion. 
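<p>The filtering idea can be sketched as follows (Python for illustration; mapping each char to bits via its low and high bytes is an assumption made here for demonstration, not the actual bit-selection scheme the .NET implementation uses):<\/p>

```python
class ProbabilisticCharMap:
    """Illustrative Bloom-style filter over a 256-bit bitmap.

    Each char maps to one or two bit positions (here, chosen from its low
    and high bytes; the real .NET mapping is an internal detail). A clear
    bit proves the char is NOT in the set; all-set bits only mean it MIGHT
    be, so a hit falls back to an exact membership check.
    """
    def __init__(self, needle: str):
        self._exact = set(needle)
        self._bitmap = 0
        for ch in needle:
            for bit in self._bits(ch):
                self._bitmap |= 1 << bit

    @staticmethod
    def _bits(ch: str):
        code = ord(ch)
        yield code & 0xFF             # low byte selects one of 256 bits
        if code > 0xFF:
            yield (code >> 8) & 0xFF  # high byte selects a second bit

    def contains(self, ch: str) -> bool:
        if any(not (self._bitmap >> b) & 1 for b in self._bits(ch)):
            return False              # fast path: definitively absent
        return ch in self._exact      # slow path: confirm a possible hit
```

<p>For a needle of a few Greek letters, most ASCII input fails the bitmap test immediately, while the occasional colliding character (one sharing a bit with a needle char) pays for the exact check and is then rejected.<\/p>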
Whether those bits are set is a vectorizable operation, and so as long as false positives are relatively rare (which is why there\u2019s a limit on the number of characters; the more characters are represented, the more false positives there are likely to be), it\u2019s an efficient means for doing the search. However, this vectorization only applies to positive cases (e.g. IndexOfAny \/ ContainsAny) but not negative cases (e.g. IndexOfAnyExcept \/ ContainsAnyExcept); for those \u201cExcept\u201d methods, the implementation still walks character by character, and the check it employed per character was O(Needle). Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101001\">dotnet\/runtime#101001<\/a>, which replaces a linear search with a \u201cperfect hash,\u201d that O(Needle) drops to O(1), making such \u201cExcept\u201d calls much more efficient.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_aristotle = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/39963\/pg39963.txt&#8221;).Result;<br \/>\n    public static readonly SearchValues&lt;char&gt; s_greekOrAsciiDigits = SearchValues.Create(<br \/>\n        Enumerable.Range(0, char.MaxValue + 1)<br \/>\n        .Where(i =&gt; Regex.IsMatch(((char)i).ToString(), @&#8221;[\p{IsGreek}0-9]&#8221;))<br \/>\n        .Select(i =&gt; (char)i)<br \/>\n        .ToArray());<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountNonGreekOrAsciiDigitsChars()<br \/>\n    {<br \/>\n        
int count = 0;<\/p>\n<p>        ReadOnlySpan&lt;char&gt; text = s_aristotle;<br \/>\n        int index;<br \/>\n        while ((index = text.IndexOfAnyExcept(s_greekOrAsciiDigits)) &gt;= 0)<br \/>\n        {<br \/>\n            count++;<br \/>\n            text = text.Slice(index + 1);<br \/>\n        }<\/p>\n<p>        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>CountNonGreekOrAsciiDigitsChars<br \/>\n.NET 8.0<br \/>\n1,814.7 us<br \/>\n1.00<\/p>\n<p>CountNonGreekOrAsciiDigitsChars<br \/>\n.NET 9.0<br \/>\n881.7 us<br \/>\n0.49<\/p>\n<p>That same PR also made another significant improvement related to the probabilistic map: not using it as much. It\u2019s a terrific implementation for some sets of inputs, but for others it can end up performing poorly. .NET 8 included a Latin1CharSearchValues, which was used when some of the needle characters were non-ASCII but all were less than 256. In such cases, if the probabilistic map couldn\u2019t vectorize, SearchValues.Create would return a Latin1CharSearchValues instance, which maintained a simple 256-bit bitmap that detailed whether each character is in the needle. For .NET 9, that type has been replaced by a more general one that supports arbitrarily large bitmaps, used when there are simply too many for the probabilistic map implementation to handle well or when the values are sufficiently dense. 
Consider a case like this:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_markTwain = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt&#8221;).Result;<br \/>\n    public static readonly SearchValues&lt;char&gt; s_greekChars = SearchValues.Create(<br \/>\n        Enumerable.Range(0, char.MaxValue + 1)<br \/>\n        .Where(i =&gt; Regex.IsMatch(((char)i).ToString(), @&#8221;[\p{IsGreek}\p{IsGreekExtended}]&#8221;))<br \/>\n        .Select(i =&gt; (char)i)<br \/>\n        .ToArray());<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountGreekChars()<br \/>\n    {<br \/>\n        int count = 0;<\/p>\n<p>        ReadOnlySpan&lt;char&gt; text = s_markTwain;<br \/>\n        int index;<br \/>\n        while ((index = text.IndexOfAny(s_greekChars)) &gt;= 0)<br \/>\n        {<br \/>\n            count++;<br \/>\n            text = text.Slice(index + 1);<br \/>\n        }<\/p>\n<p>        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>The needle here includes all of the characters in the Greek and Greek Extended Unicode blocks, approximately 400 characters. With the way the probabilistic map builds up its filter bitmap, every single bit in the bitmap ends up being set, which means every examined character will fall back to the expensive path. 
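To see why such a dense needle defeats the filter, consider a deliberately simplified stand-in for the probabilistic map (hypothetical; the real filter is more sophisticated than indexing by the low byte alone): with a few hundred contiguous characters in the needle, every bit ends up set, and a filter with every bit set can never rule anything out.

```csharp
using System;
using System.Linq;

// Simplified, hypothetical stand-in for a probabilistic filter: a 256-entry
// map indexed by each needle character's low byte. A cleared entry is a
// definite "not in the needle"; a set entry only means "maybe".
static bool[] BuildFilter(ReadOnlySpan<char> needle)
{
    var filter = new bool[256];
    foreach (char c in needle)
        filter[c & 0xFF] = true;
    return filter;
}

// A sparse needle leaves most of the filter clear, so it filters well.
Console.WriteLine(BuildFilter("abc").Count(b => b)); // prints 3

// ~400 contiguous character values (as with the Greek blocks above) cover
// every possible low byte, saturating the filter completely.
char[] dense = Enumerable.Range(0x0370, 400).Select(i => (char)i).ToArray();
Console.WriteLine(BuildFilter(dense).Count(b => b)); // prints 256
```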
Now in .NET 9, it\u2019ll use a simpler, non-probabilistic bitmap, and even though it\u2019s not vectorized, it yields significantly faster throughput.<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>CountGreekChars<br \/>\n.NET 8.0<br \/>\n126.454 ms<br \/>\n1.00<\/p>\n<p>CountGreekChars<br \/>\n.NET 9.0<br \/>\n8.956 ms<br \/>\n0.07<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96931\">dotnet\/runtime#96931<\/a> also extended this probabilistic map support to benefit from AVX512 such that when the probabilistic map implementation is used, it can be significantly faster. Previously, its implementation would utilize 128-bit or 256-bit vectors, depending on hardware support, but now in .NET 9, it can also use 512-bit vectors. Not only does this potentially double throughput due to the wider vectors, but AVX512 also includes some applicable instructions that the older instruction sets don\u2019t have (e.g. VPERMB, which is exposed as Avx512Vbmi.PermuteVar64x8), enabling even faster processing where those more sophisticated instructions are relevant. 
This ends up being particularly impactful when searching for a reasonably small number of non-ASCII characters.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_aristotle = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/39963\/pg39963.txt&#8221;).Result;<br \/>\n    public static readonly SearchValues&lt;char&gt; s_setSymbols = SearchValues.Create(&#8220;\u2282\u2283\u2286\u2287\u2284\u2229\u222a\u2208\u220a\u2209\u220b\u220d\u220c\u2205&#8221;);<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count()<br \/>\n    {<br \/>\n        int count = 0;<\/p>\n<p>        ReadOnlySpan&lt;char&gt; text = s_aristotle;<br \/>\n        int index;<br \/>\n        while ((index = text.IndexOfAny(s_setSymbols)) &gt;= 0)<br \/>\n        {<br \/>\n            count++;<br \/>\n            text = text.Slice(index + 1);<br \/>\n        }<\/p>\n<p>        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n28.35 us<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n13.19 us<br \/>\n0.47<\/p>\n<p>Further, while the probabilistic map implementations were previously vectorized for IndexOfAny (and therefore implicitly for ContainsAny), they weren\u2019t for LastIndexOfAny, which means just changing whether you were searching from start to end or from end to start could have a significant impact on throughput. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102331\">dotnet\/runtime#102331<\/a> improves that as well, enabling the LastIndexOfAny path to also take advantage of SIMD.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_markTwain = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt&#8221;).Result;<br \/>\n    public static readonly SearchValues&lt;char&gt; s_accentedChars = SearchValues.Create(<br \/>\n        [&#8216;\u00c0&#8217;, &#8216;\u00c8&#8217;, &#8216;\u00cc&#8217;, &#8216;\u00d2&#8217;, &#8216;\u00d9&#8217;, &#8216;\u00c1&#8217;, &#8216;\u00c9&#8217;, &#8216;\u00cd&#8217;, &#8216;\u00d3&#8217;, &#8216;\u00da&#8217;,<br \/>\n         &#8216;\u00c2&#8217;, &#8216;\u00ca&#8217;, &#8216;\u00ce&#8217;, &#8216;\u00d4&#8217;, &#8216;\u00db&#8217;, &#8216;\u00c3&#8217;, &#8216;\u1ebc&#8217;, &#8216;\u0128&#8217;, &#8216;\u00d5&#8217;, &#8216;\u0168&#8217;,<br \/>\n         &#8216;\u00c4&#8217;, &#8216;\u00cb&#8217;, &#8216;\u00cf&#8217;, &#8216;\u00d6&#8217;, &#8216;\u00dc&#8217;, &#8216;\u0178&#8217;]);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool HasAnyAccented_IndexOfAny() =&gt; s_markTwain.AsSpan().IndexOfAny(s_accentedChars) &gt;= 0;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool HasAnyAccented_LastIndexOfAny() =&gt; s_markTwain.AsSpan().LastIndexOfAny(s_accentedChars) &gt;= 0;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>HasAnyAccented_IndexOfAny<br \/>\n.NET 8.0<br \/>\n7.910 ms<br 
\/>\n1.00<\/p>\n<p>HasAnyAccented_IndexOfAny<br \/>\n.NET 9.0<br \/>\n4.476 ms<br \/>\n0.57<\/p>\n<p>HasAnyAccented_LastIndexOfAny<br \/>\n.NET 8.0<br \/>\n17.491 ms<br \/>\n1.00<\/p>\n<p>HasAnyAccented_LastIndexOfAny<br \/>\n.NET 9.0<br \/>\n5.253 ms<br \/>\n0.30<\/p>\n<p>In many of my examples, I\u2019ve used ContainsAny rather than IndexOfAny. The former is functionally equivalent to input.IndexOfAny(searchValues) &gt;= 0, and in fact that was the entirety of the implementation in .NET 8. However, as IndexOfAny employs vectorization and is comparing multiple elements as part of the same instruction, when a match is found, there is a bit of overhead involved to then determine exactly which element matched (or if multiple matched, which match had the lowest index). ContainsAny doesn\u2019t actually need to care about the exact index: as exemplified by its implementation, it only cares about whether there was a match rather than where one was. As such, we can shave off some cycles by customizing the implementation for ContainsAny to avoid that unnecessary computation, and that\u2019s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96924\">dotnet\/runtime#96924<\/a> does. The effects of this are most notable where that overhead would be measurable, which is when a match is found really early in the input.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _haystack = &#8220;Hello, world! 
How are you today?&#8221;;<br \/>\n    private static readonly SearchValues&lt;char&gt; s_vowels = SearchValues.Create(&#8220;aeiou&#8221;);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool ContainsAny() =&gt; _haystack.AsSpan().ContainsAny(s_vowels);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>ContainsAny<br \/>\n.NET 8.0<br \/>\n3.640 ns<br \/>\n1.00<\/p>\n<p>ContainsAny<br \/>\n.NET 9.0<br \/>\n2.382 ns<br \/>\n0.65<\/p>\n<p>Improvements around SearchValues aren\u2019t limited just to new APIs or the implementation of the APIs; there\u2019s also been work to help developers better consume SearchValues. <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6898\">dotnet\/roslyn-analyzers#6898<\/a> and <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/7252\">dotnet\/roslyn-analyzers#7252<\/a> added a new analyzer (<a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1870\">CA1870<\/a>) that will find opportunities to use SearchValues and automatically fix the call sites to do so.<\/p>\n\n<p>It\u2019s also worth highlighting that there have been improvements around IndexOf \/ Contains in .NET 9 besides with SearchValues. One simple but interesting change is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97632\">dotnet\/runtime#97632<\/a>. 
This simply added an if block to string.Contains(string):<\/p>\n<p>public bool Contains(string value)<br \/>\n{<br \/>\n    if (value == null)<br \/>\n        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);<\/p>\n<p>    \/\/ PR added this if block<br \/>\n    if (RuntimeHelpers.IsKnownConstant(value) &amp;&amp; value.Length == 1)<br \/>\n        return Contains(value[0]);<\/p>\n<p>    return SpanHelpers.IndexOf(<br \/>\n        ref _firstChar,<br \/>\n        Length,<br \/>\n        ref value._firstChar,<br \/>\n        value.Length) &gt;= 0;<br \/>\n}<\/p>\n<p>What\u2019s interesting about this is the SpanHelpers.IndexOf it delegates to already contains a fast path that special-cases single-character strings:<\/p>\n<p>if (valueTailLength == 0)<br \/>\n{<br \/>\n    \/\/ for single-char values use plain IndexOf<br \/>\n    return IndexOfChar(ref searchSpace, value, searchSpaceLength);<br \/>\n}<\/p>\n<p>Why then is this extra if block helpful? It\u2019s taking advantage of that same internal IsKnownConstant intrinsic we saw earlier. The JIT will always compile this method down to a true\/false constant, so it ends up adding no runtime overhead. If the value is false, the whole if block evaporates. But if it\u2019s true, that necessarily means the argument passed to the method is recognized by the JIT as being a constant, e.g. a developer called someString.Contains(&#8220;-&#8220;) such that the JIT can see that value is &#8220;-&#8220;. In such a case, the JIT also knows value.Length, such that it can see at compile time whether it\u2019s 1 or not. 
And that in turn means this whole method becomes:<\/p>\n<p>public bool Contains(string value)<br \/>\n{<br \/>\n    if (value == null)<br \/>\n        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);<\/p>\n<p>    return SpanHelpers.IndexOf(<br \/>\n        ref _firstChar,<br \/>\n        Length,<br \/>\n        ref value._firstChar,<br \/>\n        value.Length) &gt;= 0;<br \/>\n}<\/p>\n<p>if the JIT can\u2019t prove the argument is a constant or if it\u2019s not exactly one character in length, or:<\/p>\n<p>public bool Contains(string value)<br \/>\n{<br \/>\n    return Contains(&#8216;the constant char&#8217;);<br \/>\n}<\/p>\n<p>if it can. This eliminates a bit of overhead from the call.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _input = &#8220;!@#$%^&amp;&#8221;;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool Contains() =&gt; _input.Contains(&#8220;$&#8221;);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains<br \/>\n.NET 8.0<br \/>\n3.7649 ns<br \/>\n1.00<\/p>\n<p>Contains<br \/>\n.NET 9.0<br \/>\n0.9614 ns<br \/>\n0.26<\/p>\n<h3>Regex<\/h3>\n<p>Regular expression support in .NET has received a lot of love over the past few years. 
The implementation was overhauled in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regex-performance-improvements-in-net-5\/\">.NET 5<\/a> to yield significant performance gains, and then in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/regular-expression-improvements-in-dotnet-7\/\">.NET 7<\/a> it not only saw another round of huge performance gains, it also gained a source generator, a new non-backtracking implementation, and more. In .NET 8, it saw <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/#regex\">additional performance improvements<\/a>, in part because of using SearchValues.<\/p>\n<p>Now in .NET 9, the trend continues. First and foremost, it\u2019s important to recognize that many of the changes discussed thus far implicitly accrue to Regex. Regex already uses SearchValues, and so improvements to SearchValues benefit Regex (it\u2019s one of my favorite things about working at the lowest levels of the stack: improvements there have a multiplicative effect, in that direct use of them improves, but so too does indirect use via intermediate components that instantly get better as the lower level does). Beyond that, though, Regex has increased its reliance on SearchValues.<\/p>\n<p>There are multiple engines backing Regex today:<\/p>\n<p>An interpreter, which is what you get when you don\u2019t explicitly ask for one of the other engines.<br \/>\nA reflection-emit-based compiler, which at run-time emits custom IL for the specific regular expression and options. This is what you get when you specify RegexOptions.Compiled.<br \/>\nA non-backtracking engine, which doesn\u2019t support all of Regex\u2018s features but which guarantees O(N) throughput in the length of the input. This is what you get when you specify RegexOptions.NonBacktracking.<br \/>\nAnd a source generator, which is very similar to the compiler, except it emits C# at build-time rather than emitting IL at run-time. 
This is what you get when you use [GeneratedRegex(&#8230;)].<\/p>\n<p>As of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98791\">dotnet\/runtime#98791<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103496\">dotnet\/runtime#103496<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98880\">dotnet\/runtime#98880<\/a>, all of the engines other than the interpreter avail themselves of the new SearchValues&lt;string&gt; support (the interpreter could as well, but we make an assumption that someone is using the interpreter in order to optimize for the speed of Regex construction, and the analyses involved in choosing to use SearchValues&lt;string&gt; can take measurable time). The best way to see what this looks like is via the source generator, as we can easily examine the code it outputs in both .NET 8 and .NET 9. Consider this code:<\/p>\n<p>using System.Text.RegularExpressions;<\/p>\n<p>internal partial class Example<br \/>\n{<br \/>\n    [GeneratedRegex(&#8220;(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday): (.*)&#8221;, RegexOptions.IgnoreCase)]<br \/>\n    public static partial Regex ParseEntry();<br \/>\n}<\/p>\n<p>In Visual Studio, you can right-click on ParseEntry, select \u201cGo To Definition,\u201d and the tool will take you to the C# code for this pattern as generated by the regular expression source generator (the pattern is looking for a day of the week, followed by a colon, then followed by any text, and is capturing both the day and that subsequent text into capture groups for subsequent exploration). The generated code contains two relevant methods: a TryFindNextPossibleStartingPosition method, which is used to skip ahead as quickly as possible to the first location that might possibly match, and a TryMatchAtCurrentPosition method, which performs the full match attempt at that location. 
For our purposes here, we care about TryFindNextPossibleStartingPosition, as that\u2019s the most impactful place SearchValues shows up. On .NET 8, we see this code:<\/p>\n<p>private bool TryFindNextPossibleStartingPosition(ReadOnlySpan&lt;char&gt; inputSpan)<br \/>\n{<br \/>\n    int pos = base.runtextpos;<br \/>\n    ulong charMinusLowUInt64;<\/p>\n<p>    \/\/ Any possible match is at least 8 characters.<br \/>\n    if (pos &lt;= inputSpan.Length &#8211; 8)<br \/>\n    {<br \/>\n        \/\/ The pattern matches a character in the set [ADSYadsy] at index 5.<br \/>\n        \/\/ Find the next occurrence. If it can&#8217;t be found, there&#8217;s no match.<br \/>\n        ReadOnlySpan&lt;char&gt; span = inputSpan.Slice(pos);<br \/>\n        for (int i = 0; i &lt; span.Length &#8211; 7; i++)<br \/>\n        {<br \/>\n            int indexOfPos = span.Slice(i + 5).IndexOfAny(Utilities.s_ascii_1200080212000802);<br \/>\n            if (indexOfPos &lt; 0)<br \/>\n            {<br \/>\n                goto NoMatchFound;<br \/>\n            }<br \/>\n            i += indexOfPos;<\/p>\n<p>            if (((long)((0x8106400081064000UL &lt;&lt; (int)(charMinusLowUInt64 = (uint)span[i] &#8211; &#8216;F&#8217;)) &amp; (charMinusLowUInt64 &#8211; 64)) &lt; 0) &amp;&amp;<br \/>\n                ((long)((0x8023400080234000UL &lt;&lt; (int)(charMinusLowUInt64 = (uint)span[i + 3] &#8211; &#8216;D&#8217;)) &amp; (charMinusLowUInt64 &#8211; 64)) &lt; 0))<br \/>\n            {<br \/>\n                base.runtextpos = pos + i;<br \/>\n                return true;<br \/>\n            }<br \/>\n        }<br \/>\n    }<\/p>\n<p>    \/\/ No match found.<br \/>\n    NoMatchFound:<br \/>\n    base.runtextpos = inputSpan.Length;<br \/>\n    return false;<br \/>\n}<\/p>\n<p>The code is using an IndexOfAny with Utilities.s_ascii_1200080212000802; what is that? 
It\u2019s a SearchValues&lt;char&gt;:<\/p>\n<p>\/\/\/ &lt;summary&gt;Supports searching for characters in or not in &#8220;ADSYadsy&#8221;.&lt;\/summary&gt;<br \/>\ninternal static readonly SearchValues&lt;char&gt; s_ascii_1200080212000802 = SearchValues.Create(&#8220;ADSYadsy&#8221;);<\/p>\n<p>The source generator is employing the approach we looked at earlier, searching for a single character from each string. Here it\u2019s decided that its best chance for an optimal search is to look for the character at offset 5 in each string, so \u2018y\u2019 for \u201cMonday\u201d, \u2018a\u2019 for \u201cTuesday\u201d, etc., plus looking for the upper-case variants since RegexOptions.IgnoreCase was specified. Then after the single-character search, it\u2019s doing a quick test for a couple of other positions in the string to try to weed out false positives, looking at the 0th offset to ensure the character is in the set [MTWFSmtwfs] and the 3rd offset to ensure the character is in the set [DSNRUdsnru]. (The check for those is obscured by it using a branchless technique to query a bitmap stored in a 64-bit ulong.)<\/p>\n<p>Now, here\u2019s what we get in .NET 9:<\/p>\n<p>private bool TryFindNextPossibleStartingPosition(ReadOnlySpan&lt;char&gt; inputSpan)<br \/>\n{<br \/>\n    int pos = base.runtextpos;<\/p>\n<p>    \/\/ Any possible match is at least 8 characters.<br \/>\n    if (pos &lt;= inputSpan.Length &#8211; 8)<br \/>\n    {<br \/>\n        \/\/ The pattern has multiple strings that could begin the match. 
Search for any of them.<br \/>\n        \/\/ If none can be found, there&#8217;s no match.<br \/>\n        int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374);<br \/>\n        if (i &gt;= 0)<br \/>\n        {<br \/>\n            base.runtextpos = pos + i;<br \/>\n            return true;<br \/>\n        }<br \/>\n    }<\/p>\n<p>    \/\/ No match found.<br \/>\n    base.runtextpos = inputSpan.Length;<br \/>\n    return false;<br \/>\n}<\/p>\n<p>Again, we see an IndexOfAny, but notice that the subsequent checks for other positions are gone. Why? Because the SearchValues now being passed to IndexOfAny is a SearchValues&lt;string&gt;, and thus already confirms that one of the provided strings matches:<\/p>\n<p>\/\/\/ &lt;summary&gt;Supports searching for the specified strings.&lt;\/summary&gt;<br \/>\ninternal static readonly SearchValues&lt;string&gt; s_indexOfAnyStrings_OrdinalIgnoreCase_B7E3C0B8368AC400913BEA56D1872F43698FDA2C54D1AD4886F6734244613374 =<br \/>\n    SearchValues.Create(<br \/>\n        [&#8220;monday&#8221;, &#8220;tuesday&#8221;, &#8220;wednesda&#8221;, &#8220;thursday&#8221;, &#8220;friday&#8221;, &#8220;saturday&#8221;, &#8220;sunday&#8221;],<br \/>\n        StringComparison.OrdinalIgnoreCase);<\/p>\n<p>The sharp-eyed amongst you might notice that there\u2019s no \u2018y\u2019 at the end of \u201cWednesday\u201d; that\u2019s simply due to a heuristic in the Regex implementation. When it searches for strings to use as part of such a SearchValues&lt;string&gt;, it limits itself to strings no longer than eight characters. And \u201csearches\u201d is an appropriate word here, as the implementation isn\u2019t limited just to clean alternations as in the previous example. 
If I instead change the program to be:<\/p>\n<p>using System.Text.RegularExpressions;<\/p>\n<p>internal partial class Example<br \/>\n{<br \/>\n    [GeneratedRegex(&#8220;[Aa]([Bb][Cc]|[Dd][Ee])&#8221;)]<br \/>\n    public static partial Regex ParseEntry();<br \/>\n}<\/p>\n<p>we still end up with a SearchValues&lt;string&gt;, now for this:<\/p>\n<p>\/\/\/ &lt;summary&gt;Supports searching for the specified strings.&lt;\/summary&gt;<br \/>\ninternal static readonly SearchValues&lt;string&gt; s_indexOfAnyStrings_OrdinalIgnoreCase_33A76C255741CD9630059173F803FB92EBDDFBF62328261428CF8838D6379CE9 =<br \/>\n    SearchValues.Create([&#8220;abc&#8221;, &#8220;ade&#8221;], StringComparison.OrdinalIgnoreCase);<\/p>\n<p>Interestingly, as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96402\">dotnet\/runtime#96402<\/a>, SearchValues&lt;string&gt; will also be used when doing a single string search. As previously noted, IndexOf(string) will try to pick two characters and do a vectorized search for both, whereas SearchValues&lt;string&gt; for that same input can spend a bit more time trying to pick more characters and characters that will be better for the search. As such, Regex now opts to use SearchValues&lt;string&gt; as part of TryFindNextPossibleStartingPosition. 
We can see this with the following benchmarks that count the number of occurrences of the words \u201cHello\u201d or \u201cearth\u201d:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic partial class Tests<br \/>\n{<br \/>\n    public static readonly string s_markTwain = new HttpClient().GetStringAsync(&#8220;https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt&#8221;).Result;<\/p>\n<p>    [GeneratedRegex(@&#8221;\bHello\b&#8221;)]<br \/>\n    public static partial Regex FindHello();<\/p>\n<p>    [GeneratedRegex(@&#8221;\bearth\b&#8221;)]<br \/>\n    public static partial Regex FindEarth();<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountHello() =&gt; FindHello().Count(s_markTwain);<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountEarth() =&gt; FindEarth().Count(s_markTwain);<br \/>\n}<\/p>\n<p>On .NET 8, the code generated for TryFindNextPossibleStartingPosition for FindEarth includes:<\/p>\n<p>int i = inputSpan.Slice(pos).IndexOf(&#8220;earth&#8221;);<\/p>\n<p>whereas on .NET 9, the generated code is:<\/p>\n<p>int i = inputSpan.Slice(pos).IndexOfAny(Utilities.s_indexOfString_earth_Ordinal);<br \/>\n&#8230;<br \/>\ninternal static readonly SearchValues&lt;string&gt; s_indexOfString_earth_Ordinal =<br \/>\n    SearchValues.Create([&#8220;earth&#8221;], StringComparison.Ordinal);<\/p>\n<p>And the results:<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>CountHello<br \/>\n.NET 8.0<br \/>\n2.020 ms<br \/>\n1.00<\/p>\n<p>CountHello<br \/>\n.NET 9.0<br \/>\n2.042 ms<br \/>\n1.01<\/p>\n<p>CountEarth<br \/>\n.NET 8.0<br \/>\n2.738 ms<br 
\/>\n1.00<\/p>\n<p>CountEarth<br \/>\n.NET 9.0<br \/>\n2.339 ms<br \/>\n0.85<\/p>\n<p>This highlights that using SearchValues&lt;string&gt; in this one-string case doesn\u2019t always help, but it can improve things, in particular in situations where the extra one-time work done by the SearchValues.Create enables it to find meaningfully better characters for which to search.<\/p>\n<p>My seeming obsession with SearchValues&lt;T&gt; might lead one to believe that it\u2019s the only source of improvements in Regex, but that\u2019s far from the truth. There are many other PRs in .NET 9 focused on different aspects that improved the area.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93190\">dotnet\/runtime#93190<\/a> is a nice addition. One of the optimizations introduced to Regex in .NET 7 was a \u201cliteral-after-loop\u201d search. A lot of effort goes into finding ways to help Regex\u2019s TryFindNextPossibleStartingPosition be as efficient as possible at skipping unnecessary locations, and this \u201cliteral-after-loop\u201d search is one such mechanism. It looks for a particular shape of pattern, where the pattern starts with a loop that\u2019s then followed by a literal. For example, the industry regex benchmarks in <a href=\"https:\/\/github.com\/mariomka\/regex-benchmark\/blob\/17d073ec864931546e2694783f6231e4696a9ed4\/csharp\/Benchmark.cs\">mariomka\/regex-benchmark<\/a> include this pattern for finding URIs:<\/p>\n<p>@&#8221;[\w]+:\/\/[^\/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?&#8221;<\/p>\n<p>The pattern starts with a word character loop. We don\u2019t have a good way to vectorize a search for any word character, nor would we really want to; there are over 50,000 word characters that are part of the \w set, and in most inputs we\u2019d typically find an occurrence so quickly that it wouldn\u2019t be worth the vectorization. 
However, the &#8220;:\/\/&#8221; that follows is easily searchable and much less likely to occur, making it a good candidate for TryFindNextPossibleStartingPosition. We can\u2019t simply search for the &#8220;:\/\/&#8221;, though: it doesn\u2019t start the pattern, nor is it at a fixed offset from the beginning of the pattern that would let us find the &#8220;:\/\/&#8221; and then jump backwards a known number of positions. Instead, with the \u201cliteral-after-loop\u201d optimization, we can find the &#8220;:\/\/&#8221; and then iterate backwards to the beginning of the loop in order to find the actual starting position for the match attempt (we can also keep track of where the loop ends so that we don\u2019t need to re-match it).<\/p>\n<p>There were, however, a number of gaps in this optimization. Most notably, the implementation needs to examine the pattern to determine whether it\u2019s applicable. If the starting loop was wrapped in a capture or an atomic group, the analysis unnecessarily gave up and failed to discover the loop for the purposes of enabling the \u201cliteral-after-loop\u201d mechanism. The search would also give up if the literal after the loop was a set inside of various grouping constructs, like a concatenation.<\/p>\n<p>This PR fixed those gaps. 
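The core of the idea can be sketched in a few lines (hypothetical helper, using `char.IsLetterOrDigit` as a rough stand-in for the `\w` class; the real code is generated per-pattern): find the cheap literal first, then walk backwards over the loop's characters to recover where the match attempt should actually begin.

```csharp
using System;

// Prints 4, the index of the 'h' in "https", even though the search
// started from the "://" literal rather than from the loop.
Console.WriteLine(FindSchemeStart("see https://example.com for details"));

// Hypothetical sketch of the "literal-after-loop" idea for a pattern like
// \w+:// -- search for the rare, cheap literal ("://"), then iterate
// backwards over the preceding loop characters to find the match start.
static int FindSchemeStart(ReadOnlySpan<char> text)
{
    int literalPos = text.IndexOf("://".AsSpan());
    if (literalPos <= 0)
        return -1; // no literal, or no room for at least one loop character

    // Walk backwards over "word-ish" characters preceding the literal.
    int start = literalPos;
    while (start > 0 && char.IsLetterOrDigit(text[start - 1]))
        start--;

    return start < literalPos ? start : -1; // the loop requires at least one char
}
```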
The impact of this can be seen by looking at another industry benchmark, this time from the <a href=\"https:\/\/github.com\/BurntSushi\/rebar#ruff-noqa\">BurntSushi\/rebar<\/a> repo:<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic partial class Tests<br \/>\n{<br \/>\n    private static string s_haystack = new HttpClient().GetStringAsync(&#8220;https:\/\/raw.githubusercontent.com\/BurntSushi\/rebar\/master\/benchmarks\/haystacks\/wild\/cpython-226484e4.py&#8221;).Result;<\/p>\n<p>    [GeneratedRegex(@&#8221;(\s*)((?:# [Nn][Oo][Qq][Aa])(?::\s?(([A-Z]+[0-9]+(?:[,\s]+)?)+))?)&#8221;)]<br \/>\n    private static partial Regex RuffNoQA();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count() =&gt; RuffNoQA().Count(s_haystack);<br \/>\n}<\/p>\n<p>The impact of the literal-after-loop optimization ends up being obvious in the resulting numbers:<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n197.47 ms<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n18.67 ms<br \/>\n0.09<\/p>\n<p>Improvements in Regex go beyond just the initial searching, as well. An interesting change comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98723\">dotnet\/runtime#98723<\/a>, not because it results in massive performance improvements (though it does yield some nice benefits), but rather because it highlights how improvement possibilities can be found in all manner of places. 
One of the areas we (and pretty much everyone else in the world it seems) have been investing a lot of energy into lately is AI, including into tokenizers, which are components that take text and translate it into a series of numbers that are meaningful to the model into which they\u2019ll be fed. Each model is trained on a set of tokens from a specific tokenizer, and in the case of OpenAI\u2019s models, that tokenizer algorithm is \u201ctiktoken.\u201d The official .NET implementation of tiktoken lives in the <a href=\"https:\/\/www.nuget.org\/packages\/Microsoft.ML.Tokenizers\">Microsoft.ML.Tokenizers<\/a> library, and as part of implementing tiktoken, it follows the reference implementation provided by OpenAI, which uses <a href=\"https:\/\/github.com\/openai\/tiktoken\/blob\/c0ba74c238d18b4824c25f3c27fc8698055b9a76\/tiktoken_ext\/openai_public.py#L103\">a regular expression as part of parsing<\/a>; for consistency and to help ensure correctness, therefore, the .NET implementation does as well. This regex includes the following pattern:<\/p>\n<p>(?i:&#8217;s|&#8217;t|&#8217;re|&#8217;ve|&#8217;m|&#8217;ll|&#8217;d)<\/p>\n<p>What jumped out about this pattern is that it should trigger an optimization in the regex source generator that emits alternations like this as a C# switch statement, as the analyzer should be able to determine that all branches of the alternation are distinct, such that picking one branch because the first character matches necessarily means that no other branch could match. The benefit of a switch here is that it allows the C# compiler to implement a jump table, which means we\u2019re not stuck exploring each branch when we could instead jump right to the correct one. But that optimization wasn\u2019t kicking in. Why? A series of unfortunate events. An earlier optimization was seeing the ll and rewriting that into a repeater (l{2}), which then defeated this alternation optimization because the implementation wasn\u2019t written to examine loops. 
Loops were explicitly being skipped because a loop could be empty, and if empty, it wouldn't have the first character required by the switch. However, we can check whether a loop has a non-zero minimum bound on its number of iterations, as it does in this case, and in such cases we can still factor in up to that minimum number of iterations, all of which are guaranteed. This PR improved the analysis to handle loops well, as evidenced by this micro-benchmark (which has been crafted to accentuate this aspect of the pattern):<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic partial class Tests<br \/>\n{<br \/>\n    private static string s_haystack = new string('y', 10_000);<\/p>\n<p>    [GeneratedRegex("(?i:a|bb|c|dd|e|ff|g|hh|i|jj|k|ll|m|nn|o|pp|q|rr|s|tt|u|vv|w|xx|y|zz)")]<br \/>\n    public static partial Regex Parse();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count() =&gt; Parse().Count(s_haystack);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n315.7 us<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n211.6 us<br \/>\n0.67<\/p>\n<p>The non-backtracking engine also got some nice attention in .NET 9. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102655\">dotnet\/runtime#102655<\/a> from <a href=\"https:\/\/github.com\/ieviev\">@ieviev<\/a> (who submitting a small subset of the changes they\u2019d made as part of some exciting <a href=\"http:\/\/arxiv.org\/abs\/2407.20479\">regex research<\/a> being done in a fork of the library), followed by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104766\">dotnet\/runtime#104766<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105668\">dotnet\/runtime#105668<\/a> made a variety of changes to the non-backtracking implementation, including:<\/p>\n<p><strong>DFA limits.<\/strong> The non-backtracking implementation works by constructing a finite automata, which can be thought of as a graph, with the implementation walking around the graph as it consumes additional characters from the input and uses those to guide what node(s) it transitions to next. The graph is built out lazily, such that nodes are only added as those states are explored, and the nodes can be one of two kinds: DFA (deterministic) or NFA (non-deterministic). DFA nodes ensure that for any given character that comes next in the input, there\u2019s only ever one possible node to which to transition. Not so for NFA, where at any point in time there\u2019s a list of all the possible nodes the system could be in, and moving to the next state means examining each of the current states, finding all possible transitions out of each, and treating the union of all of those new positions as the next state. DFA is thus <em>much<\/em> cheaper than NFA in terms of the overheads involved in walking around the graph, and we want to fall back to NFA only when we absolutely have to, which is when the DFA graph would be too large: some patterns have the potential to create massive numbers of DFA nodes. Thus, there\u2019s a threshold where once that number of constructed nodes in the graph is hit, new nodes are constructed as NFA rather than DFA. 
In .NET 8 and earlier, that limit was somewhat arbitrarily set at 10,000. For .NET 9 as part of this PR, analysis was done to show that a much higher limit was worth the memory trade-offs, and the limit was raised to 125,000, which means many more patterns can fully execute as DFA.<br \/>\n<strong>Minterm mappings.<\/strong> The implementation works in terms of \u201cminterms,\u201d which are equivalence classes for all characters that behave the same in the pattern. For example, with the pattern &#8220;[a-z]*&#8221;, the lowercase ASCII letters are all treated the same and are all treated differently from every other character, so there are two minterms here, one for the 26 lowercase ASCII letters, and the other for the remaining 65,510 characters. This is used as a compression mechanism, as rather than needing to describe the transitions between nodes for every character, the system can instead do so for every minterm. Of course, that means during matching there\u2019s a step where a character needs to be mapped to its minterm in order to know which edge to follow to the next state. Previously, that mapping was cached for all ASCII characters but recomputed each time for non-ASCII (recomputing it amounts to a binary search on a tree data structure). As you can imagine, this can lead to significant overhead when non-ASCII is encountered. Now in .NET 9, mappings for all characters represented in the pattern are stored. In degenerate cases, this can measurably increase memory consumption for the Regex instance, but on average it doesn\u2019t; in fact, for common cases the new scheme actually reduces memory consumption, as it takes into account the fact that all but the most niche patterns have fewer than 256 minterms, and the per-character mapping can thus be stored in a byte rather than a ushort or uint. 
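To make the minterm idea concrete, here is a rough Python sketch of the equivalence-class notion (a hypothetical two-class example, not the engine's algorithm): characters share a minterm exactly when they fall into the same subset of the pattern's character classes.

```python
import string

# Hypothetical pattern using the classes [a-z] and [0-9]; the real engine
# computes these equivalence classes over the whole pattern.
classes = [set(string.ascii_lowercase), set(string.digits)]

def minterm(ch):
    # A character's minterm is identified by which classes it belongs to.
    return tuple(ch in c for c in classes)

assert minterm("a") == minterm("z")      # all lowercase letters share a minterm
assert minterm("0") == minterm("9")      # all digits share another
assert minterm("a") != minterm("0")      # letters and digits are distinct
assert minterm("!") == minterm("\u00e9") # everything else collapses together
```

With the pattern "[a-z]*" from the text there would be just two such classes; the .NET 9 change caches this character-to-minterm mapping for every character that appears in the pattern, rather than only for ASCII.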
Additionally, for cases where only a subset of ASCII is used in the pattern (which is common), the Regex instance needn\u2019t allocate an array to represent all 128 ASCII characters, but can instead be shrunk to only support those characters that need be represented.<br \/>\n<strong>Timeout checks.<\/strong> Regex has long supported a timeout mechanism, where if a match operation takes longer than a specified limit, an exception is thrown. This mechanism exists to help mitigate possible regex denial of service (ReDOS) attacks, where a maliciously-constructed pattern when fed to a backtracking engine could lead to \u201ccatastrophic backtracking\u201d (you can see an example of this in my <a href=\"https:\/\/www.youtube.com\/watch?v=ptKjWPC7pqw\">Deep .NET discussion on Regex<\/a> with Scott Hanselman). These timeouts are thus enabled in the interpreter, the compiler, and the source generator. For the non-backtracking engine, timeouts aren\u2019t necessary to avoid catastrophic backtracking, as there is no backtracking. The engine still pays some attention to timeouts, though, purely for consistency with the other engines, yet the frequency of the checks was actually adding measurable overhead in some cases. The PR reduced the frequency of the checks to mitigate that overhead while not meaningfully affecting the effectiveness of the checks.<br \/>\n<strong>Hot path inner loop.<\/strong> The inner matching loop is the hot path for a matching operation: read the next character, look up its minterm, follow the corresponding edge to the next node in the graph, rinse and repeat. Performance of the engine is tied to efficiency of this loop. These PRs recognized that there were some checks being performed in that inner loop which were only relevant to a minority of patterns. 
For the majority, the code could be specialized such that those checks wouldn't be needed in the hot path.<br \/>\n<strong>General good hygiene.<\/strong> Care was taken to remove unnecessary overheads, such as duplicate array lookups that could be removed, bounds checks that could be avoided, indirect reads via refs that could instead be done against locals, and so on.<\/p>\n<p>The net result of these changes is that most patterns get faster, some significantly, especially on non-ASCII inputs.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_aristotle = new HttpClient().GetStringAsync("https:\/\/www.gutenberg.org\/cache\/epub\/39963\/pg39963.txt").Result;<br \/>\n    private readonly Regex _word = new Regex(@"\b[\p{IsGreek}]+\b", RegexOptions.NonBacktracking);<\/p>\n<p>    [Benchmark]<br \/>\n    public int CountWords() =&gt; _word.Matches(s_aristotle).Count;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>CountWords<br \/>\n.NET 8.0<br \/>\n14.808 ms<br \/>\n1.00<\/p>\n<p>CountWords<br \/>\n.NET 9.0<br \/>\n9.673 ms<br \/>\n0.65<\/p>\n<p>The <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a> repository employs an automated performance regression testing system, with tests in <a href=\"https:\/\/github.com\/dotnet\/performance\">dotnet\/performance<\/a> constantly running on various operating systems and hardware, with the goal of detecting regressions. When a possible regression is noticed, an issue is opened containing the details. 
However, the system also notices statistically-significant improvements and opens issues on those as well, just to ensure that we're all aware of when and how things change in a meaningful way. When possible, the issues reference the PR known to have caused the regression or improvement, so it's always a treat to see a long list of such references on a PR, as was the case with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102655\">dotnet\/runtime#102655<\/a>.<\/p>\n<p>Both the non-backtracking engine and the interpreter now also gain additional optimized searching for certain classes of prefixes they didn't previously support. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100315\">dotnet\/runtime#100315<\/a>, patterns that begin with ranges can now be optimized with an IndexOfAny{Except}InRange call, whereas previously such patterns would likely result in walking character by character.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt").Result;<\/p>\n<p>    private readonly Regex _interpreter = new Regex(@"\b[0-9]+\b");<br \/>\n    private readonly Regex _nonBacktracking = new Regex(@"\b[0-9]+\b", RegexOptions.NonBacktracking);<\/p>\n<p>    [Benchmark]<br \/>\n    public int Interpreter() =&gt; _interpreter.Count(s_markTwain);<\/p>\n<p>    [Benchmark]<br \/>\n    public int NonBacktracking() =&gt; _nonBacktracking.Count(s_markTwain);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Interpreter<br \/>\n.NET 8.0<br \/>\n21.223 ms<br \/>\n1.00<\/p>\n<p>Interpreter<br \/>\n.NET 9.0<br \/>\n1.726 ms<br \/>\n0.08<\/p>\n<p>NonBacktracking<br \/>\n.NET 8.0<br \/>\n21.945 ms<br \/>\n1.00<\/p>\n<p>NonBacktracking<br \/>\n.NET 9.0<br \/>\n1.749 ms<br \/>\n0.08<\/p>\n<p>Finally, Regex gains some new APIs in .NET 9, focused on performance. Regex currently has a set of Split overloads; these logically behave like Match, except instead of returning what matched, they effectively return what's between the matches, treating the match as a split separator. As with string.Split, these Regex.Split methods return a string[], which means allocating the array to store all the results and allocating each of the individual strings. There was also no overload supporting span inputs, which meant that if one had a span to search, that span would first need to be converted into a string, yet another allocation. .NET 7 saw a similar predicament fixed with the introduction of the EnumerateMatches method, which provided an allocation-free alternative to Match and Matches. Now in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103307\">dotnet\/runtime#103307<\/a>, Regex gets new EnumerateSplits methods, which similarly provide an allocation-free way to access the same splits. 
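The shape of such an API can be sketched generically (a Python illustration, since Python's re module has no direct equivalent): instead of materializing each piece as its own string the way split does, yield only the (start, end) positions and let the caller slice into the original input.

```python
import re

def enumerate_splits(pattern, text):
    # Yield (start, end) ranges of the pieces between matches, instead of
    # allocating a new string for each piece the way split() does.
    pos = 0
    for m in pattern.finditer(text):
        yield (pos, m.start())
        pos = m.end()
    yield (pos, len(text))

ws = re.compile(r"\s+")
text = "The quick  brown fox"
ranges = list(enumerate_splits(ws, text))

# Slicing the ranges recovers exactly what split() would have allocated.
assert [text[a:b] for a, b in ranges] == ws.split(text)
```

The caller pays for string materialization only for the pieces it actually needs, which is the same trade EnumerateSplits offers over Split.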
The method accepts a ReadOnlySpan&lt;char&gt;, and then rather than returning an array of strings, it returns an enumerator of Ranges pointing into the original.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    public static readonly string s_markTwain = new HttpClient().GetStringAsync("https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt").Result;<br \/>\n    private readonly Regex _whitespace = new Regex(@"\s+", RegexOptions.Compiled);<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public int SplitOnWhitespace_Split()<br \/>\n    {<br \/>\n        int lengths = 0;<br \/>\n        foreach (string split in _whitespace.Split(s_markTwain))<br \/>\n        {<br \/>\n            lengths += split.Length;<br \/>\n        }<br \/>\n        return lengths;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public int SplitOnWhitespace_EnumerateSplits()<br \/>\n    {<br \/>\n        int lengths = 0;<br \/>\n        foreach (Range range in _whitespace.EnumerateSplits(s_markTwain))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; split = s_markTwain.AsSpan(range);<br \/>\n            lengths += split.Length;<br \/>\n        }<br \/>\n        return lengths;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>SplitOnWhitespace_Split<br \/>\n189.1 ms<br \/>\n1.00<br \/>\n185305389 B<br \/>\n1.000<\/p>\n<p>SplitOnWhitespace_EnumerateSplits<br \/>\n116.6 ms<br \/>\n0.62<br \/>\n272 B<br \/>\n0.000<\/p>\n<h3>Encoding<\/h3>\n<p>Base64 
encoding has been supported in .NET since the beginning, with methods like Convert.ToBase64String and Convert.FromBase64CharArray. More recently, a plethora of Base64-related APIs have been added, including span-based APIs on Convert but also a dedicated System.Buffers.Text.Base64 with methods for encoding and decoding between arbitrary bytes and UTF8 text, and most recently for very efficiently checking whether UTF8 and UTF16 text represents a valid Base64 payload.<\/p>\n<p>Base64 is a fairly simple encoding scheme for taking arbitrary binary data and converting it to ASCII text, splitting the input up into groups of 6 bits (2^6 == 64 possible values) and mapping each of those values to a specific character in the Base64 alphabet: the 26 upper-case ASCII letters, the 26 lower-case ASCII letters, the 10 ASCII digits, &#8216;+&#8217;, and &#8216;\/&#8217;. While this is an incredibly popular encoding mechanism, it runs into problems for some use cases because of the exact choice of alphabet. Including Base64 data in a URI is possibly problematic, as &#8216;+&#8217; and &#8216;\/&#8217; both have special meaning in URIs, as does the special &#8216;=&#8217; symbol used for padding Base64 data out to a specific length. That means that in addition to Base64-encoding data, the resulting data might also need to be URL-encoded for such a use, both taking additional time and further increasing the size of the payload. To address this, a variant was introduced, Base64Url, which does away with the need for padding, and which uses a slightly different alphabet, &#8216;-&#8216; instead of &#8216;+&#8217; and &#8216;_&#8217; instead of &#8216;\/&#8217;. 
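The alphabet difference is easy to observe with any library that offers both variants; here's a quick Python illustration (note that Python's urlsafe variant swaps the alphabet but, unlike .NET's new Base64Url support described below, still emits '=' padding):

```python
import base64

# Bytes chosen so the 6-bit groups hit indices 62 and 63 of the alphabet,
# i.e. exactly the characters the two alphabets disagree on.
data = bytes([0xFF, 0xFF, 0xFE])

std = base64.b64encode(data).decode()          # standard alphabet: '+' and '/'
url = base64.urlsafe_b64encode(data).decode()  # '-' and '_' instead

assert std == "///+"
assert url == "___-"
```

Only the last two alphabet entries differ, which is what makes it practical for one implementation to serve both encodings via a substituted lookup table.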
Base64Url is used in a variety of domains, including as part of <a href=\"https:\/\/en.wikipedia.org\/wiki\/JSON_Web_Token\">JSON Web Tokens (JWT)<\/a>, where it's used to encode each segment of the token.<\/p>\n<p>While .NET has had Base64 support for a long time, it hasn't had Base64Url support, and as such, developers have had to craft their own. Many have done so by layering on top of the Base64 implementations in Convert or Base64. For example, here's what the core part of ASP.NET's implementation for WebEncoders.Base64UrlEncode looked like in .NET 8:<\/p>\n<p>private static int Base64UrlEncode(ReadOnlySpan&lt;byte&gt; input, Span&lt;char&gt; output)<br \/>\n{<br \/>\n    if (input.IsEmpty)<br \/>\n        return 0;<\/p>\n<p>    Convert.TryToBase64Chars(input, output, out int charsWritten);<\/p>\n<p>    for (var i = 0; i &lt; charsWritten; i++)<br \/>\n    {<br \/>\n        var ch = output[i];<br \/>\n        if (ch == '+') output[i] = '-';<br \/>\n        else if (ch == '\/') output[i] = '_';<br \/>\n        else if (ch == '=') return i;<br \/>\n    }<\/p>\n<p>    return charsWritten;<br \/>\n}<\/p>\n<p>We can obviously write more code to do that more efficiently, but with .NET 9 we don't have to. With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102364\">dotnet\/runtime#102364<\/a>, .NET now has a fully-featured Base64Url type that is also very efficient. It actually shares almost all of its implementation with the same functionality on Base64 and Convert, using generic tricks to substitute the different alphabets in an optimized manner. 
(The ASP.NET implementation has also been updated to use Base64Url with <a href=\"https:\/\/github.com\/dotnet\/aspnetcore\/pull\/56959\">dotnet\/aspnetcore#56959<\/a> and <a href=\"https:\/\/github.com\/dotnet\/aspnetcore\/pull\/57050\">dotnet\/aspnetcore#57050<\/a>.)<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers.Text;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _data;<br \/>\n    private char[] _destination = new char[Base64.GetMaxEncodedToUtf8Length(1024 * 1024)];<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _data = new byte[1024 * 1024];<br \/>\n        new Random(42).NextBytes(_data);<br \/>\n    }<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public int Old() =&gt; Base64UrlOld(_data, _destination);<\/p>\n<p>    [Benchmark]<br \/>\n    public int New() =&gt; Base64Url.EncodeToChars(_data, _destination);<\/p>\n<p>    static int Base64UrlOld(ReadOnlySpan&lt;byte&gt; input, Span&lt;char&gt; output)<br \/>\n    {<br \/>\n        if (input.IsEmpty)<br \/>\n            return 0;<\/p>\n<p>        Convert.TryToBase64Chars(input, output, out int charsWritten);<\/p>\n<p>        for (var i = 0; i &lt; charsWritten; i++)<br \/>\n        {<br \/>\n            var ch = output[i];<br \/>\n            if (ch == '+')<br \/>\n            {<br \/>\n                output[i] = '-';<br \/>\n            }<br \/>\n            else if (ch == '\/')<br \/>\n            {<br \/>\n                output[i] = '_';<br \/>\n            }<br \/>\n            else if (ch == '=')<br \/>\n            {<br \/>\n                return i;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return charsWritten;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Old<br \/>\n1,314.20 us<br \/>\n1.00<\/p>\n<p>New<br \/>\n81.36 us<br \/>\n0.06<\/p>\n<p>This also benefits from a set of changes that improved the performance of Base64, and thus also Base64Url, since they now share the same code. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92241\">dotnet\/runtime#92241<\/a> from <a href=\"https:\/\/github.com\/DeepakRajendrakumaran\">@DeepakRajendrakumaran<\/a> added an AVX512-optimized Base64 encoding\/decoding implementation, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95513\">dotnet\/runtime#95513<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100589\">dotnet\/runtime#100589<\/a>, both from <a href=\"https:\/\/github.com\/SwapnilGaikwad\">@SwapnilGaikwad<\/a>, optimized Base64 encoding and decoding for Arm64.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _toEncode;<br \/>\n    private char[] _encoded;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _toEncode = new byte[1000];<br \/>\n        new Random(42).NextBytes(_toEncode);<br \/>\n        _encoded = new char[Convert.ToBase64String(_toEncode).Length];<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public void ConvertToBase64() =&gt; Convert.ToBase64CharArray(_toEncode, 0, _toEncode.Length, _encoded, 0);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br
\/>\nMean<br \/>\nRatio<\/p>\n<p>ConvertToBase64<br \/>\n.NET 8.0<br \/>\n104.55 ns<br \/>\n1.00<\/p>\n<p>ConvertToBase64<br \/>\n.NET 9.0<br \/>\n60.19 ns<br \/>\n0.58<\/p>\n<p>Another simpler form of encoding is hex, effectively employing an alphabet of 16 characters (for each group of 4 bits) rather than 64 (for each group of 6 bits). .NET 5 introduced the Convert.ToHexString set of methods, which take an input ReadOnlySpan&lt;byte&gt; or byte[] and produce an output string with two hex chars per input byte. The alphabet selected for that encoding is the hexadecimal characters '0' through '9' followed by the upper-case 'A' through 'F'. That's great when you want upper-case, but sometimes you want the lower-case 'a' through 'f' instead. As a result, it's not uncommon now to see calls like this:<\/p>\n<p>string result = Convert.ToHexString(bytes).ToLowerInvariant();<\/p>\n<p>where ToHexString produces one string and then ToLowerInvariant possibly produces another ("possibly" because it'll only need to create a new string if the data contained any letters). 
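The shape of that workaround, and of going straight to lower-case, can be sketched in a few lines of Python (illustrative only):

```python
data = bytes([0xAB, 0xCD, 0x01])

# The workaround shape: build the upper-case hex string, then lower-case it,
# which can mean producing a second string.
upper = "".join(f"{b:02X}" for b in data)
workaround = upper.lower()

# Going directly to lower-case hex avoids the intermediate string entirely.
direct = data.hex()

assert upper == "ABCD01"
assert workaround == direct == "abcd01"
```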
With .NET 9 and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92483\">dotnet\/runtime#92483<\/a> from <a href=\"https:\/\/github.com\/determ1ne\">@determ1ne<\/a>, the new Convert.ToHexStringLower methods may be used to go directly to the lower-case version; that PR also introduced the TryToHexString and TryToHexStringLower methods, which format directly into a provided destination span rather than allocating anything.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers.Text;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _data = new byte[100];<br \/>\n    private char[] _dest = new char[200];<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup() =&gt; new Random(42).NextBytes(_data);<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public string Old() =&gt; Convert.ToHexString(_data).ToLowerInvariant();<\/p>\n<p>    [Benchmark]<br \/>\n    public string New() =&gt; Convert.ToHexStringLower(_data);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool NewTry() =&gt; Convert.TryToHexStringLower(_data, _dest, out int charsWritten);<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Old<br \/>\n136.69 ns<br \/>\n1.00<br \/>\n848 B<br \/>\n1.00<\/p>\n<p>New<br \/>\n119.09 ns<br \/>\n0.87<br \/>\n424 B<br \/>\n0.50<\/p>\n<p>NewTry<br \/>\n21.97 ns<br \/>\n0.16<br \/>\n–<br \/>\n0.00<\/p>\n<p>Before .NET 5 introduced Convert.ToHexString, there actually already was some functionality in .NET for converting bytes to hex: BitConverter.ToString. 
BitConverter.ToString does the same thing Convert.ToHexString now does, except inserting dashes between every two hex characters (i.e. between every byte). As a result, it became fairly common for folks who wanted the equivalent of ToHexString to instead write BitConverter.ToString(bytes).Replace("-", ""). It's so common to want the dashes removed, in fact, that it's what GitHub Copilot suggests as soon as one types BitConverter.ToString.<\/p>\n<p>Of course, that operation is much more expensive (and more complicated) than just using Convert.ToHexString, so it'd be nice to help developers switch over to ToHexString{Lower}. That's exactly what <a href=\"https:\/\/github.com\/dotnet\/roslyn-analyzers\/pull\/6967\">dotnet\/roslyn-analyzers#6967<\/a> from <a href=\"https:\/\/github.com\/mpidash\">@mpidash<\/a> does: <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/quality-rules\/ca1872\">CA1872<\/a> will now flag both cases that can be converted to Convert.ToHexString and cases that can be converted to Convert.ToHexStringLower.<\/p>\n<p>And that's good for performance, as the difference is quite stark:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _bytes = Enumerable.Range(0, 100).Select(i =&gt; (byte) i).ToArray();<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public string WithBitConverter() =&gt; BitConverter.ToString(_bytes).Replace("-", "").ToLowerInvariant();<\/p>\n<p>    [Benchmark]<br \/>\n    public string WithConvert() =&gt; Convert.ToHexStringLower(_bytes);<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>WithBitConverter<br \/>\n1,707.46 ns<br \/>\n1.00<br \/>\n1472 B<br \/>\n1.00<\/p>\n<p>WithConvert<br \/>\n61.66 ns<br \/>\n0.04<br \/>\n424 B<br \/>\n0.29<\/p>\n<p>There are a variety of reasons for that difference, including the obvious one that the Replace call has to search the input, find all the dashes, and allocate a brand new string without them. However, BitConverter.ToString is also slower in general, as it's not as easily vectorized, due to needing to insert dashes between the resulting characters.<\/p>\n<p>In the other direction, Convert.FromHexString decodes a string of hex back into a new byte[]. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86556\">dotnet\/runtime#86556<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a> adds overloads of FromHexString that write into a destination span rather than allocating a new byte[] each time.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _hex = string.Concat(Enumerable.Repeat("0123456789abcdef", 10));<br \/>\n    private byte[] _dest = new byte[100];<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public byte[] FromHexString() =&gt; Convert.FromHexString(_hex);<\/p>\n<p>    [Benchmark]<br \/>\n    public OperationStatus FromHexStringSpan() =&gt; Convert.FromHexString(_hex.AsSpan(), _dest, out int charsWritten, out int bytesWritten);<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br
\/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>FromHexString<br \/>\n33.78 ns<br \/>\n1.00<br \/>\n104 B<br \/>\n1.00<\/p>\n<p>FromHexStringSpan<br \/>\n18.22 ns<br \/>\n0.54<br \/>\n–<br \/>\n0.00<\/p>\n<h3>Span, Span, and more Span<\/h3>\n<p>The introduction of Span&lt;T&gt; and ReadOnlySpan&lt;T&gt; back in .NET Core 2.1 has revolutionized how we write .NET code (especially in the core libraries) and what APIs we expose (see <a href=\"https:\/\/www.youtube.com\/watch?v=5KdICNWOfEQ\">A Complete .NET Developer's Guide to Span<\/a> if you're interested in a deeper dive). .NET 9 has continued the trend of doubling-down on spans as a great way both to implicitly provide performance boosts and to expose APIs that enable developers to do more for performance in their own code.<\/p>\n<p>One great example of this is the new C# 13 support for "params collections," which merged into the C# compiler's main branch in <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/72511\">dotnet\/roslyn#72511<\/a>. This feature enables the C# params keyword to be used not just with array parameters but with any collection type that's usable with collection expressions, and that includes spans. In fact, the feature makes it so that if there are two overloads, one taking a params T[] and one taking a params ReadOnlySpan&lt;T&gt;, the latter overload will win overload resolution. Moreover, the code generated for a call site for a params ReadOnlySpan&lt;T&gt; is the same non-allocating approach you get for collection expressions, e.g. 
given code like this:<\/p>\n<p>using System;<\/p>\n<p>public class C<br \/>\n{<br \/>\n    public void M()<br \/>\n    {<br \/>\n        Helpers.DoAwesomeStuff("Hello", "World");<br \/>\n    }<br \/>\n}<\/p>\n<p>public static class Helpers<br \/>\n{<br \/>\n    public static void DoAwesomeStuff&lt;T&gt;(params T[] values) { }<br \/>\n    public static void DoAwesomeStuff&lt;T&gt;(params ReadOnlySpan&lt;T&gt; values) { }<br \/>\n}<\/p>\n<p>the IL the C# compiler generates for C.M will be equivalent to something like the following C#:<\/p>\n<p>&lt;&gt;y__InlineArray2&lt;string&gt; buffer = default;<br \/>\n&lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray2&lt;string&gt;, string&gt;(ref buffer, 0) = "Hello";<br \/>\n&lt;PrivateImplementationDetails&gt;.InlineArrayElementRef&lt;&lt;&gt;y__InlineArray2&lt;string&gt;, string&gt;(ref buffer, 1) = "World";<br \/>\nHelpers.DoAwesomeStuff(&lt;PrivateImplementationDetails&gt;.InlineArrayAsReadOnlySpan&lt;&lt;&gt;y__InlineArray2&lt;string&gt;, string&gt;(ref buffer, 2));<\/p>\n<p>This is using the [InlineArray] feature introduced in .NET 8 to stack-allocate a span of strings, and then pass that span into the method. No heap allocation. This is awesome for library developers, because it means any place where you have a method taking a params T[], you can add a params ReadOnlySpan&lt;T&gt; overload, and when consuming code calling that method recompiles, it just gets better. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101308\">dotnet\/runtime#101308<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101499\">dotnet\/runtime#101499<\/a> rely on that to add ~40 new overloads for methods that didn't previously accept spans and now do, and added params to over 20 existing overloads that were already taking spans. 
For example, if code had been using Path.Join to build up a path composed of five or more segments, it previously would have been using the params string[] overload, but now upon recompilation it\u2019ll switch to using the params ReadOnlySpan&lt;string&gt; overload, and won\u2019t need to allocate a string[] for the inputs.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<br \/>\nusing System.Numerics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public string Join() =&gt; Path.Join(&#8220;a&#8221;, &#8220;b&#8221;, &#8220;c&#8221;, &#8220;d&#8221;, &#8220;e&#8221;);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Join<br \/>\n.NET 8.0<br \/>\n30.83 ns<br \/>\n1.00<br \/>\n104 B<br \/>\n1.00<\/p>\n<p>Join<br \/>\n.NET 9.0<br \/>\n24.85 ns<br \/>\n0.81<br \/>\n40 B<br \/>\n0.38<\/p>\n<p>The C# compiler has also improved around spans in other ways. For example, <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/71261\">dotnet\/roslyn#71261<\/a> extends the assembly data support for initializing arrays and ReadOnlySpan&lt;T&gt; to also apply to stackalloc.
If you have code like this:<\/p>\n<p>var array = new char[] { &#8216;a&#8217;, &#8216;b&#8217;, &#8216;c&#8217;, &#8216;d&#8217;, &#8216;e&#8217;, &#8216;f&#8217;, &#8216;g&#8217; };<\/p>\n<p>the compiler will generate code along the lines of the following:<\/p>\n<p>char[] array = new char[7];<br \/>\nRuntimeHelpers.InitializeArray(array, (RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);<\/p>\n<p>The compiler has taken that char data and blit it into the assembly; then when it creates the array, rather than setting each individual value into the array, it just copies that data directly from the assembly into the array. Similarly, if you have:<\/p>\n<p>ReadOnlySpan&lt;char&gt; span = new char[] { &#8216;a&#8217;, &#8216;b&#8217;, &#8216;c&#8217;, &#8216;d&#8217;, &#8216;e&#8217;, &#8216;f&#8217;, &#8216;g&#8217; };<\/p>\n<p>the compiler recognizes that all of the data is constant and is being stored into a \u201cread-only\u201d location, so it doesn\u2019t actually need to allocate an array. Instead, it emits code like:<\/p>\n<p>ReadOnlySpan&lt;char&gt; span =<br \/>\nRuntimeHelpers.CreateSpan&lt;char&gt;((RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2);<\/p>\n<p>which effectively creates a span that points directly into the assembly data; no allocation <em>and<\/em> no copy needed. 
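<p>To make that ReadOnlySpan&lt;T&gt; case concrete, here\u2019s a minimal sketch of the pattern (the helper type and its names are mine, not from any compiler output): exposing constant data through a ReadOnlySpan&lt;char&gt;-returning property lets the compiler serve the data straight out of the assembly, with no array allocation at runtime.<\/p>

```csharp
using System;

// Usage: look up the hex digit for 10; prints 'a'.
Console.WriteLine(HexHelper.ToHexDigit(10));

static class HexHelper
{
    // Every element is a constant and the target is a ReadOnlySpan<char>-returning
    // property, so the compiler blits the data into the assembly and constructs
    // the span directly over it (via RuntimeHelpers.CreateSpan): no array allocation.
    public static ReadOnlySpan<char> HexDigits => new[]
    {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };

    public static char ToHexDigit(int value) => HexDigits[value & 0xF];
}
```

<p>This is the same trick the core libraries use all over the place for small lookup tables.<\/p>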
However, if you have this:<\/p>\n<p>ReadOnlySpan&lt;char&gt; span = stackalloc char[] { &#8216;a&#8217;, &#8216;b&#8217;, &#8216;c&#8217;, &#8216;d&#8217;, &#8216;e&#8217;, &#8216;f&#8217;, &#8216;g&#8217; };<\/p>\n<p>or this:<\/p>\n<p>Span&lt;char&gt; span = stackalloc char[] { &#8216;a&#8217;, &#8216;b&#8217;, &#8216;c&#8217;, &#8216;d&#8217;, &#8216;e&#8217;, &#8216;f&#8217;, &#8216;g&#8217; };<\/p>\n<p>you\u2019d get codegen more like this:<\/p>\n<p>char* ptr = stackalloc char[7];<br \/>\n*(char*)ptr = 97;<br \/>\n*(char*)(ptr + 1) = 98;<br \/>\n*(char*)(ptr + 2) = 99;<br \/>\n*(char*)(ptr + 3) = 100;<br \/>\n*(char*)(ptr + 4) = 101;<br \/>\n*(char*)(ptr + 5) = 102;<br \/>\n*(char*)(ptr + 6) = 103;<br \/>\nSpan&lt;char&gt; span = new Span&lt;char&gt;(ptr, 7);<\/p>\n<p>But now, thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/71261\">dotnet\/roslyn#71261<\/a>, that last example will also be unified with the same approach for the other constructions, resulting in code more like this:<\/p>\n<p>char* ptr = stackalloc char[7];<br \/>\nUnsafe.CopyBlockUnaligned(ptr, &amp;&lt;PrivateImplementationDetails&gt;.FD43C34A357FF620C00C04D0247059F8628CBB3DB349DF05DFA15EF6C7AC514C2, 14);<br \/>\nSpan&lt;char&gt; span = new Span&lt;char&gt;(ptr, 7);<\/p>\n<p>(the compiler will actually generate a cpblk IL instruction rather than a call to Unsafe.CopyBlockUnaligned).<\/p>\n<p>The C# compiler has also improved its ability to avoid allocations when creating ReadOnlySpan&lt;T&gt; from some expressed array constructions or collection expressions. One of the really nice optimizations the C# compiler added several years back was the ability to recognize when a new byte\/sbyte\/bool array was being constructed, filled with only constants, and directly assigned to a ReadOnlySpan&lt;T&gt;. 
In such a case, it would recognize that the data was all blittable and could never be modified, so rather than allocating an array and wrapping a span around it, it would blit the data into the assembly and then just construct a span around a pointer into the assembly data with the appropriate length. So this:<\/p>\n<p>ReadOnlySpan&lt;byte&gt; Values =&gt; new[] { (byte)0, (byte)1, (byte)2 };<\/p>\n<p>got lowered into something more like this:<\/p>\n<p>ReadOnlySpan&lt;byte&gt; Values =&gt; new ReadOnlySpan&lt;byte&gt;(<br \/>\n    &amp;&lt;PrivateImplementationDetails&gt;.AE4B3280E56E2FAF83F414A6E3DABE9D5FBE18976544C05FED121ACCB85B53FC,<br \/>\n    3);<\/p>\n<p>The optimization at the time was limited to only single-byte primitive types because of endianness concerns, but .NET 7 added a RuntimeHelpers.CreateSpan method which handled such endianness concerns, so the optimization was then expanded to all such primitive types regardless of size. So this:<\/p>\n<p>ReadOnlySpan&lt;char&gt; Values1 =&gt; new[] { &#8216;a&#8217;, &#8216;b&#8217;, &#8216;c&#8217; };<br \/>\nReadOnlySpan&lt;int&gt; Values2 =&gt; new[] { 1, 2, 3 };<br \/>\nReadOnlySpan&lt;long&gt; Values3 =&gt; new[] { 1L, 2, 3 };<br \/>\nReadOnlySpan&lt;DayOfWeek&gt; Values4 =&gt; new[] { DayOfWeek.Monday, DayOfWeek.Friday };<\/p>\n<p>gets lowered into something more like this:<\/p>\n<p>ReadOnlySpan&lt;char&gt; Values1 =&gt; RuntimeHelpers.CreateSpan&lt;char&gt;(<br \/>\n    (RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF2);<\/p>\n<p>ReadOnlySpan&lt;int&gt; Values2 =&gt; RuntimeHelpers.CreateSpan&lt;int&gt;(<br \/>\n    (RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D4);<\/p>\n<p>ReadOnlySpan&lt;long&gt; Values3 =&gt; RuntimeHelpers.CreateSpan&lt;long&gt;(<br \/>\n    (RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.E2E2033AE7E19D680599D4EB0A1359A2B48EC5BAAC75066C317FBF85159C54EF8);<\/p>\n<p>ReadOnlySpan&lt;DayOfWeek&gt; Values4 =&gt; RuntimeHelpers.CreateSpan&lt;DayOfWeek&gt;(<br \/>\n    (RuntimeFieldHandle)&amp;&lt;PrivateImplementationDetails&gt;.ECA75F8497701D6223817CDE38BF42CDD1124E01EF6B705BCFE9A584F7B42F0F4);<\/p>\n<p>Lovely. But\u2026 what about types that are supported as constants at the C# level but that aren\u2019t blittable in this fashion? That includes nint and nuint (which vary in size based on the bitness of the process), decimal (for which a constant is actually represented in metadata via a [DecimalConstant(&#8230;)] attribute), and string (which is a reference type). In those cases, even though we\u2019re still using constants and targeting something that can\u2019t be mutated, we still get the array allocation:<\/p>\n<p>ReadOnlySpan&lt;nint&gt; Values1 =&gt; new nint[] { 1, 2, 3 };<br \/>\nReadOnlySpan&lt;nuint&gt; Values2 =&gt; new nuint[] { 1, 2, 3 };<br \/>\nReadOnlySpan&lt;decimal&gt; Values3 =&gt; new[] { 1m, 2m, 3m };<br \/>\nReadOnlySpan&lt;string&gt; Values4 =&gt; new[] { &#8220;a&#8221;, &#8220;b&#8221;, &#8220;c&#8221; };<\/p>\n<p>which are lowered to, well, themselves, such that there\u2019s still an allocation. Or, at least there was. Thanks to <a href=\"https:\/\/github.com\/dotnet\/roslyn\/pull\/69820\">dotnet\/roslyn#69820<\/a>, these cases are now handled as well. They\u2019re addressed by lazily allocating an array that\u2019s then cached for all subsequent use.
So now, that same example gets lowered into the equivalent of something more like this:<\/p>\n<p>ReadOnlySpan&lt;nint&gt; Values1 =&gt;<br \/>\n    &lt;PrivateImplementationDetails&gt;.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B8 ??=<br \/>\n    new nint[] { 1, 2, 3 };<\/p>\n<p>ReadOnlySpan&lt;nuint&gt; Values2 =&gt;<br \/>\n    &lt;PrivateImplementationDetails&gt;.4636993D3E1DA4E9D6B8F87B79E8F7C6D018580D52661950EABC3845C5897A4D_B16 ??=<br \/>\n    new nuint[] { 1, 2, 3 };<\/p>\n<p>ReadOnlySpan&lt;decimal&gt; Values3 =&gt;<br \/>\n    &lt;PrivateImplementationDetails&gt;.04B64E80BCEFE521678C4D6565B6EEBCE2791130A600CCB5D23E1B5538155110_B18 ??=<br \/>\n    new[] { 1m, 2m, 3m };<\/p>\n<p>ReadOnlySpan&lt;string&gt; Values4 =&gt;<br \/>\n    &lt;PrivateImplementationDetails&gt;.13E228567E8249FCE53337F25D7970DE3BD68AB2653424C7B8F9FD05E33CAEDF_B11 ??=<br \/>\n    new[] { &#8220;a&#8221;, &#8220;b&#8221;, &#8220;c&#8221; };<\/p>\n<p>There are, of course, many more span-related improvements in the libraries, too. One improvement for an existing span-related method is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103728\">dotnet\/runtime#103728<\/a>, which further optimizes MemoryExtensions.Count used to count the number of occurrences of an element in a span. The implementation is vectorized, processing a vector\u2019s worth of data at a time, e.g. if 256-bit vectors are hardware accelerated, and it\u2019s searching chars, it\u2019ll process 16 chars at a time (16 chars * 2 bytes per char * 8 bits per byte == 256). What happens if the number of elements isn\u2019t an even multiple of 16? Then we\u2019re left with some remaining elements after processing the last full vector. Previously the implementation would fall back to processing those remaining elements one at a time; now, it\u2019ll process one last vector at the end of the input.
Doing so means we\u2019ll end up re-examining one or more elements we already examined, but that doesn\u2019t really matter, as we can examine all of the elements in approximately the same number of instructions as processing just a single element.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Buffers;<br \/>\nusing System.Numerics;<br \/>\nusing System.Runtime.CompilerServices;<br \/>\nusing System.Runtime.InteropServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private char[][] _values = new char[10_000][];<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        var rng = new Random(42);<br \/>\n        for (int i = 0; i &lt; _values.Length; i++)<br \/>\n        {<br \/>\n            _values[i] = new char[rng.Next(0, 128)];<br \/>\n            rng.NextBytes(MemoryMarshal.AsBytes(_values[i].AsSpan()));<br \/>\n        }<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        foreach (char[] numbers in _values)<br \/>\n        {<br \/>\n            count += numbers.AsSpan().Count(&#8216;a&#8217;);<br \/>\n        }<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n133.25 us<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n74.30 us<br \/>\n0.56<\/p>\n<p>New span-related functionality also shows up in .NET 9.
String splitting is an operation that\u2019s used all over the place; a search for \u201c.Split(\u201d in C# code on GitHub yields millions of hits, and data from a variety of sources suggests that just the simplest overload Split(params char[]? separator) is used by upwards of 90% of applications and 20% of nuget packages. So it should come as no surprise that a request to have this functionality for spans is very popular.<\/p>\n\n<p>The devil is in the details, of course, and it\u2019s taken a long time to figure out exactly how it should be exposed. There are largely two different use cases for splitting we see in the wild. One case is where the content being split has an expected or max number of segments, and splitting is used to extract them. For example, FileVersionInfo needs to be able to take a version string and parse from it up to 4 components separated by periods. .NET 8 introduced new Split extension methods on MemoryExtensions to address this use case, by having Split take a destination Span&lt;Range&gt; to write the bounds of each segment into. That, however, still leaves the second major category of usage, which is for iterating through an unbounded number of segments. A representative example there is this snippet from HttpListener\u2018s web sockets implementation:<\/p>\n<p>string[] requestProtocols = clientSecWebSocketProtocol.Split(&#8216;,&#8217;, StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);<br \/>\nfor (int i = 0; i &lt; requestProtocols.Length; i++)<br \/>\n{<br \/>\n    if (string.Equals(acceptProtocol, requestProtocols[i], StringComparison.OrdinalIgnoreCase))<br \/>\n    {<br \/>\n        return true;<br \/>\n    }<br \/>\n}<\/p>\n<p>The clientSecWebSocketProtocol string is composed of comma-separated values, and this is iterating through them to see if any is equal to the target acceptProtocol. It\u2019s doing that, though, with a relatively expensive operation. 
That Split call needs to allocate the string[] that\u2019s returned and that holds all the constituent strings, and then each segment results in a string being allocated. We can do better, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104534\">dotnet\/runtime#104534<\/a> from <a href=\"https:\/\/github.com\/bbartels\">@bbartels<\/a> enables that. It adds four new overloads of MemoryExtensions.Split and MemoryExtensions.SplitAny:<\/p>\n<p>public static SpanSplitEnumerator&lt;T&gt; Split&lt;T&gt;(this ReadOnlySpan&lt;T&gt; source, T separator) where T : IEquatable&lt;T&gt;;<br \/>\npublic static SpanSplitEnumerator&lt;T&gt; Split&lt;T&gt;(this ReadOnlySpan&lt;T&gt; source, ReadOnlySpan&lt;T&gt; separator) where T : IEquatable&lt;T&gt;;<br \/>\npublic static SpanSplitEnumerator&lt;T&gt; SplitAny&lt;T&gt;(this ReadOnlySpan&lt;T&gt; source, params ReadOnlySpan&lt;T&gt; separators) where T : IEquatable&lt;T&gt;;<br \/>\npublic static SpanSplitEnumerator&lt;T&gt; SplitAny&lt;T&gt;(this ReadOnlySpan&lt;T&gt; source, SearchValues&lt;T&gt; separators) where T : IEquatable&lt;T&gt;;<\/p>\n<p>With that, this same operation can be written as:<\/p>\n<p>foreach (Range r in clientSecWebSocketProtocol.AsSpan().Split(&#8216;,&#8217;))<br \/>\n{<br \/>\n    if (clientSecWebSocketProtocol.AsSpan(r).Trim().Equals(acceptProtocol, StringComparison.OrdinalIgnoreCase))<br \/>\n    {<br \/>\n        return true;<br \/>\n    }<br \/>\n}<\/p>\n<p>In doing so, it becomes allocation-free, as this Split doesn\u2019t need to allocate a string[] to hold results and doesn\u2019t need to allocate a string for each segment: instead, it\u2019s returning a ref struct enumerator that yields a Range representing each segment. The caller can then use that Range to slice the input. It\u2019s yielding a Range rather than, say, a ReadOnlySpan&lt;T&gt;, to enable the splitting to be used with original sources other than spans and be able to get the segments in the original form. 
For example, if I had a ReadOnlyMemory&lt;T&gt; and wanted to add segments from it into a list, I could do:<\/p>\n<p>ReadOnlyMemory&lt;T&gt; source = &#8230;;<br \/>\nList&lt;ReadOnlyMemory&lt;T&gt;&gt; list = &#8230;;<br \/>\nforeach (Range r in source.Span.Split(separator))<br \/>\n{<br \/>\n    list.Add(source[r]);<br \/>\n}<\/p>\n<p>whereas that wouldn\u2019t be possible if Split forced all yielded results to be spans.<\/p>\n<p>You might notice that there\u2019s no StringSplitOptions on these overloads. That\u2019s because it\u2019s both not applicable and not necessary. It\u2019s not applicable because we\u2019re working here with T, which might be something other than char, but an option like StringSplitOptions.TrimEntries implies a notion of whitespace, and that\u2019s only relevant for text. And it\u2019s not necessary, because the main benefit of StringSplitOptions, both TrimEntries and RemoveEmptyEntries, is reducing allocation overheads. If these options didn\u2019t exist with the string overloads, and you wanted to simulate them with our original example (and spans didn\u2019t exist), it would end up looking like this:<\/p>\n<p>string[] requestProtocols = clientSecWebSocketProtocol.Split(&#8216;,&#8217;);<br \/>\nfor (int i = 0; i &lt; requestProtocols.Length; i++)<br \/>\n{<br \/>\n    if (string.Equals(acceptProtocol, requestProtocols[i].Trim(), StringComparison.OrdinalIgnoreCase))<br \/>\n    {<br \/>\n        return true;<br \/>\n    }<br \/>\n}<\/p>\n<p>There are several possible performance problems here. Imagine the clientSecWebSocketProtocol input was &#8220;a , b, , , , , , c&#8221;. There are only three entries we care about here (&#8220;a&#8221;, &#8220;b&#8221;, and &#8220;c&#8221;), but the returned array is going to be a string[8] instead of a string[3], because it\u2019s going to have entries for each of those whitespace-only segments. That\u2019s a larger allocation than is necessary.
Then, we\u2019ll be producing strings for all eight of those segments, even though only three of the strings were necessary. And, all of &#8220;a &#8221;, &#8220; b&#8221;, and &#8220; c&#8221; have some extraneous whitespace that needs to be trimmed, such that the following Trim() call will allocate a new string for each. The StringSplitOptions values enable the implementation of Split to avoid all of that overhead, by allocating only what\u2019s desired. But with the span version, none of that allocation exists anyway. The consuming loop can trim the spans itself without incurring more overhead than would the Split implementation, and the consuming loop can choose to ignore empty entries without increasing the size of a string[] allocation.<\/p>\n<p>The net result is that such operations can be significantly more efficient while not sacrificing much if anything in the way of maintainability.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 --filter "*"<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string _input = &#8220;a , b, , , , , , c&#8221;;<br \/>\n    private string _target = &#8220;d&#8221;;<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public bool ContainsString()<br \/>\n    {<br \/>\n        foreach (string item in _input.Split(&#8216;,&#8217;, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))<br \/>\n        {<br \/>\n            if (item.Equals(_target, StringComparison.OrdinalIgnoreCase))<br \/>\n            {<br \/>\n                return true;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public bool
ContainsSpan()<br \/>\n    {<br \/>\n        foreach (Range r in _input.AsSpan().Split(&#8216;,&#8217;))<br \/>\n        {<br \/>\n            if (_input.AsSpan(r).Trim().Equals(_target, StringComparison.OrdinalIgnoreCase))<br \/>\n            {<br \/>\n                return true;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return false;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>ContainsString<br \/>\n127.26 ns<br \/>\n1.00<br \/>\n208 B<br \/>\n1.00<\/p>\n<p>ContainsSpan<br \/>\n61.89 ns<br \/>\n0.49<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>The nature of this new set of splitting APIs is that they find just the next separator \/ segment; that\u2019s both practical and possibly a performance improvement by itself. It\u2019s practical because we\u2019re only yielding a single segment at a time, and we don\u2019t have anywhere to store all possible found separator positions (nor do we want to allocate space to do so). And it\u2019s desirable because the consumer may early-exit from the consuming loop, in which case we don\u2019t want to have spent time unnecessarily searching for additional segments that are going to be ignored. The existing set of splitting APIs, however, hand back all found segments in one go, either via a returned string[] or via ranges being written to a destination span. And as such, it makes more sense for those overloads to find all separators at once because that operation can be vectorized. In fact, previous versions have done so. But, that vectorization has only benefited from 128-bit vectors. 
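<p>As an aside, the SearchValues-based SplitAny overload shown earlier composes nicely with this enumerator pattern when there are multiple possible separators. Here\u2019s a small, hypothetical tokenizer (my own example, not code from the PRs; it assumes the .NET 9 APIs above):<\/p>

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Usage: split on any of ',', ';', or '|', trimming and skipping empties.
Console.WriteLine(string.Join('+', Tokenizer.Tokenize("a, b;; c|d"))); // prints "a+b+c+d"

static class Tokenizer
{
    // SearchValues precomputes an optimized lookup for the separator set once,
    // up front; SplitAny then reuses it on every call.
    private static readonly SearchValues<char> s_separators = SearchValues.Create(",;|");

    public static List<string> Tokenize(ReadOnlySpan<char> input)
    {
        var tokens = new List<string>();
        foreach (Range r in input.SplitAny(s_separators))
        {
            // Each Range is sliced out of the original input; only the segments
            // we actually keep get materialized as strings.
            ReadOnlySpan<char> token = input[r].Trim();
            if (!token.IsEmpty)
            {
                tokens.Add(token.ToString());
            }
        }
        return tokens;
    }
}
```

<p>Because the enumerator yields one Range at a time, a caller that early-exits never pays to find separators it would have ignored.<\/p>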
With <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93043\">dotnet\/runtime#93043<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a> in .NET 9, that vectorization will now light-up with 512-bit or 256-bit vectors if they\u2019re available, enabling that separator searching that happens as part of splitting to run up to four times faster.<\/p>\n<p>Spans show up in other new methods as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93938\">dotnet\/runtime#93938<\/a> from <a href=\"https:\/\/github.com\/TheMaximum\">@TheMaximum<\/a> added new overloads of StringBuilder.Replace that accept ReadOnlySpan&lt;char&gt; instead of string. As is the case with most such overloads, they share the same implementation, with the string-based overloads just creating a span from the string and using a span-based implementation. In practice, the majority of use of StringBuilder.Replace uses constant strings as arguments, for example to escape some known delimiter (Replace(&#8220;$&#8221;, &#8220;\\$&#8221;)), or use previously-created string instances, such as to remove some substring from text (Replace(substring, &#8220;&#8221;)). But, there are a minority of cases where Replace is used with something that\u2019s created on the spot, and for that, these new overloads can help to avoid allocation for creating the arguments. For example, here\u2019s some escaping code used today by MSBuild:<\/p>\n<p>char[] charsToEscape = &#8230;;<br \/>\nStringBuilder escapedString = &#8230;;<br \/>\nforeach (char unescapedChar in charsToEscape)<br \/>\n{<br \/>\n    string escapedCharacterCode = string.Format(CultureInfo.InvariantCulture, &#8220;%{0:x00}&#8221;, (int)unescapedChar);<br \/>\n    escapedString.Replace(unescapedChar.ToString(CultureInfo.InvariantCulture), escapedCharacterCode);<br \/>\n}<\/p>\n<p>This is having to perform two string allocations to create the input to this Replace, which is going to be invoked for each char in charsToEscape. 
If charsToEscape is something fixed, it could be better to avoid these formatting operations per iteration, and instead just cache the necessary strings for all uses, e.g.<\/p>\n<p>private static readonly char[] charsToEscape = &#8230;;<br \/>\nprivate static readonly string[] escapedCharsToEscape = charsToEscape.Select(c =&gt; $&#8220;%{(uint)c:x00}&#8221;).ToArray();<br \/>\nprivate static readonly string[] stringsToEscape = charsToEscape.Select(c =&gt; c.ToString()).ToArray();<br \/>\n&#8230;<br \/>\nfor (int i = 0; i &lt; charsToEscape.Length; i++)<br \/>\n{<br \/>\n    escapedString.Replace(stringsToEscape[i], escapedCharsToEscape[i]);<br \/>\n}<\/p>\n<p>but if charsToEscape isn\u2019t predictable, then we can at least avoid the allocation by employing the new overloads, e.g.<\/p>\n<p>char[] charsToEscape = &#8230;;<br \/>\nStringBuilder escapedString = &#8230;;<br \/>\nSpan&lt;char&gt; escapedSpan = stackalloc char[5];<br \/>\nforeach (char unescapedChar in charsToEscape)<br \/>\n{<br \/>\n    escapedSpan.TryWrite($&#8220;%{(uint)unescapedChar:x00}&#8221;, out int charsWritten);<br \/>\n    escapedString.Replace(new ReadOnlySpan&lt;char&gt;(in unescapedChar), escapedSpan.Slice(0, charsWritten));<br \/>\n}<\/p>\n<p>and, boom, no more allocation for the arguments.<\/p>\n<p>A variety of other improvements were made to string manipulation, mainly around better employing vectorization. StringComparison.OrdinalIgnoreCase operations were previously vectorized, but only with 128-bit vectors, which means handling up to 8 chars at a time.
Thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93116\">dotnet\/runtime#93116<\/a>, those code paths have been updated to support 256-bit and 512-bit vectors, which means handling up to 16 or 32 chars at a time on hardware accelerated to support them.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static string s_s1 = &#8220;&#8221;&#8221;<br \/>\n        Let me not to the marriage of true minds<br \/>\n        Admit impediments; love is not love<br \/>\n        Which alters when it alteration finds,<br \/>\n        Or bends with the remover to remove.<br \/>\n        O no, it is an ever-fixed mark<br \/>\n        That looks on tempests and is never shaken;<br \/>\n        It is the star to every wand&#8217;ring bark<br \/>\n        Whose worth&#8217;s unknown, although his height be taken.<br \/>\n        Love&#8217;s not time&#8217;s fool, though rosy lips and cheeks<br \/>\n        Within his bending sickle&#8217;s compass come.<br \/>\n        Love alters not with his brief hours and weeks,<br \/>\n        But bears it out even to the edge of doom:<br \/>\n        If this be error and upon me proved,<br \/>\n        I never writ, nor no man ever loved.<br \/>\n        &#8220;&#8221;&#8221;;<br \/>\n    private static string s_s2 = s_s1[0..^1] + &#8220;!&#8221;;<\/p>\n<p>    [Benchmark]<br \/>\n    public bool EqualsIgnoreCase() =&gt; s_s1.Equals(s_s2, StringComparison.OrdinalIgnoreCase);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>EqualsIgnoreCase<br \/>\n.NET 8.0<br \/>\n86.79 ns<br
\/>\n1.00<\/p>\n<p>EqualsIgnoreCase<br \/>\n.NET 9.0<br \/>\n20.97 ns<br \/>\n0.24<\/p>\n<p>EndsWith also gets better, for both strings and spans. Previous releases saw StartsWith become a JIT intrinsic, enabling the JIT to generate dedicated SIMD code for StartsWith in the case where it\u2019s passed a constant. Now with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98593\">dotnet\/runtime#98593<\/a>, the same thing is done for EndsWith.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 --filter "*" --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser(maxDepth: 0)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&#8220;helloworld.txt&#8221;)]<br \/>\n    public bool EndsWith(string path) =&gt; path.EndsWith(&#8220;.txt&#8221;, StringComparison.OrdinalIgnoreCase);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\npath<br \/>\nMean<br \/>\nRatio<br \/>\nCode Size<\/p>\n<p>EndsWith<br \/>\n.NET 8.0<br \/>\nhelloworld.txt<br \/>\n3.5006 ns<br \/>\n1.00<br \/>\n26 B<\/p>\n<p>EndsWith<br \/>\n.NET 9.0<br \/>\nhelloworld.txt<br \/>\n0.6653 ns<br \/>\n0.19<br \/>\n61 B<\/p>\n<p>More interesting to me than these nice gains is the code that was generated to achieve them.
This is what the assembly for this benchmark looked like with .NET 8:<\/p>\n<p>; Tests.EndsWith(System.String)<br \/>\n       mov       rdi,rsi<br \/>\n       mov       rsi,7EE3C2D25E38<br \/>\n       mov       edx,5<br \/>\n       cmp       [rdi],edi<br \/>\n       jmp       qword ptr [7F24678663A0]; System.String.EndsWith(System.String, System.StringComparison)<br \/>\n; Total bytes of code 26<\/p>\n<p>Pretty straightforward, a bit of argument manipulation and then jumping to the actual string.EndsWith implementation. Now here\u2019s .NET 9:<\/p>\n<p>; Tests.EndsWith(System.String)<br \/>\n       push      rbp<br \/>\n       mov       rbp,rsp<br \/>\n       mov       eax,[rsi+8]<br \/>\n       cmp       eax,4<br \/>\n       jge       short M00_L00<br \/>\n       xor       ecx,ecx<br \/>\n       jmp       short M00_L01<br \/>\nM00_L00:<br \/>\n       mov       ecx,eax<br \/>\n       lea       rax,[rsi+rcx*2-8]<br \/>\n       mov       rcx,20002000200000<br \/>\n       or        rcx,[rax+0C]<br \/>\n       mov       rax,7400780074002E<br \/>\n       cmp       rcx,rax<br \/>\n       sete      cl<br \/>\n       movzx     ecx,cl<br \/>\nM00_L01:<br \/>\n       movzx     eax,cl<br \/>\n       pop       rbp<br \/>\n       ret<br \/>\n; Total bytes of code 61<\/p>\n<p>Notice there\u2019s no call to string.EndsWith in sight. That\u2019s because the JIT has implemented the EndsWith functionality here, specific to &#8220;.txt&#8221; and OrdinalIgnoreCase, in just a few instructions. The address of the string is being passed into this method in the rsi register, and the second mov instruction is grabbing its Length (which is stored 8 bytes from the start of the string object) and storing that into the eax register. It\u2019s then checking whether the string is at least 4 characters long; if it\u2019s not, it can\u2019t possibly end with &#8220;.txt&#8221; and thus it jumps to the end to return false. 
If the string is at least 4 characters long, it then proceeds to load the last four characters of the string as a 64-bit value into rcx and OR it with the value 20002000200000. Why? It\u2019s playing the same ASCII trick we\u2019ve seen before. The &#8216;.&#8217; is not subject to casing, so we don\u2019t need to manipulate its value, and hence the 16 bits that align with the &#8216;.&#8217; are 0. But the other three characters all need to be comparable with both their lower-case and upper-case forms, so this is OR\u2019ing each of the three 16-bit characters with 0x0020 to produce the lower-case form. At that point, the 64-bit value can be compared against the 64-bit representation of &#8220;.txt&#8221;, which is 7400780074002E (the ASCII value for &#8216;.&#8217; is 0x2E, for &#8216;t&#8217; is 0x74, and for &#8216;x&#8217; is 0x78). Then it\u2019s just a simple matter of whether that compared equally or not.<\/p>\n<p>Finally, we\u2019ve not talked much about arrays separate from spans, but there have been improvements there as well. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102739\">dotnet\/runtime#102739<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104103\">dotnet\/runtime#104103<\/a> move more logic for array handling from native code in the runtime up into C# code in CoreLib. For example, Array.Copy has to handle a wide array of cases (pun intended), some of which can be implemented very efficiently and some of which are more laborious, and it tries to optimize the \u201csimple\u201d cases, such as whether the bits from one array can simply be memcpy\u2019d over to the other, with as little overhead as possible. Some of those cases are easy to determine, such as single-dimensional arrays having the exact same type, but other cases require more introspection, such as if one array is enums and the other array is of the underlying type of that enum.
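<p>That enum case is easy to see in action. Here\u2019s a tiny sketch (my own example, not from the PRs): DayOfWeek\u2019s underlying type is int, so once Array.Copy determines that, the copy can proceed as a simple bit-for-bit copy.<\/p>

```csharp
using System;

// DayOfWeek's underlying type is int, so Array.Copy can determine that the
// source and destination elements are bitwise-compatible and copy raw values.
DayOfWeek[] days = { DayOfWeek.Sunday, DayOfWeek.Monday, DayOfWeek.Saturday };
int[] ints = new int[days.Length];
Array.Copy(days, ints, days.Length);

Console.WriteLine(string.Join(',', ints)); // prints "0,1,6"
```

<p>Making that determination cheaply is exactly the kind of check the PRs above moved into managed code.<\/p>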
The checks to make those determinations previously lived in native code in the runtime, but as of these PRs they\u2019re now implemented in C#, and in doing so, some of the overhead associated with the checks has been removed.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private DayOfWeek[] _enums = Enum.GetValues&lt;DayOfWeek&gt;();<br \/>\n    private int[] _ints = new int[7];<\/p>\n<p>    [Benchmark]<br \/>\n    public void Copy() =&gt; Array.Copy(_enums, _ints, _enums.Length);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Copy<br \/>\n.NET 8.0<br \/>\n16.25 ns<br \/>\n1.00<\/p>\n<p>Copy<br \/>\n.NET 9.0<br \/>\n11.05 ns<br \/>\n0.68<\/p>\n<p>In addition to other benefits that come from moving such logic into managed code (better maintainability, more implicitly safe code, reduced overhead from transitioning between managed and native, etc.), there\u2019s another less obvious benefit: impact on GC pause times. And that\u2019s nowhere more obvious than with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98623\">dotnet\/runtime#98623<\/a>, which moved the implementations of memset\/memcpy helpers used for core operations like Span&lt;T&gt;.Fill and Array.Copy from native to managed. 
Consider this C# console app:<\/p>\n<p>using System.Diagnostics;<\/p>\n<p>new Thread(() =&gt;<br \/>\n{<br \/>\n    var a = new object[1000];<br \/>\n    while (true) a.AsSpan().Fill(a);<br \/>\n})<br \/>\n{ IsBackground = true }.Start();<\/p>\n<p>var sw = new Stopwatch();<br \/>\nwhile (true)<br \/>\n{<br \/>\n    sw.Restart();<br \/>\n    for (int i = 0; i &lt; 10; i++)<br \/>\n    {<br \/>\n        GC.Collect();<br \/>\n        Thread.Sleep(15);<br \/>\n    }<br \/>\n    Console.WriteLine(sw.Elapsed.TotalSeconds);<br \/>\n}<\/p>\n<p>This is sitting in a loop that simply times how long it takes to perform 10 gen2 collections, each spaced out by ~15 ms. If each collection were free, then this loop should take ~150 ms. Since it\u2019s not free, let\u2019s round up and estimate that the loop should be around ~200 ms. Before we run the loop, though, we launch a thread that just sits in an infinite loop filling a span. That shouldn\u2019t mess with our timing loop\u2026 or should it? When I run this on .NET 8, I get values like this:<\/p>\n<p>1.0683524<br \/>\n0.8884759<br \/>\n0.8420748<br \/>\n1.1101804<br \/>\n1.2730635<\/p>\n<p>Those values are in seconds, and that\u2019s approximately 5x larger than we\u2019d predicted. Now I try on .NET 9, and I get results like this:<\/p>\n<p>0.1638237<br \/>\n0.2129748<br \/>\n0.2859566<br \/>\n0.3020449<br \/>\n0.2871952<\/p>\n<p>What happened? In order to do some of its work, the GC needs to be able to get a consistent view of the world, which is violated if things are concurrently changing out from under it. As such, it may need to temporarily suspend all threads in the process, but to do that, it needs to wait for each thread to get to a safe point, and if a thread is executing code in the runtime, that can be hard to do. 
In this particular case, there\u2019s a thread spending almost all of its time sitting in a call to Span&lt;T&gt;.Fill, aka memset, which was implemented in .NET 8 as a native function in the runtime; this couldn\u2019t be interrupted, and the GC would need to wait until the call returned and it could catch it before it could interrupt that thread. In .NET 9, these implementations are all in managed code, and the GC can trivially get the threads to a safe point.<\/p>\n<h2>Collections<\/h2>\n<h3>LINQ<\/h3>\n<p>Language Integrated Query, or LINQ, is a mainstay of .NET. At its heart, LINQ is a specification for hundreds of overloads of methods that manipulate data, and then implementations of that specification for different types. One of the most prominent implementations comes from System.Linq.Enumerable, sometimes referred to as \u201cLINQ to Objects,\u201d which provides an implementation of these operations as methods for working with IEnumerable&lt;T&gt;. It\u2019s an incredibly useful set of operations, used ubiquitously, and thus it\u2019s a common target for performance optimization. In many .NET releases, it\u2019ll get a new additional method here or an optimized method there, a trickle of focused improvements. But in .NET 9, it\u2019s received a huge amount of attention, with some improvements localized to particular methods and others broadly applicable across much of the surface area.<\/p>\n<p>One of the more sweeping LINQ changes in .NET 9 has to do with how various optimizations are implemented. In the original implementation of LINQ circa 2007, almost every method was logically independent from every other. A method like SelectMany took in an IEnumerable&lt;TSource&gt; and didn\u2019t know anything about where that input came from; every enumerable was processed the same as every other. 
Some methods would special-case more optimizable data types, though: for example, ToArray would check whether the incoming IEnumerable&lt;TSource&gt; implemented ICollection&lt;TSource&gt;, preferring, if it did, to use the collection\u2019s Count and CopyTo in order to avoid having to MoveNext\/Current through the whole input. But a couple of methods, in particular some overloads of Select and Where, did something more interesting. Much of LINQ was implemented using the C# compiler\u2019s support for iterators, where a method that returns IEnumerable&lt;T&gt; can use yield return t; to produce instances of T, and the compiler handles rewriting that method into a class that implements IEnumerable&lt;T&gt; and handles all the gnarly state-machine details for you. These few Select and Where overloads, however, didn\u2019t use iterators, with the developer who authored them instead preferring to write a custom enumerable by hand. Why? It\u2019s possible to hand-author an ever-so-slightly more efficient implementation in some cases, but the compiler is actually really good at doing it well, so that\u2019s not the reason. The reason is that it (a) gives the type a name that can be referred to elsewhere in the code, and (b) allows that type to expose state that other code can interrogate. This enables information to flow from one query operator to the next. 
So, for example, Where could return a WhereEnumerableIterator instance:<\/p>\n<p>class WhereEnumerableIterator&lt;TSource&gt; : Iterator&lt;TSource&gt;<br \/>\n{<br \/>\n    IEnumerable&lt;TSource&gt; source;<br \/>\n    Func&lt;TSource, bool&gt; predicate;<br \/>\n    &#8230;<br \/>\n}<\/p>\n<p>And then Select can look for that type, or, rather, its base type, Iterator&lt;TSource&gt;:<\/p>\n<p>public static IEnumerable&lt;TResult&gt; Select&lt;TSource, TResult&gt;(this IEnumerable&lt;TSource&gt; source, Func&lt;TSource, TResult&gt; selector) {<br \/>\n    if (source == null) throw Error.ArgumentNull(&#8220;source&#8221;);<br \/>\n    if (selector == null) throw Error.ArgumentNull(&#8220;selector&#8221;);<br \/>\n    if (source is Iterator&lt;TSource&gt;) return ((Iterator&lt;TSource&gt;)source).Select(selector);<br \/>\n    &#8230;<br \/>\n}<\/p>\n<p>and that WhereEnumerableIterator&lt;TSource&gt; can override that virtual Select method on Iterator&lt;TSource&gt; to specialize what happens when a Where is followed by Select:<\/p>\n<p>public override IEnumerable&lt;TResult&gt; Select&lt;TResult&gt;(Func&lt;TSource, TResult&gt; selector) {<br \/>\n    return new WhereSelectEnumerableIterator&lt;TSource, TResult&gt;(source, predicate, selector);<br \/>\n}<\/p>\n<p>This is useful because it allows for avoiding one of the major sources of overhead with enumerables. Without this optimization, if I had source.Where(x =&gt; true).Select(x =&gt; x), the resulting enumerable would be for the Select, which would in turn wrap the enumerable for the Where, which would in turn wrap the source enumerable. That means that when I call MoveNext on the select iterator, it in turn needs to call MoveNext on the Where, which will in turn call MoveNext on the source, and then the same for Current. That means for each element in the source, we end up making 6 interface calls. With the cited optimization, we no longer have separate iterators for the Select and Where. 
Those end up being combined into a single iterator that does the work of both, eliminating one level from the call chain, so instead of 6 interface calls per element, there are only 4. (See <a href=\"https:\/\/www.youtube.com\/watch?v=xKr96nIyCFM\">Deep Dive on LINQ<\/a> and <a href=\"https:\/\/www.youtube.com\/watch?v=W4-NVVNwCWs\">An even DEEPER Dive into LINQ<\/a> for a more in-depth exploration of how exactly this works.)<\/p>\n<p>Over the last decade with .NET, those optimizations have been significantly extended, and in some cases to much greater benefit than just saving a few interface calls. For example, in a previous .NET release, a similar mechanism was used to special-case OrderBy followed by First. Without special-casing, the OrderBy would need to do both a full copy of the input source and an O(N log N) sort of the data, all as part of the first call to MoveNext from First. But with the optimization, First is able to see that its source is that OrderBy, in which case it doesn\u2019t need a copy or sort at all, and can instead simply do an O(N) search of OrderBy\u2019s source for its minimum value. That difference can yield monstrous wins.<\/p>\n<p>This additional special-casing was achieved with internal interfaces in the library. An IIListProvider&lt;TElement&gt; provided ToArray, ToList, and GetCount methods, and an IPartition&lt;TElement&gt; interface (which inherited IIListProvider&lt;TElement&gt;) provided additional methods like Skip, Take, and TryGetFirst. Custom iterators used to back various LINQ methods could then also implement one or more of these interfaces to specialize being followed by a call like ToArray or Count(). For example, it\u2019s very common (e.g. 
as part of \u201cpaging\u201d) to see call sequences like .Skip(&#8230;).Take(&#8230;); with these optimizations, those two operations can be consolidated down into a single iterator, and if it were followed by an operation like Last() or ToList(), those could see through both operators to the source in order to possibly optimize based on it (e.g. if the source were an array, Last() could calculate the exact element to return without needing to do any iteration at all).<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98969\">dotnet\/runtime#98969<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99344\">dotnet\/runtime#99344<\/a> remove those internal interfaces and consolidate all of their members down to the base Iterator&lt;TSource&gt; type. This has a variety of benefits. Not directly related to performance, it simplifies the code base, making it easier to maintain (and easier to maintain code is also generally easier to optimize); the interface members of IPartition&lt;TElement&gt; became virtual methods on the base class, which also resulted in some code reduction due to being able to share the same default implementation (though with the introduction of default interface methods a few releases ago, this <em>could<\/em> have been done separately without this consolidation). On the performance front, though, there are three main benefits of this PR:<\/p>\n<p>Virtual dispatch is generally a bit cheaper than interface dispatch. All of those interface methods became virtual methods, enabling all call sites to them to be a bit cheaper.<br \/>\nIn various places, type tests were being done for multiple targets, and those could now be consolidated to reduce type checks. 
For example, Select looked something like this:<br \/>\nif (source is Iterator&lt;TSource&gt; iterator)<br \/>\n{<br \/>\n    &#8230;<br \/>\n}<\/p>\n<p>if (source is IPartition&lt;TSource&gt; partition)<br \/>\n{<br \/>\n    &#8230;<br \/>\n}<\/p>\n<p>That means for non-specialized iterators, Select was incurring a type check for Iterator&lt;TSource&gt; and an interface check for IPartition&lt;TSource&gt;. With this change, the latter check is now removed.<\/p>\n<p>Some types were only inheriting from the base class but not implementing any of the interfaces, some were implementing one interface but not the other, and some were even implementing one of the interfaces but not deriving from the base class. The new approach makes it such that all of the provided virtual methods are implemented by any iterator deriving from the base class.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97905\">dotnet\/runtime#97905<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97956\">dotnet\/runtime#97956<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98874\">dotnet\/runtime#98874<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99216\">dotnet\/runtime#99216<\/a> also added more implementations of these virtual methods.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;int&gt; _arrayDistinct = Enumerable.Range(0, 1000).ToArray().Distinct();<br \/>\n    private IEnumerable&lt;int&gt; _appendSelect = Enumerable.Range(0, 1000).ToArray().Append(42).Select(i =&gt; i * 2);<br \/>\n    private IEnumerable&lt;int&gt; _rangeReverse = 
Enumerable.Range(0, 1000).Reverse();<br \/>\n    private IEnumerable&lt;int&gt; _listDefaultIfEmptySelect = Enumerable.Range(0, 1000).ToList().DefaultIfEmpty().Select(i =&gt; i * 2);<br \/>\n    private IEnumerable&lt;int&gt; _listSkipTake = Enumerable.Range(0, 1000).ToList().Skip(500).Take(100);<br \/>\n    private IEnumerable&lt;int&gt; _rangeUnion = Enumerable.Range(0, 1000).Union(Enumerable.Range(500, 1000));<\/p>\n<p>    [Benchmark] public int DistinctFirst() =&gt; _arrayDistinct.First();<br \/>\n    [Benchmark] public int AppendSelectLast() =&gt; _appendSelect.Last();<br \/>\n    [Benchmark] public int RangeReverseCount() =&gt; _rangeReverse.Count();<br \/>\n    [Benchmark] public int DefaultIfEmptySelectElementAt() =&gt; _listDefaultIfEmptySelect.ElementAt(999);<br \/>\n    [Benchmark] public int ListSkipTakeElementAt() =&gt; _listSkipTake.ElementAt(99);<br \/>\n    [Benchmark] public int RangeUnionFirst() =&gt; _rangeUnion.First();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>DistinctFirst<br \/>\n.NET 8.0<br \/>\n49.844 ns<br \/>\n1.00<br \/>\n328 B<br \/>\n1.00<\/p>\n<p>DistinctFirst<br \/>\n.NET 9.0<br \/>\n7.928 ns<br \/>\n0.16<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>AppendSelectLast<br \/>\n.NET 8.0<br \/>\n3,668.347 ns<br \/>\n1.000<br \/>\n144 B<br \/>\n1.00<\/p>\n<p>AppendSelectLast<br \/>\n.NET 9.0<br \/>\n2.222 ns<br \/>\n0.001<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>RangeReverseCount<br \/>\n.NET 8.0<br \/>\n8.703 ns<br \/>\n1.00<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>RangeReverseCount<br \/>\n.NET 9.0<br \/>\n3.465 ns<br \/>\n0.40<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>DefaultIfEmptySelectElementAt<br \/>\n.NET 8.0<br \/>\n2,772.283 ns<br \/>\n1.000<br \/>\n144 B<br \/>\n1.00<\/p>\n<p>DefaultIfEmptySelectElementAt<br \/>\n.NET 9.0<br \/>\n4.399 ns<br \/>\n0.002<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>ListSkipTakeElementAt<br \/>\n.NET 8.0<br \/>\n3.699 ns<br \/>\n1.00<br \/>\n\u2013<br 
\/>\nNA<\/p>\n<p>ListSkipTakeElementAt<br \/>\n.NET 9.0<br \/>\n2.103 ns<br \/>\n0.57<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>RangeUnionFirst<br \/>\n.NET 8.0<br \/>\n53.670 ns<br \/>\n1.00<br \/>\n344 B<br \/>\n1.00<\/p>\n<p>RangeUnionFirst<br \/>\n.NET 9.0<br \/>\n5.181 ns<br \/>\n0.10<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Subsequent PRs also further benefited from this consolidation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99218\">dotnet\/runtime#99218<\/a>, for example, uses it to improve Enumerable.Any(IEnumerable&lt;T&gt;). Any just needs to say whether the source has any elements, and it tries hard to determine that without having to get an enumerator from the source, which allocates, and call MoveNext (an interface call) to see if it returns true. In .NET 8, it was doing this using Enumerable.TryGetNonEnumeratedCount, which uses Iterator&lt;T&gt;.GetCount(onlyIfCheap: true) (the \u201conlyIfCheap\u201d part basically means \u201cdon\u2019t enumerate to compute the count\u201d). However, for iterators where it\u2019s not \u201ccheap\u201d, TryGetNonEnumeratedCount would return false, and Any would still be forced to get an enumerator. However, now that every Iterator&lt;T&gt; provides a TryGetFirst, Any can use that in the case where the source is an Iterator&lt;T&gt; but GetCount isn\u2019t successful. Worst case, TryGetFirst will itself end up calling GetEnumerator, but best case, the iterator will have provided a more efficient implementation of TryGetFirst. 
And either way, it\u2019s still a win, because enumerating would require not only a GetEnumerator call on the Iterator&lt;T&gt;, but also, in turn, a GetEnumerator call on whatever source it was wrapping, whereas this ends up saving one layer.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;int&gt; _data1 = Iterations(100).Where(i =&gt; i % 2 == 0).Select(i =&gt; i);<br \/>\n    private IEnumerable&lt;int&gt; _data2 = Enumerable.Range(0, 100).ToArray().Where(i =&gt; i % 2 == 0).Select(i =&gt; i);<\/p>\n<p>    [Benchmark] public bool Any1() =&gt; _data1.Any();<br \/>\n    [Benchmark] public bool Any2() =&gt; _data2.Any();<\/p>\n<p>    private static IEnumerable&lt;int&gt; Iterations(int count)<br \/>\n    {<br \/>\n        for (int i = 0; i &lt; count; i++) yield return i;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Any1<br \/>\n.NET 8.0<br \/>\n31.967 ns<br \/>\n1.00<br \/>\n104 B<br \/>\n1.00<\/p>\n<p>Any1<br \/>\n.NET 9.0<br \/>\n15.818 ns<br \/>\n0.49<br \/>\n40 B<br \/>\n0.38<\/p>\n<p>Any2<br \/>\n.NET 8.0<br \/>\n21.062 ns<br \/>\n1.00<br \/>\n56 B<br \/>\n1.00<\/p>\n<p>Any2<br \/>\n.NET 9.0<br \/>\n3.780 ns<br \/>\n0.18<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Another cross-cutting improvement across LINQ comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96602\">dotnet\/runtime#96602<\/a> and has to do with empty inputs. 
It\u2019s also a nice example of how what\u2019s considered an optimization ebbs and flows. In the beginning of LINQ, Enumerable.Empty&lt;T&gt;(), which is strongly-typed to return IEnumerable&lt;T&gt;, returned an empty T[] as the actual instance. When Array.Empty&lt;T&gt;() was introduced, it used that. Then, however, the aforementioned IPartition&lt;T&gt; was introduced internally in LINQ, and Enumerable.Empty&lt;T&gt;() was changed to return a singleton EmptyPartition&lt;T&gt;, an implementation of the interface with all of the methods dedicated to being efficient for empty inputs. This was helpful internally as an implementation detail, as methods that were typed to return IPartition&lt;T&gt; could return that EmptyPartition&lt;T&gt; instance, whereas they couldn\u2019t return a T[], since it doesn\u2019t implement that interface. However, it had a downside. A variety of APIs can optimize very well if they know the input is empty, e.g. a Take call can immediately return an empty singleton if it knows the input is empty. But, it can\u2019t be based solely on whether it\u2019s empty <em>now<\/em>, but rather if it\u2019s empty now and for always; otherwise, you could call Take, it would see it was empty, then elements get added to the source, and then you call GetEnumerator on the enumerable returned from Take\u2026 according to the rules for how all of this behaves, that should yield the newly-added elements, but if Take had returned an empty singleton, it wouldn\u2019t. There are a variety of types that we know will always be empty once seen as empty (e.g. ImmutableArray&lt;T&gt;, T[], FrozenSet&lt;T&gt;, etc.), but it\u2019d be too costly to check for each of them individually. Instead, the implementation just picked the same type as Enumerable.Empty&lt;T&gt;() returned as the one to check for. 
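The deferred-execution rule at play here is easy to demonstrate with a few lines of standard LINQ:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var list = new List<int>();
IEnumerable<int> firstTwo = list.Take(2); // the list is empty at this point

// Elements added after the query is constructed but before it's enumerated
// must still be observed, per LINQ's deferred-execution semantics. This is
// why "the source is empty right now" alone doesn't justify returning an
// empty singleton.
list.Add(1);
list.Add(2);
list.Add(3);

Console.WriteLine(string.Join(",", firstTwo)); // prints "1,2"
```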
That\u2019s fairly reasonable, but as it turns out, when that type is EmptyPartition&lt;T&gt;, there are a lot of empty arrays that are no longer noticed as being a special empty input. This gets even worse with collection expressions in the picture, as initializing an IEnumerable&lt;T&gt; with [] will, as an implementation detail, produce Array.Empty&lt;T&gt;(). So, this PR put everything back on a plan of Enumerable.Empty&lt;T&gt;() being Array.Empty&lt;T&gt;() and a T[0] being what\u2019s checked for when special-casing empty inputs. The PR also included new checks for empty in many different places that warranted it.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private string[] _values = [];<\/p>\n<p>    [Benchmark] public object Chunk() =&gt; _values.Chunk(10);<br \/>\n    [Benchmark] public object Distinct() =&gt; _values.Distinct();<br \/>\n    [Benchmark] public object GroupJoin() =&gt; _values.GroupJoin(_values, i =&gt; i, i =&gt; i, (i, j) =&gt; i);<br \/>\n    [Benchmark] public object Join() =&gt; _values.Join(_values, i =&gt; i, i =&gt; i, (i, j) =&gt; i);<br \/>\n    [Benchmark] public object ToLookup() =&gt; _values.ToLookup(i =&gt; i);<br \/>\n    [Benchmark] public object Reverse() =&gt; _values.Reverse();<br \/>\n    [Benchmark] public object SelectIndex() =&gt; _values.Select((s, i) =&gt; i);<br \/>\n    [Benchmark] public object SelectMany() =&gt; _values.SelectMany(i =&gt; i);<br \/>\n    [Benchmark] public object SkipWhile() =&gt; _values.SkipWhile(i =&gt; true);<br \/>\n    [Benchmark] public object TakeWhile() =&gt; 
_values.TakeWhile(i =&gt; true);<br \/>\n    [Benchmark] public object WhereIndex() =&gt; _values.Where((s, i) =&gt; true);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Chunk<br \/>\n.NET 8.0<br \/>\n10.7213 ns<br \/>\n1.00<br \/>\n72 B<br \/>\n1.00<\/p>\n<p>Chunk<br \/>\n.NET 9.0<br \/>\n4.1320 ns<br \/>\n0.39<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Distinct<br \/>\n.NET 8.0<br \/>\n9.4410 ns<br \/>\n1.00<br \/>\n64 B<br \/>\n1.00<\/p>\n<p>Distinct<br \/>\n.NET 9.0<br \/>\n0.7162 ns<br \/>\n0.08<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>GroupJoin<br \/>\n.NET 8.0<br \/>\n22.4746 ns<br \/>\n1.00<br \/>\n144 B<br \/>\n1.00<\/p>\n<p>GroupJoin<br \/>\n.NET 9.0<br \/>\n1.1356 ns<br \/>\n0.05<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Join<br \/>\n.NET 8.0<br \/>\n18.6332 ns<br \/>\n1.00<br \/>\n168 B<br \/>\n1.00<\/p>\n<p>Join<br \/>\n.NET 9.0<br \/>\n1.3585 ns<br \/>\n0.07<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>ToLookup<br \/>\n.NET 8.0<br \/>\n23.3518 ns<br \/>\n1.00<br \/>\n128 B<br \/>\n1.00<\/p>\n<p>ToLookup<br \/>\n.NET 9.0<br \/>\n0.9539 ns<br \/>\n0.04<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Reverse<br \/>\n.NET 8.0<br \/>\n9.5791 ns<br \/>\n1.00<br \/>\n48 B<br \/>\n1.00<\/p>\n<p>Reverse<br \/>\n.NET 9.0<br \/>\n0.9947 ns<br \/>\n0.10<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>SelectIndex<br \/>\n.NET 8.0<br \/>\n11.1235 ns<br \/>\n1.00<br \/>\n72 B<br \/>\n1.00<\/p>\n<p>SelectIndex<br \/>\n.NET 9.0<br \/>\n0.5603 ns<br \/>\n0.05<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>SelectMany<br \/>\n.NET 8.0<br \/>\n10.7537 ns<br \/>\n1.00<br \/>\n64 B<br \/>\n1.00<\/p>\n<p>SelectMany<br \/>\n.NET 9.0<br \/>\n0.9906 ns<br \/>\n0.09<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>SkipWhile<br \/>\n.NET 8.0<br \/>\n11.2900 ns<br \/>\n1.00<br \/>\n72 B<br \/>\n1.00<\/p>\n<p>SkipWhile<br \/>\n.NET 9.0<br \/>\n1.0988 ns<br \/>\n0.10<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>TakeWhile<br \/>\n.NET 8.0<br \/>\n11.8818 ns<br \/>\n1.00<br \/>\n72 B<br 
\/>\n1.00<\/p>\n<p>TakeWhile<br \/>\n.NET 9.0<br \/>\n1.0381 ns<br \/>\n0.09<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>WhereIndex<br \/>\n.NET 8.0<br \/>\n11.1751 ns<br \/>\n1.00<br \/>\n80 B<br \/>\n1.00<\/p>\n<p>WhereIndex<br \/>\n.NET 9.0<br \/>\n1.2185 ns<br \/>\n0.11<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98963\">dotnet\/runtime#98963<\/a> also has to do with emptiness, but actually improves non-empty cases. DefaultIfEmpty needs to produce an IEnumerable&lt;T&gt; containing all of the elements from the source, or if the source is empty, an enumerable with a single default(T) value. In most cases, that means it has to allocate a new enumerable, because for the same reasons as just described, it can\u2019t know until GetEnumerator is called whether the source is empty. Except, it can if the source is a T[], which has an immutable length. This PR thus special-cases arrays, which are very common, such that if the array isn\u2019t empty, it\u2019s just returned directly rather than allocating a wrapper enumerable for it. That\u2019s more than just about avoiding an allocation: that wrapper object would otherwise sit in the middle of all subsequent enumeration of the result, so avoiding it avoids not only an allocation but also a layer of interface calls. 
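For reference, here is DefaultIfEmpty's observable behavior, which any such optimization must preserve (these are just the standard LINQ APIs):

```csharp
using System;
using System.Linq;

int[] empty = Array.Empty<int>();
int[] values = { 1, 2, 3 };

// An empty source yields a single default (or supplied) value...
Console.WriteLine(string.Join(",", empty.DefaultIfEmpty()));   // "0"
Console.WriteLine(string.Join(",", empty.DefaultIfEmpty(42))); // "42"

// ...while a non-empty source yields exactly its own elements. Because an
// array's length can never change, the runtime is free (as an
// implementation detail) to hand back the array itself instead of
// allocating a wrapper.
Console.WriteLine(string.Join(",", values.DefaultIfEmpty()));  // "1,2,3"
```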
And for subsequent code paths that special-case arrays, the result of DefaultIfEmpty will still be seen as a T[] and thus now special-cased, whereas it wouldn\u2019t be if it were wrapped.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _data = Enumerable.Range(0, 1000).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public double Average() =&gt; _data.DefaultIfEmpty().Average();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Average<br \/>\n.NET 8.0<br \/>\n1,915.4 ns<br \/>\n1.00<br \/>\n80 B<br \/>\n1.00<\/p>\n<p>Average<br \/>\n.NET 9.0<br \/>\n117.6 ns<br \/>\n0.06<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Another change taking advantage of emptiness is <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99256\">dotnet\/runtime#99256<\/a>, this time for Enumerable.Chunk. Chunk(int size) creates an IEnumerable&lt;T[]&gt; that pages through the input size elements at a time. Normally, this requires iterating through the source and buffering until size elements have been reached, then yielding an array with those elements, and then rinsing and repeating. With an array input, we could do this much more efficiently, as we could just do math to compute the right bounds for each set to be yielded and do an efficient copy of the elements, rather than iterating through each element one by one. 
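Conceptually, the array fast path comes down to index arithmetic plus block copies. Here's an illustrative sketch (ChunkArray is an invented name; this is the idea, not the actual implementation):

```csharp
using System;
using System.Collections.Generic;

// Sketch of chunking an array: each chunk's bounds are computed directly,
// and Array.Copy does a block copy instead of moving elements one
// MoveNext/Current pair at a time.
static IEnumerable<T[]> ChunkArray<T>(T[] source, int size)
{
    for (int i = 0; i < source.Length; i += size)
    {
        var chunk = new T[Math.Min(size, source.Length - i)];
        Array.Copy(source, i, chunk, 0, chunk.Length);
        yield return chunk;
    }
}

foreach (var chunk in ChunkArray(new[] { 1, 2, 3, 4, 5 }, 2))
    Console.WriteLine(string.Join(",", chunk)); // "1,2" then "3,4" then "5"
```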
And while it might not be worth adding a specialized check for array here (Chunk isn\u2019t exactly a high-performance method to begin with, given it\u2019s allocating a new array for each set), as it turns out we now have a check for array, as part of determining whether the source is permanently empty. This PR just leverages that check to take advantage of both answers. If the array is empty, then it still just returns an empty array. But if it\u2019s not empty, rather than falling back to the normal iteration path, it employs a 7-line alternative that\u2019s specialized to arrays and much more efficient.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _values = Enumerable.Range(0, 1000).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count()<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        foreach (var chunk in _values.Chunk(10)) count += chunk.Length;<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n3.612 us<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n1.334 us<br \/>\n0.37<\/p>\n<p>The statement about checking for empty now and permanently applies in particular to methods that accept and return enumerables. It\u2019s the laziness of these methods that makes that relevant. There is, however, a set of LINQ methods that are not lazy because they produce things that aren\u2019t enumerables, such as ToArray returning an array, Sum returning a single value, Count returning an int, and so on. 
These methods also received attention, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102884\">dotnet\/runtime#102884<\/a> from <a href=\"https:\/\/github.com\/neon-sunset\">@neon-sunset<\/a>. One of the optimizations applied in various LINQ methods is to special-case input types that are super common, in particular T[] and List&lt;T&gt;. These can be special-cased not just as IList&lt;T&gt;, which would generally be more efficient than enumerating an input via an IEnumerator&lt;T&gt;, but rather as a ReadOnlySpan&lt;T&gt;, which can be iterated through very efficiently. This PR extended that optimization to apply to most of these other non-enumerable producing methods, in particular overloads of Any, All, Count, First, and Single that take predicates. This is particularly helpful because recent additions to analyzers have resulted in developers being told about opportunities to simplify their LINQ usage. <a href=\"https:\/\/learn.microsoft.com\/dotnet\/fundamentals\/code-analysis\/style-rules\/ide0120\">IDE0120<\/a> flags code like source.Where(predicate).First() and instead recommends simplifying it to source.First(predicate). And while that is a nice simplification and is likely to reduce allocation, Where is considerably more optimized than First(predicate) has been, with the former having special-casing for T[] and List&lt;T&gt; but the latter historically not. 
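The heart of that special-casing can be sketched with CollectionsMarshal.AsSpan, the public API for viewing a List&lt;T&gt;'s backing storage as a span (AnySpan is an invented helper name; LINQ's internal helper differs):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static bool AnySpan<T>(List<T> list, Func<T, bool> predicate)
{
    // CollectionsMarshal.AsSpan exposes the list's backing array as a span
    // (valid only while the list isn't structurally mutated): no enumerator
    // allocation and no interface dispatch per element.
    foreach (T item in CollectionsMarshal.AsSpan(list))
    {
        if (predicate(item)) return true;
    }
    return false;
}

var numbers = new List<int> { 1, 3, 5, 6 };
Console.WriteLine(AnySpan(numbers, i => i % 2 == 0)); // True
```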
That difference is now addressed in .NET 9.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;int&gt; _list = Enumerable.Range(0, 1000).ToList();<\/p>\n<p>    [Benchmark] public bool Any() =&gt; _list.Any(i =&gt; i == 1000);<br \/>\n    [Benchmark] public bool All() =&gt; _list.All(i =&gt; i &gt;= 0);<br \/>\n    [Benchmark] public int Count() =&gt; _list.Count(i =&gt; i == 0);<br \/>\n    [Benchmark] public int First() =&gt; _list.First(i =&gt; i == 999);<br \/>\n    [Benchmark] public int Single() =&gt; _list.Single(i =&gt; i == 0);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Any<br \/>\n.NET 8.0<br \/>\n1,553.3 ns<br \/>\n1.00<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>Any<br \/>\n.NET 9.0<br \/>\n222.2 ns<br \/>\n0.14<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>All<br \/>\n.NET 8.0<br \/>\n1,586.0 ns<br \/>\n1.00<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>All<br \/>\n.NET 9.0<br \/>\n224.9 ns<br \/>\n0.14<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n1,535.6 ns<br \/>\n1.00<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n244.6 ns<br \/>\n0.16<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>First<br \/>\n.NET 8.0<br \/>\n1,600.7 ns<br \/>\n1.00<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>First<br \/>\n.NET 9.0<br \/>\n245.4 ns<br \/>\n0.15<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Single<br \/>\n.NET 8.0<br \/>\n1,550.6 ns<br \/>\n1.00<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>Single<br \/>\n.NET 9.0<br \/>\n239.4 ns<br \/>\n0.15<br \/>\n\u2013<br 
\/>\n0.00<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97004\">dotnet\/runtime#97004<\/a> from <a href=\"https:\/\/github.com\/neon-sunset\">@neon-sunset<\/a> uses that same mechanism to improve performance for List&lt;T&gt; inputs inside of Enumerable.SequenceEqual. Enumerable.SequenceEqual already had a special-case that checked whether both inputs were arrays, and if they were, it created spans from those arrays and delegated to MemoryExtensions.SequenceEqual, which will efficiently iterate through the spans, vectorizing if possible. This PR just tweaked that special-case to use the same helper that\u2019s used elsewhere to try to get a span from the source, and that gives this superpower to List&lt;T&gt; as well.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;int&gt; _source1, _source2;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _source1 = Enumerable.Range(0, 10_000).ToArray();<br \/>\n        _source2 = _source1.ToList();<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public bool SequenceEqual() =&gt; _source1.SequenceEqual(_source2);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>SequenceEqual<br \/>\n.NET 8.0<br \/>\n26,623.3 ns<br \/>\n1.00<\/p>\n<p>SequenceEqual<br \/>\n.NET 9.0<br \/>\n913.4 ns<br \/>\n0.03<\/p>\n<p>ToArray and ToList were generally improved via a variety of PRs, in particular by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96570\">dotnet\/runtime#96570<\/a>.
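Before digging into ToArray, the span-based fast path just described for SequenceEqual, where both an array and a List&lt;T&gt; can be viewed as spans and compared with MemoryExtensions.SequenceEqual, can be sketched as follows. This is an illustration of the concept, not the actual internal code; CollectionsMarshal.AsSpan is the supported way to view a List&lt;T&gt;'s backing storage as a span.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;

int[] array = Enumerable.Range(0, 1_000).ToArray();
List<int> list = array.ToList();

// Both inputs can be viewed as spans, at which point the comparison can use
// the span-based (and potentially vectorized) MemoryExtensions.SequenceEqual.
ReadOnlySpan<int> left = array;
ReadOnlySpan<int> right = CollectionsMarshal.AsSpan(list);
bool equal = left.SequenceEqual(right);
Console.WriteLine(equal); // True
```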
ToArray in particular is used so ubiquitously that over the years, many folks have attempted to optimize it. In doing so, however, it\u2019s gotten too complex for its own good. This PR takes advantage of newer runtime capabilities to significantly simplify the implementation, while also improving common case performance. The easy cases were already handled well and continue to be: if the source is an ICollection&lt;T&gt;, its Count \/ CopyTo methods can be used to provide a very efficient ToArray, and if the source is an Iterator&lt;TSource&gt;, ToArray just delegates to the iterator\u2019s ToArray implementation. The challenge, instead, is in efficiently handling the case where we\u2019re dealing with an IEnumerable&lt;T&gt; of unknown length, needing to handle both short and long inputs, and doing so in a way that minimizes allocation and maximizes throughput. To achieve that, this PR deleted the internal ArrayBuilder, LargeArrayBuilder, and SparseArrayBuilder types that were previously being used and replaced them all with a simpler internal SegmentedArrayBuilder. The builder is seeded with an [InlineArray]-based struct that\u2019s large enough to hold eight T instances. For up to eight items, the builder can simply use that stack-based space to store the elements. For more than eight items, the builder contains another [InlineArray] of up to 27 T[]s. The arrays stored in there are rented from the ArrayPool&lt;T&gt;, and based on the starting size and standard doubling growth algorithm, 27 arrays is enough to store Array.MaxLength elements. This approach means that small inputs never need to allocate (other than for the final T[], which is unavoidable as it\u2019s the whole purpose of the method), and larger inputs can use ArrayPool&lt;T&gt; arrays without having to copy while growing, leading to on average significantly less allocation than before and generally faster throughput. 
There are trade-offs to this approach when compared to the previous one, with a few niche corner cases it doesn\u2019t handle quite as efficiently, but on the whole it\u2019s an improvement in both performance and maintainability.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(1)]<br \/>\n    [Arguments(8)]<br \/>\n    [Arguments(500)]<br \/>\n    public string[] IteratorToArray(int count) =&gt; GetItems(count).ToArray();<\/p>\n<p>    private IEnumerable&lt;string&gt; GetItems(int count)<br \/>\n    {<br \/>\n        for (int i = 0; i &lt; count; i++)<br \/>\n        {<br \/>\n            yield return &#8220;.NET 9&#8221;;<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\ncount<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>IteratorToArray<br \/>\n.NET 8.0<br \/>\n1<br \/>\n65.51 ns<br \/>\n1.00<br \/>\n136 B<br \/>\n1.00<\/p>\n<p>IteratorToArray<br \/>\n.NET 9.0<br \/>\n1<br \/>\n41.39 ns<br \/>\n0.63<br \/>\n80 B<br \/>\n0.59<\/p>\n<p>IteratorToArray<br \/>\n.NET 8.0<br \/>\n8<br \/>\n103.30 ns<br \/>\n1.00<br \/>\n192 B<br \/>\n1.00<\/p>\n<p>IteratorToArray<br \/>\n.NET 9.0<br \/>\n8<br \/>\n74.66 ns<br \/>\n0.72<br \/>\n136 B<br \/>\n0.71<\/p>\n<p>IteratorToArray<br \/>\n.NET 8.0<br \/>\n500<br \/>\n3,100.69 ns<br \/>\n1.00<br \/>\n8536 B<br \/>\n1.00<\/p>\n<p>IteratorToArray<br \/>\n.NET 9.0<br \/>\n500<br \/>\n3,080.31 ns<br \/>\n0.99<br \/>\n4072 B<br \/>\n0.48<\/p>\n<p><a 
href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104365\">dotnet\/runtime#104365<\/a> from <a href=\"https:\/\/github.com\/andrewjsaid\">@andrewjsaid<\/a> followed-up on this to use that same SegmentedArrayBuilder to improve ToList. Everything stays the same, except for the last step of constructing the final collection to be returned: rather than allocating an array, it allocates a List&lt;T&gt; and uses the CollectionsMarshal.SetCount method to set both the Capacity and Count of the list to the desired size, then copies the elements directly into the backing array for the list, thanks to CollectionsMarshal.AsSpan. ToList was also improved in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86796\">dotnet\/runtime#86796<\/a> from <a href=\"https:\/\/github.com\/brantburnett\">@brantburnett<\/a>. In various Iterator&lt;T&gt;.ToList specializations, the common pattern is to use List&lt;T&gt;.Add to fill in the resulting collection. This PR used a similar approach as with the previous PR, using a combination of CollectionsMarshal.SetCount and CollectionsMarshal.AsSpan to get a span for the list and directly write into the span. 
This saves some of the overhead from List&lt;T&gt;.Add, including bounds checks that would otherwise occur when writing to its backing array.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public List&lt;int&gt; IteratorSelectToList() =&gt; GetItems(8).Select(i =&gt; i).ToList();<\/p>\n<p>    [Benchmark]<br \/>\n    public List&lt;int&gt; IteratorWhereSelectToList() =&gt; GetItems(8).Where(i =&gt; true).Select(i =&gt; i).ToList();<\/p>\n<p>    private IEnumerable&lt;int&gt; GetItems(int count)<br \/>\n    {<br \/>\n        for (int i = 0; i &lt; count; i++)<br \/>\n        {<br \/>\n            yield return i;<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>IteratorSelectToList<br \/>\n.NET 8.0<br \/>\n75.14 ns<br \/>\n1.00<br \/>\n224 B<br \/>\n1.00<\/p>\n<p>IteratorSelectToList<br \/>\n.NET 9.0<br \/>\n67.50 ns<br \/>\n0.90<br \/>\n184 B<br \/>\n0.82<\/p>\n<p>IteratorWhereSelectToList<br \/>\n.NET 8.0<br \/>\n94.84 ns<br \/>\n1.00<br \/>\n288 B<br \/>\n1.00<\/p>\n<p>IteratorWhereSelectToList<br \/>\n.NET 9.0<br \/>\n89.42 ns<br \/>\n0.94<br \/>\n248 B<br \/>\n0.86<\/p>\n<p>A few more tweaks were made to ToList and ToArray in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95224\">dotnet\/runtime#95224<\/a> from <a href=\"https:\/\/github.com\/Windows10CE\">@Windows10CE<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100218\">dotnet\/runtime#100218<\/a>. 
The former improved ToList on the result of a Distinct or Union by enabling HashSet&lt;T&gt;\u2018s CopyTo implementation to be used; previously a custom function was manually iterating through the set, and this PR deleted that code (yay!) and just used List&lt;T&gt;\u2018s constructor directly. The latter PR also improved Distinct and Union, but for ToArray, and specifically in the case where it would have allocated a 0-length array when the source was empty. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99639\">dotnet\/runtime#99639<\/a> also improved ToArray and ToList on the result of an OrderBy; OrderBy\u2018s iterator already special-cased empty sources, but with small tweaks it could also be made to optimize sources with only a single element, in which case no additional work needs to be done (a length-1 array is inherently sorted).<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public int[] OrderByToArray() =&gt; GetItems(1).OrderBy(x =&gt; x).ToArray();<\/p>\n<p>    private IEnumerable&lt;int&gt; GetItems(int count)<br \/>\n    {<br \/>\n        for (int i = 0; i &lt; count; i++)<br \/>\n        {<br \/>\n            yield return i;<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>OrderByToArray<br \/>\n.NET 8.0<br \/>\n66.99 ns<br \/>\n1.00<br \/>\n352 B<br \/>\n1.00<\/p>\n<p>OrderByToArray<br \/>\n.NET 9.0<br \/>\n53.92 ns<br \/>\n0.81<br \/>\n160 B<br \/>\n0.45<\/p>\n<p>Not to be left out from the fun 
its To cousins are having, ToDictionary also sees improvements from <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96574\">dotnet\/runtime#96574<\/a> from <a href=\"https:\/\/github.com\/xin9le\">@xin9le<\/a>. The PR changes the code to do a better job of setting the capacity of the Dictionary&lt;TKey, TValue&gt; prior to filling it, and also uses CollectionsMarshal.AsSpan to share code for handling sources that are arrays and lists, while shaving off some overhead by enumerating the span rather than the list directly.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private readonly IEnumerable&lt;KeyValuePair&lt;int, int&gt;&gt; _enumerable = Enumerable.Range(0, 10_000).Select(x =&gt; new KeyValuePair&lt;int, int&gt;(x, x));<\/p>\n<p>    [Benchmark]<br \/>\n    public Dictionary&lt;int, KeyValuePair&lt;int, int&gt;&gt; EnumerableToDictionary() =&gt; _enumerable.ToDictionary(x =&gt; x.Key);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>EnumerableToDictionary<br \/>\n.NET 8.0<br \/>\n284.3 us<br \/>\n1.00<br \/>\n788.73 KB<br \/>\n1.00<\/p>\n<p>EnumerableToDictionary<br \/>\n.NET 9.0<br \/>\n149.9 us<br \/>\n0.53<br \/>\n237.01 KB<br \/>\n0.30<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96605\">dotnet\/runtime#96605<\/a> updated Enumerable.Min and Enumerable.Max to specialize for char, Int128, and UInt128 (previous changes specialized other numerical primitives, but these had been left out).
By taking advantage of the existing code paths for handling those other primitives, with just a few lines added\/changed, these types can now utilize those faster paths, which in particular special-case arrays and lists (which means it can then avoid an enumerator allocation in addition to faster access to each element).<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Int128[] _values = Enumerable.Range(0, 1000).Select(x =&gt; (Int128)x).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public Int128 Max() =&gt; _values.Max();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Max<br \/>\n.NET 8.0<br \/>\n1,882.0 ns<br \/>\n1.00<br \/>\n32 B<br \/>\n1.00<\/p>\n<p>Max<br \/>\n.NET 9.0<br \/>\n624.7 ns<br \/>\n0.33<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>The aforementioned special code paths for the primitive types also support vectorization. 
Previously that vectorization only supported 128-bit and 256-bit vector widths, but as of <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93369\">dotnet\/runtime#93369<\/a> from <a href=\"https:\/\/github.com\/Spacefish\">@Spacefish<\/a>, it now also supports 512-bit vector widths, possibly doubling the throughput of Enumerable.Min and Enumerable.Max on supported hardware with the core numerical primitive types.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _values = Enumerable.Range(0, 10_000).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Max() =&gt; _values.Max();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Max<br \/>\n.NET 8.0<br \/>\n327.6 ns<br \/>\n1.00<\/p>\n<p>Max<br \/>\n.NET 9.0<br \/>\n166.3 ns<br \/>\n0.51<\/p>\n<p>One caveat here about AVX512. Some AVX512 hardware, even on recent chips, can take a measurable amount of time to \u201cpower up,\u201d such that it might be tens, hundreds, or even thousands of cycles before AVX512 processing ends up actually dispatching 512-bit vectors. Until then, the hardware might end up doing the equivalent of dispatching two 256-bit vectors. On my machine, for example, if I lower the size in the previous benchmark from 10,000 elements to 1,000 elements, the .NET 9 improvement disappears and it ends up running at exactly the same throughput as on .NET 8; on a colleague\u2019s machine with a different processor, even at 1,000 elements the .NET 9 throughput is almost twice that of .NET 8. This is all to say, your mileage may vary. 
In some of the micro-benchmarks discussed in this post, small improvements are made to already very fast operations, and the gains then come from those operations being done many, many, many times on hot paths. In others, the gains come from taking an expensive operation and making it measurably cheaper. In general, the benefits of using AVX512 in these kinds of vectorized implementations come in the latter case, where large data sizes lead to operations taking significant amounts of time, and the use of 512-bit vectors instead of 256-bit vectors measurably speeds up those longer operations.<\/p>\n<p>The OrderBy family of methods on Enumerable was also improved in several ways:<\/p>\n<p>Ordering operations followed by a First() or Last() call were already specialized to completely avoid the O(N log N) sort and instead do an O(N) search for the min or max. However, OrderBy in LINQ is fairly complicated because it needs to account for the possibility of one or more subsequent ThenBy operations that impact the sort order, and thus it uses a custom comparison mechanism that factors in the possibility of such refinement. That custom comparer mechanism was being used as part of those First\/Last specializations. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97483\">dotnet\/runtime#97483<\/a> detects whether there are any ThenBys in play, and if there aren\u2019t, it bypasses that customization and, in doing so, avoids its overhead. That can be very measurable, but in certain cases, it can be enormous, as it can enable other optimizations to kick in, e.g.
an Order().Last() on an int[] can just end up doing a vectorized search for the max.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.InteropServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private List&lt;int&gt; _ints;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _ints = new(Enumerable.Range(-8000, 8000 * 2));<br \/>\n        new Random(42).Shuffle(CollectionsMarshal.AsSpan(_ints));<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public int OrderByLast_Int32() =&gt; _ints.OrderBy(x =&gt; x).Last();<\/p>\n<p>    [Benchmark]<br \/>\n    public int OrderLast_Int32() =&gt; _ints.Order().Last();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>OrderByLast_Int32<br \/>\n.NET 8.0<br \/>\n34,715.6 ns<br \/>\n1.00<br \/>\n136 B<br \/>\n1.00<\/p>\n<p>OrderByLast_Int32<br \/>\n.NET 9.0<br \/>\n25,001.1 ns<br \/>\n0.72<br \/>\n128 B<br \/>\n0.94<\/p>\n<p>OrderLast_Int32<br \/>\n.NET 8.0<br \/>\n36,064.9 ns<br \/>\n1.00<br \/>\n112 B<br \/>\n1.00<\/p>\n<p>OrderLast_Int32<br \/>\n.NET 9.0<br \/>\n693.8 ns<br \/>\n0.02<br \/>\n56 B<br \/>\n0.50<\/p>\n<p>In .NET 8, Enumerable.Order was updated to recognize that sorting of certain primitive types is implicitly stable even if an unstable sorting algorithm is used, because any two values of such types that compare equally are indistinguishable in memory (e.g. the only Int32 values that compare equally are those with the exact same bit patterns in memory).
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99533\">dotnet\/runtime#99533<\/a> improves this logic to also handle enums whose underlying type counts.<br \/>\ndotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;DayOfWeek&gt; _days = new Random(42).GetItems(Enum.GetValues&lt;DayOfWeek&gt;(), 100);<\/p>\n<p>    [Benchmark]<br \/>\n    public int Order()<br \/>\n    {<br \/>\n        int sum = 0;<br \/>\n        foreach (DayOfWeek dow in _days.Order())<br \/>\n        {<br \/>\n            sum += (int)dow;<br \/>\n        }<br \/>\n        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Order<br \/>\n.NET 8.0<br \/>\n1,652.9 ns<br \/>\n1.00<br \/>\n1088 B<br \/>\n1.00<\/p>\n<p>Order<br \/>\n.NET 9.0<br \/>\n873.0 ns<br \/>\n0.53<br \/>\n544 B<br \/>\n0.50<\/p>\n<p>In .NET 8, a change was submitted to Enumerable.Range to vectorize its operation when followed by methods like ToArray. At the time, we had some debate about whether to merge the change, with me asking questions like \u201cWho would actually use Enumerable.Range(&#8230;).ToArray() on code paths that care about performance?\u201d As it turns out, we do! As part of OrderBy\u2018s stable sort implementation, it had code like this:<br \/>\nint[] map = new int[count];<br \/>\nfor (int i = 0; i &lt; map.Length; i++)<br \/>\n{<br \/>\n    map[i] = i;<br \/>\n}<\/p>\n<p>For all intents and purposes, that\u2019s Enumerable.Range(0, count).ToArray(). 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99538\">dotnet\/runtime#99538<\/a> recognizes this and uses that same vectorized helper to fill this array in a vectorized manner, and that overhead can actually be measurable in some cases.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEnumerable&lt;int&gt; _data = Enumerable.Range(0, 1000).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public int OrderBy()<br \/>\n    {<br \/>\n        int sum = 0;<br \/>\n        foreach (int value in _data.OrderBy(i =&gt; i))<br \/>\n        {<br \/>\n            sum += value;<br \/>\n        }<\/p>\n<p>        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>OrderBy<br \/>\n.NET 8.0<br \/>\n14.83 us<br \/>\n1.00<\/p>\n<p>OrderBy<br \/>\n.NET 9.0<br \/>\n13.48 us<br \/>\n0.91<\/p>\n<p>GroupBy and ToLookup also get some dedicated improvements in .NET 9, thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99365\">dotnet\/runtime#99365<\/a>. GetEnumerator on the grouping object returned by these methods was implemented using a simple C# iterator:<\/p>\n<p>public IEnumerator&lt;TElement&gt; GetEnumerator()<br \/>\n{<br \/>\n    for (int i = 0; i &lt; _count; i++)<br \/>\n    {<br \/>\n        yield return _elements[i];<br \/>\n    }<br \/>\n}<\/p>\n<p>In general we favor using C# iterators over manual implementations (unless we\u2019re going to go all out and implement all of the Iterator&lt;TSource&gt; logic) because C# iterators make the code so simple and maintainable. 
In this particular case, however, this is a reasonably common hot path and we can actually do meaningfully better by hand than the compiler is able to do today. When the compiler generates a state machine for the previous iterator, it does so with a dedicated state field, but with a manual implementation, we can use the same field for state as we use for the iteration variable, which also means we only need to update one thing per loop iteration.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private ILookup&lt;int, string&gt; _stringsByLength =<br \/>\n        (from i in Enumerable.Range(0, 10)<br \/>\n         from item in Enumerable.Range(0, 8)<br \/>\n         select new string((char)(&#8216;a&#8217; + item), i + 1)).ToLookup(s =&gt; s.Length);<\/p>\n<p>    [Benchmark]<br \/>\n    public int Sum()<br \/>\n    {<br \/>\n        int sum = 0;<br \/>\n        foreach (IGrouping&lt;int, string&gt; group in _stringsByLength)<br \/>\n        {<br \/>\n            foreach (string item in group)<br \/>\n            {<br \/>\n                sum += item.Length;<br \/>\n            }<br \/>\n        }<br \/>\n        return sum;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Sum<br \/>\n.NET 8.0<br \/>\n290.3 ns<br \/>\n1.00<\/p>\n<p>Sum<br \/>\n.NET 9.0<br \/>\n267.4 ns<br \/>\n0.92<\/p>\n<h3>Core Collections<\/h3>\n<p>As shared in <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-8\/#list\">Performance Improvements in .NET 8<\/a>, Dictionary&lt;TKey, TValue&gt; is one of the most popular collections in 
all of .NET, by far (probably not surprising to anyone). And in .NET 9, it gets a performance-focused feature I\u2019ve been wanting for years.<\/p>\n<p>One of the most common uses for a dictionary is as a cache, often indexed by a string key. And for high-performance scenarios, such caches are frequently used in situations where an actual string object may not be available, but where the text is available, just in a different form, like a ReadOnlySpan&lt;char&gt; (or for caches indexed by UTF8 data, the key might be a byte[] yet the data to perform the lookup is only available as a ReadOnlySpan&lt;byte&gt;). Performing the lookup on the dictionary then would either require materializing a string from the data, which makes the lookup more costly (and in some cases can entirely defeat the purposes of the cache), or require using a custom key type that\u2019s capable of handling multiple forms of the data, which then also generally requires a custom comparer.<\/p>\n<p>This has been addressed in .NET 9 with the introduction of IAlternateEqualityComparer&lt;TAlternate, T&gt;. A comparer that implements IEqualityComparer&lt;T&gt; may now also implement this additional interface one or more times for other TAlternate types, making it possible for that comparer to treat alternate types the same as the T. Then a type like Dictionary&lt;TKey, TValue&gt; can expose additional methods that work in terms of a TAlternateKey and allow them to work if the comparer in that Dictionary&lt;TKey, TValue&gt; implements IAlternateEqualityComparer&lt;TAlternateKey, TKey&gt;. In .NET 9 with <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102907\">dotnet\/runtime#102907<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103191\">dotnet\/runtime#103191<\/a>, Dictionary&lt;TKey, TValue&gt;, ConcurrentDictionary&lt;TKey, TValue&gt;, FrozenDictionary&lt;TKey, TValue&gt;, HashSet&lt;T&gt;, and FrozenSet&lt;T&gt; all do exactly that. 
For example, here I have a Dictionary&lt;string, int&gt; I\u2019m using to count the number of occurrences of each word in a span:<\/p>\n<p>static Dictionary&lt;string, int&gt; CountWords1(ReadOnlySpan&lt;char&gt; input)<br \/>\n{<br \/>\n    Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<\/p>\n<p>    foreach (ValueMatch match in Regex.EnumerateMatches(input, @&#8221;\\b\\w+\\b&#8221;))<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n        string key = word.ToString();<br \/>\n        result[key] = result.TryGetValue(key, out int count) ? count + 1 : 1;<br \/>\n    }<\/p>\n<p>    return result;<br \/>\n}<\/p>\n<p>I\u2019m returning a Dictionary&lt;string, int&gt;, so I certainly need to materialize the string for each ReadOnlySpan&lt;char&gt; in order to <em>store<\/em> it in the dictionary, but I should only need to do so once, the first time the word is found. I shouldn\u2019t need to create a new string each time, yet I\u2019m having to in order to do the TryGetValue call. Now with .NET 9, a new GetAlternateLookup method (and a corresponding TryGetAlternateLookup) exists to produce a separate value type wrapper that enables using an alternate key type for all the relevant operations, which means I can now write this:<\/p>\n<p>static Dictionary&lt;string, int&gt; CountWords2(ReadOnlySpan&lt;char&gt; input)<br \/>\n{<br \/>\n    Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<br \/>\n    Dictionary&lt;string, int&gt;.AlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt; alternate = result.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;();<\/p>\n<p>    foreach (ValueMatch match in Regex.EnumerateMatches(input, @&#8221;\\b\\w+\\b&#8221;))<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n        alternate[word] = alternate.TryGetValue(word, out int count) ?
count + 1 : 1;<br \/>\n    }<\/p>\n<p>    return result;<br \/>\n}<\/p>\n<p>Note the distinct lack of a ToString(), which means no allocation will occur here for words already seen. How then does the alternate[word] = &#8230; part work? Surely this isn\u2019t storing a ReadOnlySpan&lt;char&gt; into the dictionary? Nope. Rather, IAlternateEqualityComparer&lt;TAlternate, T&gt; looks like this:<\/p>\n<p>public interface IAlternateEqualityComparer&lt;in TAlternate, T&gt;<br \/>\n    where TAlternate : allows ref struct<br \/>\n    where T : allows ref struct<br \/>\n{<br \/>\n    bool Equals(TAlternate alternate, T other);<br \/>\n    int GetHashCode(TAlternate alternate);<br \/>\n    T Create(TAlternate alternate);<br \/>\n}<\/p>\n<p>The Equals and GetHashCode should look familiar, the main difference from the corresponding members of IEqualityComparer&lt;T&gt; just being the type of the first parameter. But then there\u2019s this additional Create method. That method accepts a TAlternate and returns a T, giving the comparer the ability to map from one to the other. That setter we saw previously (and other methods like TryAdd) are able to use this to only create the TKey from the TAlternateKey when they have to, so the setter here will only allocate the string for the word if the word doesn\u2019t already exist in the collection.<\/p>\n<p>Another possibly perplexing thing for anyone reading this and who\u2019s well versed in the ways of ReadOnlySpan&lt;T&gt;: how in the world is Dictionary&lt;string, int&gt;.AlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt; valid? ref structs like span can\u2019t be used as generic parameters, right? Right\u2026 until now. C# 13 and .NET 9 now permit ref structs as generic parameters, but the generic parameter needs to opt-in to it via the new allows ref struct constraint (or \u201canti-constraint\u201d as some of us frequently refer to it). 
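As a tiny illustration of the new constraint (a hypothetical helper written for this post, not part of any library; requires C# 13 and .NET 9):

```csharp
using System;

// With 'allows ref struct', T may be instantiated with a ref struct such as
// ReadOnlySpan<char>; in exchange, the body may not box the value, store it
// in a field of a class, or otherwise let it escape to the heap.
static T Identity<T>(T value) where T : allows ref struct => value;

ReadOnlySpan<char> span = "hello";
ReadOnlySpan<char> result = Identity(span); // legal: T == ReadOnlySpan<char>
Console.WriteLine(result.Length); // 5
```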
There are things a method can do with an instance of an unconstrained generic parameter, like cast it to object or store it into a field of a class, that can\u2019t be done with ref struct. By adding allows ref struct to a generic parameter, it tells the compiler compiling the consumer that it may specify a ref struct, and it tells the compiler compiling the type or method with the constraint that the generic instantiation might be a ref struct and thus the generic parameter can only be used in situations where a ref struct would be legal.<\/p>\n<p>Of course, all of this working hinges on the supplied comparer sporting the appropriate IAlternateEqualityComparer&lt;TAlternate, T&gt; implementation; if it doesn\u2019t, attempts to call GetAlternateLookup will throw an exception, and attempts to call TryGetAlternateLookup will return false. You can use these collection types with whatever comparer you want, and that comparer can provide implementations of this interface for whatever alternate key types you want. But with string and ReadOnlySpan&lt;char&gt; being so common, it\u2019d be a shame if there wasn\u2019t built-in support for this combination. And indeed, with the aforementioned PRs, all of the built-in StringComparer types implement IAlternateEqualityComparer&lt;ReadOnlySpan&lt;char&gt;, string&gt;. 
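Because the built-in StringComparer types implement that interface, the same span-based lookups work on HashSet&lt;string&gt; too. A minimal sketch (requires .NET 9):

```csharp
using System;
using System.Collections.Generic;

// StringComparer.OrdinalIgnoreCase implements
// IAlternateEqualityComparer<ReadOnlySpan<char>, string>, so the set can be
// probed (and added to) with spans, materializing a string only when needed.
HashSet<string> seen = new(StringComparer.OrdinalIgnoreCase) { "alpha", "beta" };
HashSet<string>.AlternateLookup<ReadOnlySpan<char>> lookup =
    seen.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "ALPHA gamma";
bool hasAlpha = lookup.Contains(text[..5]); // no string allocated for the probe
bool added = lookup.Add(text[6..]);         // "gamma" materialized only on add
Console.WriteLine($"{hasAlpha} {added} {seen.Count}");
```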
That\u2019s why the Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase); line is successful in the previous code example, as the subsequent call to result.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;() will successfully find the interface on the supplied comparer.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic partial class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<\/p>\n<p>    [GeneratedRegex(@&#8221;\\b\\w+\\b&#8221;)]<br \/>\n    private static partial Regex WordParser();<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public Dictionary&lt;string, int&gt; CountWords1()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<\/p>\n<p>        foreach (ValueMatch match in WordParser().EnumerateMatches(input))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n            string key = word.ToString();<br \/>\n            result[key] = result.TryGetValue(key, out int count) ?
count + 1 : 1;<br \/>\n        }<\/p>\n<p>        return result;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public Dictionary&lt;string, int&gt; CountWords2()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<br \/>\n        Dictionary&lt;string, int&gt;.AlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt; alternate = result.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;();<\/p>\n<p>        foreach (ValueMatch match in WordParser().EnumerateMatches(input))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n            alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;<br \/>\n        }<\/p>\n<p>        return result;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>CountWords1<br \/>\n60.35 ms<br \/>\n1.00<br \/>\n20.67 MB<br \/>\n1.00<\/p>\n<p>CountWords2<br \/>\n57.40 ms<br \/>\n0.95<br \/>\n2.54 MB<br \/>\n0.12<\/p>\n<p>Note the huge reduction in allocation.<\/p>\n<p>For fun, we can also take this example one step further. .NET 6 introduced the CollectionsMarshal.GetValueRefOrAddDefault method, which returns a writable ref to the actual location where the TValue for a given TKey is stored, creating the entry in the dictionary if it doesn\u2019t exist. This is very handy for operations like the one used above, as it helps to avoid an extra dictionary lookup. Without it, we\u2019re doing one lookup as part of the TryGetValue and then another lookup as part of the setter, but with it, we just do the single lookup as part of GetValueRefOrAddDefault and then no additional lookup is necessary because we already have the location into which we can directly write. 
And as the lookups in this benchmark are one of the more costly elements, eliminating half of them can significantly reduce the cost of the operation. As part of this alternate key effort, a new overload of GetValueRefOrAddDefault was added that works with it, such that the same operation can be performed with a TAlternateKey.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.InteropServices;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic partial class Tests<br \/>\n{<br \/>\n    private static readonly string s_input = new HttpClient().GetStringAsync(&#8220;https:\/\/gutenberg.org\/cache\/epub\/2600\/pg2600.txt&#8221;).Result;<\/p>\n<p>    [GeneratedRegex(@&#8221;\\b\\w+\\b&#8221;)]<br \/>\n    private static partial Regex WordParser();<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public Dictionary&lt;string, int&gt; CountWords1()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<\/p>\n<p>        foreach (ValueMatch match in WordParser().EnumerateMatches(input))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n            string key = word.ToString();<br \/>\n            result[key] = result.TryGetValue(key, out int count) ?
count + 1 : 1;<br \/>\n        }<\/p>\n<p>        return result;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public Dictionary&lt;string, int&gt; CountWords2()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<br \/>\n        Dictionary&lt;string, int&gt;.AlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt; alternate = result.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;();<\/p>\n<p>        foreach (ValueMatch match in WordParser().EnumerateMatches(input))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n            alternate[word] = alternate.TryGetValue(word, out int count) ? count + 1 : 1;<br \/>\n        }<\/p>\n<p>        return result;<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public Dictionary&lt;string, int&gt; CountWords3()<br \/>\n    {<br \/>\n        ReadOnlySpan&lt;char&gt; input = s_input;<\/p>\n<p>        Dictionary&lt;string, int&gt; result = new(StringComparer.OrdinalIgnoreCase);<br \/>\n        Dictionary&lt;string, int&gt;.AlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt; alternate = result.GetAlternateLookup&lt;ReadOnlySpan&lt;char&gt;&gt;();<\/p>\n<p>        foreach (ValueMatch match in WordParser().EnumerateMatches(input))<br \/>\n        {<br \/>\n            ReadOnlySpan&lt;char&gt; word = input.Slice(match.Index, match.Length);<br \/>\n            CollectionsMarshal.GetValueRefOrAddDefault(alternate, word, out _)++;<br \/>\n        }<\/p>\n<p>        return result;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>CountWords1<br \/>\n60.73 ms<br \/>\n1.00<br \/>\n20.67 MB<br \/>\n1.00<\/p>\n<p>CountWords2<br \/>\n54.01 ms<br \/>\n0.89<br \/>\n2.54 MB<br \/>\n0.12<\/p>\n<p>CountWords3<br \/>\n44.38 ms<br \/>\n0.73<br \/>\n2.54 MB<br \/>\n0.12<\/p>\n<p>\u201cBut wait, there\u2019s 
more!\u201d <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104202\">dotnet\/runtime#104202<\/a> extends the alternate comparer implementation for string\/ReadOnlySpan&lt;char&gt; further to also apply to EqualityComparer&lt;string&gt;.Default, which means that if you don\u2019t supply a comparer at all, these collection types will still support ReadOnlySpan&lt;char&gt; lookups. That change not only then improves the usability of these new APIs, but it actually had an additional unintended but welcome performance benefit. Previously, EqualityComparer&lt;string&gt;.Default would return an internal GenericEqualityComparer&lt;string&gt; type, derived from EqualityComparer&lt;string&gt;. It wouldn\u2019t be possible to implement IAlternateEqualityComparer&lt;ReadOnlySpan&lt;char&gt;, string&gt; on GenericEqualityComparer&lt;string&gt; because doing so would actually have to be done on GenericEqualityComparer&lt;T&gt;, which would mean every EqualityComparer&lt;T&gt;.Default would support IAlternateEqualityComparer&lt;ReadOnlySpan&lt;char&gt;, T&gt;, and we have no correct way of providing such an implementation for all Ts. Instead, we introduced a new internal non-generic StringEqualityComparer type and made EqualityComparer&lt;T&gt;.Default return an instance of that when T is string (the implementation of Default already knows about and returns a bunch of specialized comparers, this is just one more). 
In doing so, it made the type that\u2019s used non-generic, which in turn means that in some situations, it eliminates some of the overhead associated with generics.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.CompilerServices;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private IEqualityComparer&lt;string&gt; _comparer = EqualityComparer&lt;string&gt;.Default;<br \/>\n    private string[] _values = Enumerable.Range(0, 1000).Select(i =&gt; i.ToString()).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public int Count() =&gt; CountEquals(_values, &#8220;500&#8221;, _comparer);<\/p>\n<p>    [MethodImpl(MethodImplOptions.NoInlining)]<br \/>\n    private static int CountEquals&lt;T&gt;(T[] haystack, T needle, IEqualityComparer&lt;T&gt; comparer)<br \/>\n    {<br \/>\n        int count = 0;<br \/>\n        foreach (T value in haystack)<br \/>\n        {<br \/>\n            if (comparer.Equals(value, needle))<br \/>\n            {<br \/>\n                count++;<br \/>\n            }<br \/>\n        }<br \/>\n        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\n4.477 us<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\n2.808 us<br \/>\n0.63<\/p>\n<p>HashSet&lt;T&gt; also gains all of these super powers, but several additional PRs went into making other performance improvements to it. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/85877\">dotnet\/runtime#85877<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a> added a TrimExcess(int capacity) method to HashSet&lt;T&gt; (as well as to Queue&lt;T&gt; and Stack&lt;T&gt;), enabling more fine-grained control over how much memory to cull from a set that might have grown larger than is now required. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102758\">dotnet\/runtime#102758<\/a> from <a href=\"https:\/\/github.com\/lilinus\">@lilinus<\/a> improved its IsSubsetOf, IsProperSubsetOf, and SetEquals methods by tweaking the fast paths already present. The methods were attempting to early-exit as soon as the condition could be proved true or false, but some common conditions were being missed, and this rectified those.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private HashSet&lt;int&gt; _set = new(Enumerable.Range(0, 1000));<br \/>\n    private List&lt;int&gt; _list = Enumerable.Range(0, 999).ToList();<\/p>\n<p>    [Benchmark]<br \/>\n    public bool IsSubsetOf() =&gt; _set.IsSubsetOf(_list);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>IsSubsetOf<br \/>\n.NET 8.0<br \/>\n7,351.373 ns<br \/>\n1.000<br \/>\n40 B<br \/>\n1.00<\/p>\n<p>IsSubsetOf<br \/>\n.NET 9.0<br \/>\n1.216 ns<br \/>\n0.000<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96573\">dotnet\/runtime#96573<\/a> from <a href=\"https:\/\/github.com\/ndsvw\">@ndsvw<\/a>
also identified a few places in various libraries where a Dictionary&lt;T, T&gt; was being used as a set and replaced them with HashSet&lt;T&gt;. The implementations of Dictionary&lt;&gt; and HashSet&lt;&gt; are very close in nature, but the latter consumes less memory because it doesn\u2019t need to store separate values. Using a Dictionary&lt;T, T&gt; effectively doubles the required storage, so if a HashSet&lt;T&gt; suffices, it\u2019s preferable.<\/p>\n<p>A variety of other collection types have also seen improvements in .NET 9:<\/p>\n<p><strong>PriorityQueue&lt;TElement, TPriority&gt;.<\/strong> The EnqueueRange(IEnumerable&lt;TElement&gt;, TPriority) method enables multiple elements to all be inserted at the same priority. If there are already elements in the collection, this is akin to just calling Enqueue for each. However, if the collection is currently empty, then it can skip the per-element addition costs and instead just store the elements directly into the backing array. After doing so, it was then performing a heapify operation. But <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99139\">dotnet\/runtime#99139<\/a> from <a href=\"https:\/\/github.com\/skyoxZ\">@skyoxZ<\/a> recognized that this heapify was entirely unnecessary, because all of the elements were inserted at the same priority, and there were no elements of any other priority already in the collection. Many performance optimizations come with trade-offs, making one common thing much faster at the expense of making some less common thing a little slower.
This, however, is my favorite kind of optimization: elimination of unnecessary work with effectively zero downside.<br \/>\nusing BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[DisassemblyDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private PriorityQueue&lt;int, int&gt; _pq = new();<br \/>\n    private int[] _elements = Enumerable.Range(0, 100).ToArray();<\/p>\n<p>    [Benchmark]<br \/>\n    public void EnqueueRange()<br \/>\n    {<br \/>\n        _pq.Clear();<br \/>\n        _pq.EnqueueRange(_elements, priority: 42);<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>EnqueueRange<br \/>\n.NET 8.0<br \/>\n239.3 ns<br \/>\n1.00<\/p>\n<p>EnqueueRange<br \/>\n.NET 9.0<br \/>\n206.7 ns<br \/>\n0.86<\/p>\n<p><strong>BitArray.<\/strong> Multiple methods on BitArray are already accelerated using Vector128 and Vector256, enabling much faster throughput for methods like And, Or, and Not. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91903\">dotnet\/runtime#91903<\/a> from <a href=\"https:\/\/github.com\/khushal1996\">@khushal1996<\/a> adds Vector512 support to all of these as well, enabling hardware with AVX512 support to process these operations upwards of twice as fast as before.<br \/>\nusing BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Collections;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private BitArray _first = new BitArray(1024 * 1024);<br \/>\n    private BitArray _second = new BitArray(1024 * 1024);<\/p>\n<p>    [Benchmark]<br \/>\n    public void Or() =&gt; _first.Or(_second);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Or<br \/>\n.NET 8.0<br \/>\n2.894 us<br \/>\n1.00<\/p>\n<p>Or<br \/>\n.NET 9.0<br \/>\n2.354 us<br \/>\n0.81<\/p>\n<p><strong>List&lt;T&gt;.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90089\">dotnet\/runtime#90089<\/a> from <a href=\"https:\/\/github.com\/karakasa\">@karakasa<\/a> avoided an extra Array.Copy call as part of Insert. Previously the implementation may have done up to three Array.Copy operations, and as part of this change that can drop to just two.<br \/>\n<strong>FrozenDictionary&lt;TKey, TValue&gt; and FrozenSet&lt;T&gt;.<\/strong> These frozen collections introduced in .NET 8 have also received some attention in .NET 9. As a reminder, FrozenDictionary&lt;TKey, TValue&gt; and FrozenSet&lt;T&gt; are immutable collections optimized for reading, willing to spend more time and effort during construction to make subsequent operations on the collections as fast as possible. 
When the TKey\/T is a string, one optimization employed is to track the minimum and maximum lengths of all strings in the collection; if a string that\u2019s shorter or longer than that is used in a lookup, the collection can immediately report that it\u2019s not in the collection without having to actually perform any lookup, instead just comparing against the min and max. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92546\">dotnet\/runtime#92546<\/a> from <a href=\"https:\/\/github.com\/andrewjsaid\">@andrewjsaid<\/a> extends this further by employing a bitmap of up to 64 bits corresponding to lengths of strings contained in the collection. On lookup, rather than only comparing against min\/max, the implementation can test whether the corresponding bit for the string\u2018s length is set, bailing immediately if it\u2019s not. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100998\">dotnet\/runtime#100998<\/a> also reduced creation overheads with frozen collections created with string keys and StringComparer.OrdinalIgnoreCase. 
The implementation had been using its own custom comparison logic for hash code generation, in order to support building for netstandard2.0 in addition to .NET Core, but this PR specialized the code for .NET Core to use string.GetHashCode(ReadOnlySpan&lt;char&gt;, StringComparison), which is more efficient than the custom implementation.<br \/>\nusing BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Collections.Frozen;<br \/>\nusing System.Text.RegularExpressions;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly FrozenSet&lt;string&gt; s_words = Regex.Matches(&#8220;&#8221;&#8221;<br \/>\n        Let me not to the marriage of true minds<br \/>\n        Admit impediments; love is not love<br \/>\n        Which alters when it alteration finds,<br \/>\n        Or bends with the remover to remove.<br \/>\n        O no, it is an ever-fixed mark<br \/>\n        That looks on tempests and is never shaken;<br \/>\n        It is the star to every wand&#8217;ring bark<br \/>\n        Whose worth&#8217;s unknown, although his height be taken.<br \/>\n        Love&#8217;s not time&#8217;s fool, though rosy lips and cheeks<br \/>\n        Within his bending sickle&#8217;s compass come.<br \/>\n        Love alters not with his brief hours and weeks,<br \/>\n        But bears it out even to the edge of doom:<br \/>\n        If this be error and upon me proved,<br \/>\n        I never writ, nor no man ever loved.<br \/>\n        &#8220;&#8221;&#8221;, @&#8221;\\b\\w+\\b&#8221;).Cast&lt;Match&gt;().Select(w =&gt; w.Value).ToFrozenSet();<br \/>\n    private string _word = &#8220;quickness&#8221;;<\/p>\n<p>    [GlobalSetup] public void Setup() =&gt; Console.WriteLine(s_words);<\/p>\n<p>    [Benchmark]<br \/>\n    public bool
Contains() =&gt; s_words.Contains(_word);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Contains<br \/>\n.NET 8.0<br \/>\n4.373 ns<br \/>\n1.00<\/p>\n<p>Contains<br \/>\n.NET 9.0<br \/>\n1.154 ns<br \/>\n0.26<\/p>\n<h2>Compression<\/h2>\n<p>It\u2019s an important goal of the core .NET libraries to be as platform-agnostic as possible. Things should generally behave the same way regardless of which operating system or which hardware is being used, excepting things that really are operating system or hardware specific (e.g. we purposefully don\u2019t try to paper over casing differences of different file systems). To that end, we generally implement as much as possible in C#, deferring down to the operating system and native platform libraries only when necessary; for example, the default .NET HTTP implementation, System.Net.Http.SocketsHttpHandler, is written in C# on top of System.Net.Sockets, System.Net.Dns, etc., and subject to the implementation of sockets on each platform (where behaviors are implemented by the operating system), generally behaves the same wherever you\u2019re running.<\/p>\n<p>There are, however, just a few specific places where we\u2019ve actively made the choice to defer more to something in the platform. The most important case here is cryptography, where we want to rely on the operating system for such security-related functionality; so on Windows, for example, TLS is implemented in terms of components like SChannel, on Linux in terms of OpenSSL, and on macOS in terms of SecureTransport. The other notable case has been compression, and in particular zlib. We decided long ago to simply use whatever zlib was distributed with the operating system. That has had various implications, however. Starting with the fact that Windows doesn\u2019t ship with zlib as a library exposed for consumption, so the .NET build targeting Windows still had to include its own copy of zlib. 
That was then improved but also complicated by a decision to switch to distributing a variant of zlib produced by Intel, which was nicely optimized further for x64, but which didn\u2019t have as much attention paid to other hardware, like Arm64. And very recently, the intel\/zlib repository was archived and is no longer actively maintained by Intel.<\/p>\n<p>To simplify things, to improve consistency and performance across more platforms, and to move to an actively supported and evolving implementation, this changes for .NET 9. Thanks to a stream of PRs, and in particular <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104454\">dotnet\/runtime#104454<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105771\">dotnet\/runtime#105771<\/a>, .NET 9 now includes the zlib functionality built-in across Windows, Linux, and macOS, based on the newer <a href=\"https:\/\/github.com\/zlib-ng\/zlib-ng\">zlib-ng\/zlib-ng<\/a>. zlib-ng is a zlib-compatible API that is actively maintained, includes improvements previously made to both the Intel and Cloudflare forks, and has been optimized with many different CPU intrinsics.<\/p>\n<p>Benchmarking just throughput is easy with BenchmarkDotNet. Unfortunately, while I love the tool, the lack of <a href=\"https:\/\/github.com\/dotnet\/BenchmarkDotNet\/issues\/784\">dotnet\/BenchmarkDotNet#784<\/a> makes it very challenging to appropriately benchmark compression, because throughput is only one part of the equation. Compression ratio is also a key consideration (you can make \u201ccompression\u201d super fast by just outputting the input without actually manipulating it at all), so we also need to know about compressed output size when discussing compression speeds.
To do that for this post, I\u2019ve hacked up just enough in this benchmark to make it work for this example, implementing a custom column for BenchmarkDotNet, but please note this is not a general-purpose implementation.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<br \/>\n\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Columns;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Reports;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.IO.Compression;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(<br \/>\n    args,<br \/>\n    DefaultConfig.Instance.AddColumn(new CompressedSizeColumn()));<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private byte[] _uncompressed = new HttpClient().GetByteArrayAsync(@&#8221;https:\/\/www.gutenberg.org\/cache\/epub\/3200\/pg3200.txt&#8221;).Result;<\/p>\n<p>    [Params(CompressionLevel.NoCompression, CompressionLevel.Fastest, CompressionLevel.Optimal, CompressionLevel.SmallestSize)]<br \/>\n    public CompressionLevel Level { get; set; }<\/p>\n<p>    private MemoryStream _compressed = new MemoryStream();<\/p>\n<p>    private long _compressedSize;<\/p>\n<p>    [Benchmark]<br \/>\n    public void Compress()<br \/>\n    {<br \/>\n        _compressed.Position = 0;<br \/>\n        _compressed.SetLength(0);<\/p>\n<p>        using (var ds = new DeflateStream(_compressed, Level, leaveOpen: true))<br \/>\n        {<br \/>\n            ds.Write(_uncompressed, 0, _uncompressed.Length);<br \/>\n        }<\/p>\n<p>        _compressedSize = _compressed.Length;<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void SaveSize()<br \/>\n    {<br \/>\n        File.WriteAllText(Path.Combine(Path.GetTempPath(), 
$&#8221;Compress_{Level}&#8221;), _compressedSize.ToString());<br \/>\n    }<br \/>\n}<\/p>\n<p>public class CompressedSizeColumn : IColumn<br \/>\n{<br \/>\n    public string Id =&gt; nameof(CompressedSizeColumn);<br \/>\n    public string ColumnName { get; } = &#8220;CompressedSize&#8221;;<br \/>\n    public bool AlwaysShow =&gt; true;<br \/>\n    public ColumnCategory Category =&gt; ColumnCategory.Custom;<br \/>\n    public int PriorityInCategory =&gt; 1;<br \/>\n    public bool IsNumeric =&gt; true;<br \/>\n    public UnitType UnitType { get; } = UnitType.Size;<br \/>\n    public string Legend =&gt; &#8220;CompressedSize Bytes&#8221;;<br \/>\n    public bool IsAvailable(Summary summary) =&gt; true;<br \/>\n    public bool IsDefault(Summary summary, BenchmarkCase benchmarkCase) =&gt; true;<br \/>\n    public string GetValue(Summary summary, BenchmarkCase benchmarkCase, SummaryStyle style) =&gt;<br \/>\n        GetValue(summary, benchmarkCase);<br \/>\n    public string GetValue(Summary summary, BenchmarkCase benchmarkCase) =&gt;<br \/>\n        File.ReadAllText(Path.Combine(Path.GetTempPath(), $&#8221;Compress_{benchmarkCase.Parameters.Items[0].Value}&#8221;)).Trim();<br \/>\n}<\/p>\n<p>Running that for .NET 8, I get this:<\/p>\n<p>Method<br \/>\nLevel<br \/>\nMean<br \/>\nCompressedSize<\/p>\n<p>Compress<br \/>\nNoCompression<br \/>\n1.783 ms<br \/>\n16015049<\/p>\n<p>Compress<br \/>\nFastest<br \/>\n164.495 ms<br \/>\n7312367<\/p>\n<p>Compress<br \/>\nOptimal<br \/>\n620.987 ms<br \/>\n6235314<\/p>\n<p>Compress<br \/>\nSmallestSize<br \/>\n867.076 ms<br \/>\n6208245<\/p>\n<p>and for .NET 9, I get this:<\/p>\n<p>Method<br \/>\nLevel<br \/>\nMean<br \/>\nCompressedSize<\/p>\n<p>Compress<br \/>\nNoCompression<br \/>\n1.814 ms<br \/>\n16015049<\/p>\n<p>Compress<br \/>\nFastest<br \/>\n64.345 ms<br \/>\n9578398<\/p>\n<p>Compress<br \/>\nOptimal<br \/>\n230.646 ms<br \/>\n6276158<\/p>\n<p>Compress<br \/>\nSmallestSize<br \/>\n567.579 ms<br \/>\n6215048<\/p>\n<p>A 
few interesting things to note here:<\/p>\n<p>On both .NET 8 and .NET 9, there\u2019s an obvious correlation: the more compression is requested, the slower it gets and the smaller the file size becomes.<br \/>\nNoCompression, which really just echoes the input bytes back as output, produces the exact same compressed size across .NET 8 and .NET 9, as one would hope; the compressed size should be identical to the input size.<br \/>\nThe compressed size for SmallestSize is almost the same between .NET 8 and .NET 9; they differ by only ~0.1%, but for that small increase, the SmallestSize throughput ends up being ~35% faster. In both cases, the .NET layer is just passing down a zlib compression level of 9, which is the largest value possible and denotes best-possible compression. It just happens that with zlib-ng, that best possible compression is significantly faster and just a tad bit worse compression-ratio-wise than with zlib.<br \/>\nFor Optimal, which is also the default and represents a balanced tradeoff between speed and compression ratio (with 20\/20 hindsight, the name for this member should have been Balanced), the .NET 9 version using zlib-ng is 60% faster while only sacrificing ~0.6% on compression ratio.<br \/>\nFastest is interesting. The .NET implementation is just passing down a compression level of 1 to the zlib-ng native code, indicating to choose the fastest speed while still doing <em>some<\/em> compression (0 means don\u2019t compress at all). But the zlib-ng implementation is obviously making different trade-offs than did the older zlib code, as it\u2019s truer to its name: it\u2019s running more than 2x as fast and still compressing, but the compressed output is ~30% larger than the output on .NET 8.<\/p>\n<p>The net effect of this is, especially if you\u2019re using Fastest, you might want to re-evaluate to see whether the throughput \/ compression ratios meet your needs.
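<p>One quick way to do that re-evaluation is a simple standalone measurement rather than a full benchmark. Here is a minimal sketch; the synthetic buffer is just a stand-in for whatever payload your application actually compresses:<\/p>

```csharp
// Quick spot-check (not a rigorous benchmark): measure speed and compressed
// size for each CompressionLevel on a sample payload.
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

// Synthetic input: 1/4 random bytes, 3/4 zeros, so it's somewhat compressible.
byte[] input = new byte[1_000_000];
new Random(42).NextBytes(input.AsSpan(0, input.Length / 4));

foreach (CompressionLevel level in new[] { CompressionLevel.NoCompression, CompressionLevel.Fastest,
                                           CompressionLevel.Optimal, CompressionLevel.SmallestSize })
{
    var compressed = new MemoryStream();
    long start = Stopwatch.GetTimestamp();
    using (var ds = new DeflateStream(compressed, level, leaveOpen: true))
    {
        ds.Write(input, 0, input.Length);
    }
    TimeSpan elapsed = Stopwatch.GetElapsedTime(start); // Stopwatch.GetElapsedTime requires .NET 7+
    Console.WriteLine($"{level,-14} {elapsed.TotalMilliseconds,8:F2} ms  {compressed.Length,9:N0} bytes");
}
```

Running that on your real data makes it easy to see whether, say, Fastest's larger output is an acceptable trade for its speed.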
If you want to tweak it further, though, you\u2019re no longer limited to just these options. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105430\">dotnet\/runtime#105430<\/a> adds new constructors to DeflateStream, GZipStream, ZLibStream, and also the unrelated BrotliStream, enabling more fine-grained control over the parameters passed to the native implementations, e.g.<\/p>\n<p>private static readonly ZLibCompressionOptions s_options = new ZLibCompressionOptions()<br \/>\n{<br \/>\n    CompressionLevel = 2,<br \/>\n};<br \/>\n&#8230;<br \/>\nStream sourceStream = &#8230;;<br \/>\nusing (var ds = new DeflateStream(compressed, s_options, leaveOpen: true))<br \/>\n{<br \/>\n    sourceStream.CopyTo(ds);<br \/>\n}<\/p>\n<h2>Cryptography<\/h2>\n<p>Investments in System.Security.Cryptography are generally focused on improving the security of a system, supporting new cryptographic primitives, better integrating with security capabilities of the underlying operating system, and so on. But as cryptography is ever present in most modern systems, it\u2019s also impactful to make the existing functionality more efficient, and a variety of PRs in .NET 9 have done so.<\/p>\n<p>Let\u2019s start with random number generation. .NET 8 added a new GetItems method to both Random (the core non-cryptographically-secure random number generator) and RandomNumberGenerator (the core cryptographically-secure random number generator). This method is very handy when you need to randomly generate N elements sourced from a specific set of values.
For example, if you wanted to write 100 random hex characters to a destination Span&lt;char&gt;, you could do:<\/p>\n<p>Span&lt;char&gt; dest = stackalloc char[100];<br \/>\nRandom.Shared.GetItems(&#8220;0123456789abcdef&#8221;, dest);<\/p>\n<p>The core implementation is very simple, and is just a convenience for something you could easily do yourself:<\/p>\n<p>for (int i = 0; i &lt; dest.Length; i++)<br \/>\n{<br \/>\n    dest[i] = choices[Next(choices.Length)];<br \/>\n}<\/p>\n<p>Easy peasy. However, in some situations we can do better. This implementation ends up making a call to the random number generator for each element, and that roundtrip adds measurable overhead. If we could instead make fewer calls, we could amortize that overhead across however many elements could be filled by that single call. That\u2019s exactly what <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92229\">dotnet\/runtime#92229<\/a> does. If the number of choices is less than or equal to 256 and a power of two, rather than asking for a random integer for each element, we can instead get a byte for each element, and we can do that in bulk with a single call to NextBytes. The max of 256 choices is because that\u2019s the number of values a byte can represent, and the power of two is so that we can simply mask off unnecessary bits from the byte, which helps to avoid bias.
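<p>The bulk byte-plus-mask idea can be sketched as follows. Note that GetItemsPow2 is a hypothetical helper written purely for illustration, not the actual Random.GetItems implementation:<\/p>

```csharp
// Sketch of the power-of-two optimization: with 16 hex choices, each element
// needs only 4 random bits, so one bulk NextBytes call plus a mask covers the
// whole destination without bias. (GetItemsPow2 is a made-up name; the real
// logic lives inside Random.GetItems.)
using System;

static void GetItemsPow2<T>(Random random, ReadOnlySpan<T> choices, Span<T> destination)
{
    // Only valid when choices.Length is a power of two and <= 256.
    int mask = choices.Length - 1;                // e.g. 16 choices => mask == 0b1111
    byte[] randomBytes = new byte[destination.Length];
    random.NextBytes(randomBytes);                // one bulk call instead of one per element
    for (int i = 0; i < destination.Length; i++)
    {
        destination[i] = choices[randomBytes[i] & mask]; // masked byte is a uniform index
    }
}

Span<char> dest = stackalloc char[100];
GetItemsPow2(Random.Shared, "0123456789abcdef", dest);
Console.WriteLine(dest.ToString());
```

Because every one of the 256 byte values maps to exactly choices.Length / 256 of the possible masked results, each choice remains equally likely; with a non-power-of-two count, some choices would be hit more often than others.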
This makes a measurable impact for Random, but even more so for RandomNumberGenerator, where each call to get random bytes requires a transition into the operating system.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Security.Cryptography;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private char[] _dest = new char[100];<\/p>\n<p>    [Benchmark]<br \/>\n    public void GetRandomHex() =&gt; RandomNumberGenerator.GetItems&lt;char&gt;(&#8220;0123456789abcdef&#8221;, _dest);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>GetRandomHex<br \/>\n.NET 8.0<br \/>\n58,659.2 ns<br \/>\n1.00<\/p>\n<p>GetRandomHex<br \/>\n.NET 9.0<br \/>\n746.5 ns<br \/>\n0.01<\/p>\n<p>Sometimes performance improvements are about revisiting past assumptions. .NET 5 added a new GC.AllocateArray method which optionally allows that array to be created on the \u201cpinned object heap,\u201d or POH. Allocating on the POH is the same as allocating normally, except that the GC guarantees that objects on the POH won\u2019t be moved (normally the GC is free to compact the heap, moving objects around in order to reduce fragmentation). This is a useful guarantee for cryptography, which employs defense-in-depth measures like zero\u2019ing out buffers to reduce the chances of an attacker being able to find sensitive information in the memory (or memory dump) of a process. 
The crypto library wants to be able to allocate some memory, use it to temporarily contain some sensitive information, and then zero out the memory when it\u2019s done with it, but if the GC is able to move the object around in the interim, it could end up leaving shadows of the data on the heap. When the POH was introduced, then, System.Security.Cryptography started using it, including for relatively short-lived objects. This is potentially problematic, however. Because the nature of the POH is that objects can\u2019t be moved around, creating short-lived objects on the POH can significantly increase fragmentation, which can in turn increase memory consumption, GC costs, and so on. And as a result, the POH is really only recommended for long-lived objects, ideally ones that you create and then hold onto for the remainder of the process\u2019 lifetime. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99168\">dotnet\/runtime#99168<\/a> undid System.Security.Cryptography\u2019s reliance on the POH, instead preferring to use native memory (e.g. via NativeMemory.Alloc and NativeMemory.Free) for such needs.<\/p>\n<p>On the subject of memory, multiple PRs went into the crypto libraries to reduce allocation. Here are some examples:<\/p>\n<p><strong>Marshaling pointers instead of temporary arrays.<\/strong> The CngKey type exposes properties like ExportPolicy, IsMachineKey, and KeyUsage, all of which utilize an internal GetPropertyAsDword method that P\/Invokes to retrieve an integer from Windows. It was doing so, however, via a shared helper that was allocating a 4-byte byte[], passing that to the OS to fill, and then converting those four bytes into an int.
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91521\">dotnet\/runtime#91521<\/a> changed the interop path to instead just store the int on the stack, passing a pointer to it to the OS, avoiding the need to allocate and parse.<br \/>\n<strong>Special-casing empty.<\/strong> Throughout the core libraries, we rely heavily on Array.Empty&lt;T&gt;() to avoid allocating lots of empty arrays when we could instead just employ singletons. The crypto libraries work with a lot of arrays, and as part of defense-in-depth, will often hand out clones of those arrays rather than handing out the same array to everyone; that\u2019s handled by a shared CloneByteArray helper. As it turns out, however, it\u2019s reasonably common for arrays to be empty, yet CloneByteArray wasn\u2019t special-casing them, and was thus always allocating new arrays even if the input was empty. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93231\">dotnet\/runtime#93231<\/a> simply special-cased empty input arrays to return themselves rather than clone them.<br \/>\n<strong>Avoiding unnecessary defensive copies.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97108\">dotnet\/runtime#97108<\/a> avoids more defensive copies than just those for empty arrays mentioned above. The PublicKey type is passed two AsnEncodedData instances, one for parameters and one for a key value, and both of which it clones to avoid any issues that might arise with that provided instance being mutated. But in some internal uses, the caller is constructing a temporary AsnEncodedData and effectively transferring ownership, yet PublicKey would still then make a defensive copy, even though the temporary could have just been used in its stead. 
This change enables the original instances to just be used without copy in such cases.<br \/>\n<strong>Using collection expressions with spans.<\/strong> One of the really neat things about the collection expressions feature introduced in C# 12 is that it allows you to express your intent for what you want and lets the system implement it as best it can. As part of initializing OidLookup, the code had multiple lines that look like this:<br \/>\nAddEntry(&#8220;1.2.840.10045.3.1.7&#8221;, &#8220;ECDSA_P256&#8221;, new[] { &#8220;nistP256&#8221;, &#8220;secP256r1&#8221;, &#8220;x962P256v1&#8221;, &#8220;ECDH_P256&#8221; });<br \/>\nAddEntry(&#8220;1.3.132.0.34&#8221;, &#8220;ECDSA_P384&#8221;, new[] { &#8220;nistP384&#8221;, &#8220;secP384r1&#8221;, &#8220;ECDH_P384&#8221; });<br \/>\nAddEntry(&#8220;1.3.132.0.35&#8221;, &#8220;ECDSA_P521&#8221;, new[] { &#8220;nistP521&#8221;, &#8220;secP521r1&#8221;, &#8220;ECDH_P521&#8221; });<\/p>\n<p>This effectively forced it to allocate these arrays, even though the AddEntry method doesn\u2019t actually require the array-ness and just iterates through the supplied values.
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100252\">dotnet\/runtime#100252<\/a> changed AddEntry to take a ReadOnlySpan&lt;string&gt; instead of string[], and changed all the call sites to be collection expressions:<\/p>\n<p>AddEntry(&#8220;1.2.840.10045.3.1.7&#8221;, &#8220;ECDSA_P256&#8221;, [&#8220;nistP256&#8221;, &#8220;secP256r1&#8221;, &#8220;x962P256v1&#8221;, &#8220;ECDH_P256&#8221;]);<br \/>\nAddEntry(&#8220;1.3.132.0.34&#8221;, &#8220;ECDSA_P384&#8221;, [&#8220;nistP384&#8221;, &#8220;secP384r1&#8221;, &#8220;ECDH_P384&#8221;]);<br \/>\nAddEntry(&#8220;1.3.132.0.35&#8221;, &#8220;ECDSA_P521&#8221;, [&#8220;nistP521&#8221;, &#8220;secP521r1&#8221;, &#8220;ECDH_P521&#8221;]);<\/p>\n<p>allowing the compiler to do the \u201cright thing.\u201d All of those call sites then instead just end up using stack space to store the strings passed to AddEntry, rather than allocating any arrays at all.<\/p>\n<p><strong>Presizing collections.<\/strong> Many collections, such as List&lt;T&gt; or Dictionary&lt;TKey, TValue&gt;, allow you to create a new one, with no a priori knowledge of how large they\u2019ll grow to be, and internally they handle growing their storage to accommodate additional data. The growth algorithm employed typically involves doubling capacity, as doing so strikes a reasonable balance between possibly wasting some memory and not having to re-grow too frequently. However, such growing does have overhead, avoiding it is desirable, and so many collections offer the ability to pre-size the capacity of the collection, e.g. List&lt;T&gt; has a constructor that accepts an int capacity, where the list will immediately create a backing store large enough to accommodate that many elements. 
The OidCollection in cryptography didn\u2019t have such a capability even though many of the places it was being created did know the exact required size, which in turn resulted in unnecessary allocation and copying as the collection grows to reach the target size. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97106\">dotnet\/runtime#97106<\/a> added such a constructor internally and used it in various places, in order to avoid that overhead. As with OidCollection, CborWriter also lacked the ability to presize, making the aforementioned growth algorithm problem even more stark. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92538\">dotnet\/runtime#92538<\/a> added such a constructor.<br \/>\n<strong>Avoiding O(N^2) growth algorithms.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92435\">dotnet\/runtime#92435<\/a> from <a href=\"https:\/\/github.com\/MichalPetryka\">@MichalPetryka<\/a> addresses a good example of what happens when you <em>don\u2019t<\/em> employ such a doubling scheme as part of collection resizing. The algorithm used to grow the buffer used by CborWriter would increase the backing buffer by a fixed number of elements each time. A doubling strategy ensures you need no more than O(log N) growth operations, and ensures that N items can be added to a collection in O(N) time, since the number of element copies will be O(2N), which is just O(N) (e.g. if N == 128, and you start with a buffer of size 1, and you grow to 2, then 4, 8, 16, 32, 64, and 128, that\u2019s 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128, which is 255, or just under twice N). But increasing by a fixed number can mean O(N) such operations. And since each growth operation also needs to copy all the elements (assuming the growing is done by array resizing), that makes the algorithm O(N^2).
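The asymptotic difference is easy to see with a small simulation (illustrative only; the fixed increment of 16 here is made up and isn't CborWriter's actual constant):

```csharp
using System;
using System.Diagnostics;

// Count how many element copies resizing performs across N appends for a given
// growth policy, assuming each resize copies every existing element.
static long CopiesFor(int itemCount, Func<int, int> grow)
{
    long copies = 0;
    int capacity = 1;
    for (int count = 0; count < itemCount; count++)
    {
        if (count == capacity)
        {
            copies += count;           // the resize copies all current elements
            capacity = grow(capacity);
        }
    }
    return copies;
}

long doubling = CopiesFor(100_000, c => c * 2);  // O(N) total copies
long fixedAdd = CopiesFor(100_000, c => c + 16); // O(N^2) total copies
Console.WriteLine($"doubling: {doubling:N0}, fixed +16: {fixedAdd:N0}");
```

For 100,000 appends, doubling performs 131,071 element copies in total, while growing by a fixed 16 performs 312,456,250, roughly 2,400 times as many.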
In the extreme, if that fixed number was 1, and we were again growing from 1 to 128 one at a time, that\u2019s just summing all the numbers from 1 to 128, the formula for which is N(N+1)\/2, which is O(N^2). This PR changed CborWriter\u2019s growth strategy to use doubling instead.<br \/>\n\/\/ Add a &lt;PackageReference Include=&#8221;System.Formats.Cbor&#8221; Version=&#8221;8.0.0&#8243; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Formats.Cbor;<\/p>\n<p>var config = DefaultConfig.Instance<br \/>\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(&#8220;System.Formats.Cbor&#8221;, &#8220;8.0.0&#8221;).AsBaseline())<br \/>\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(&#8220;System.Formats.Cbor&#8221;, &#8220;9.0.0-rc.1.24431.7&#8221;));<br \/>\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public CborWriter Test()<br \/>\n    {<br \/>\n        const int NumArrayElements = 100_000;<\/p>\n<p>        CborWriter writer = new();<br \/>\n        writer.WriteStartArray(NumArrayElements);<br \/>\n        for (int i = 0; i &lt; NumArrayElements; i++)<br \/>\n        {<br \/>\n            writer.WriteInt32(i);<br \/>\n        }<br \/>\n        writer.WriteEndArray();<\/p>\n<p>        return writer;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Test<br \/>\n.NET 8.0<br \/>\n25,185.2 us<br
\/>\n1.00<br \/>\n65350.11 KB<br \/>\n1.00<\/p>\n<p>Test<br \/>\n.NET 9.0<br \/>\n697.2 us<br \/>\n0.03<br \/>\n1023.82 KB<br \/>\n0.02<\/p>\n<p>Of course, improving performance is more than just avoiding allocation. A variety of changes helped in other ways.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99053\">dotnet\/runtime#99053<\/a> \u201cmemoizes\u201d (caches) various properties on CngKey that are accessed multiple times but where the answer stays the same from call to call; it does so simply by adding a few fields to the type to cache these values, which is a significant win if any is accessed multiple times over the lifetime of the object. The affected properties (Algorithm, AlgorithmGroup, and Provider) are particularly expensive because the OS implementation of these functions needs to make a remote procedure call to another Windows process to access the relevant data.<\/p>\n<p>\/\/ Windows-only test.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Security.Cryptography;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private RSACng _rsa = new RSACng(2048);<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup() =&gt; _rsa.Dispose();<\/p>\n<p>    [Benchmark]<br \/>\n    public CngAlgorithm GetAlgorithm() =&gt; _rsa.Key.Algorithm;<\/p>\n<p>    [Benchmark]<br \/>\n    public CngAlgorithmGroup? GetAlgorithmGroup() =&gt; _rsa.Key.AlgorithmGroup;<\/p>\n<p>    [Benchmark]<br \/>\n    public CngProvider? 
GetProvider() =&gt; _rsa.Key.Provider;<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>GetAlgorithm<br \/>\n.NET 8.0<br \/>\n63,619.352 ns<br \/>\n1.000<br \/>\n88 B<br \/>\n1.00<\/p>\n<p>GetAlgorithm<br \/>\n.NET 9.0<br \/>\n10.216 ns<br \/>\n0.000<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>GetAlgorithmGroup<br \/>\n.NET 8.0<br \/>\n62,580.363 ns<br \/>\n1.000<br \/>\n88 B<br \/>\n1.00<\/p>\n<p>GetAlgorithmGroup<br \/>\n.NET 9.0<br \/>\n8.354 ns<br \/>\n0.000<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>GetProvider<br \/>\n.NET 8.0<br \/>\n62,108.489 ns<br \/>\n1.000<br \/>\n232 B<br \/>\n1.00<\/p>\n<p>GetProvider<br \/>\n.NET 9.0<br \/>\n8.393 ns<br \/>\n0.000<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>There were also several improvements related to loading certificates and keys. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97267\">dotnet\/runtime#97267<\/a> from <a href=\"https:\/\/github.com\/birojnayak\">@birojnayak<\/a> addressed an issue on Linux where the same certificate was being processed multiple times rather than just once, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97827\">dotnet\/runtime#97827<\/a> improved the performance of RSA key loading by avoiding some unnecessary work that the key validation was performing.<\/p>\n<h2>Networking<\/h2>\n<p>Quick, when was the last time you worked on a real application or service that didn\u2019t involve networking at all? I\u2019ll wait\u2026 (I\u2019m so funny.) Effectively every modern application relies on networking in one way, shape, or form, especially one that\u2019s following more cloud-native architectures, involving microservices, and the like. Driving down the costs associated with networking is something we take very seriously, and the .NET community whittles away at these costs every release, including in .NET 9.<\/p>\n<p>SslStream has been a key focus for performance optimization in past releases. 
It\u2019s used by a significant portion of traffic with both HttpClient and the ASP.NET Kestrel web server, putting it on the hot path for many systems. Previous improvements have targeted both steady-state throughput as well as creation overhead.<\/p>\n<p>In .NET 9, a few PRs focused on steady-state throughput, such as <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95595\">dotnet\/runtime#95595<\/a>, which addressed an issue where some packets were being unnecessarily split into two, leading to extra overhead associated with needing to send and receive that extra packet. This was particularly impactful when writing out exactly 16K, and especially on Windows (where I\u2019ve run this test):<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Net;<br \/>\nusing System.Net.Security;<br \/>\nusing System.Net.Sockets;<br \/>\nusing System.Runtime.InteropServices;<br \/>\nusing System.Security.Authentication;<br \/>\nusing System.Security.Cryptography.X509Certificates;<br \/>\nusing System.Security.Cryptography;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private SslStream _client, _server;<br \/>\n    private byte[] _buffer = new byte[16 * 1024];<br \/>\n    private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions<br \/>\n    {<br \/>\n        ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),<br \/>\n        EnabledSslProtocols = SslProtocols.Tls13,<br \/>\n    };<br \/>\n    private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions<br \/>\n    {<br \/>\n        TargetHost = 
&#8220;localhost&#8221;,<br \/>\n        RemoteCertificateValidationCallback = delegate { return true; },<br \/>\n        EnabledSslProtocols = SslProtocols.Tls13,<br \/>\n    };<\/p>\n<p>    [GlobalSetup]<br \/>\n    public async Task Setup()<br \/>\n    {<br \/>\n        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);<br \/>\n        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));<br \/>\n        listener.Listen(1);<\/p>\n<p>        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };<br \/>\n        client.Connect(listener.LocalEndPoint!);<\/p>\n<p>        Socket serverSocket = listener.Accept();<br \/>\n        serverSocket.NoDelay = true;<\/p>\n<p>        _client = new SslStream(new NetworkStream(client, ownsSocket: true), leaveInnerStreamOpen: true);<br \/>\n        _server = new SslStream(new NetworkStream(serverSocket, ownsSocket: true), leaveInnerStreamOpen: true);<\/p>\n<p>        await Task.WhenAll(<br \/>\n            _client.AuthenticateAsClientAsync(_clientOptions),<br \/>\n            _server.AuthenticateAsServerAsync(_serverOptions));<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup()<br \/>\n    {<br \/>\n        _client.Dispose();<br \/>\n        _server.Dispose();<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task SendReceive()<br \/>\n    {<br \/>\n        await _client.WriteAsync(_buffer);<br \/>\n        await _server.ReadExactlyAsync(_buffer);<br \/>\n    }<\/p>\n<p>    private static X509Certificate2 GetCertificate()<br \/>\n    {<br \/>\n        X509Certificate2 cert;<br \/>\n        using (RSA rsa = RSA.Create())<br \/>\n        {<br \/>\n            var certReq = new CertificateRequest(&#8220;CN=localhost&#8221;, rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);<br \/>\n            certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));<br 
\/>\n            certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid(&#8220;1.3.6.1.5.5.7.3.1&#8221;) }, false));<br \/>\n            certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));<br \/>\n            cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));<br \/>\n            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))<br \/>\n            {<br \/>\n                cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));<br \/>\n            }<br \/>\n        }<br \/>\n        return cert;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>SendReceive<br \/>\n.NET 8.0<br \/>\n43.07 us<br \/>\n1.00<\/p>\n<p>SendReceive<br \/>\n.NET 9.0<br \/>\n29.38 us<br \/>\n0.68<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100513\">dotnet\/runtime#100513<\/a> also reduced the cost of checking SslStream.IsMutuallyAuthenticated or SslStream.LocalCertificate from a client when a client certificate is being used.<\/p>\n<p>However, the bigger impacts in .NET 9 weren\u2019t on steady-state throughput but rather on TLS connection establishment, aka the handshake. Establishing a TLS connection requires the client and server to engage in a conversation where they agree on details like TLS version, what cipher suite to use, confirm the other is who they say they are, create and exchange dedicated symmetric keys for the communication, and so on. That\u2019s a relatively expensive endeavor. 
For long-lived connections, that overhead is generally not a big deal, but there are plenty of scenarios where connections are more routinely established and torn down, and for those, we want to drive down the overhead associated with setting up such a connection.<\/p>\n<p><a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/87874\">dotnet\/runtime#87874<\/a> focused on reducing allocations associated with the TLS handshake, by renting some buffers from ArrayPool&lt;byte&gt; rather than always allocating. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97348\">dotnet\/runtime#97348<\/a> continued the effort by avoiding some unnecessary SafeHandle allocation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103814\">dotnet\/runtime#103814<\/a> also switched a rarely-needed ConcurrentDictionary&lt;&gt; in the Linux implementation to be lazily allocated rather than always being allocated as part of the handshake. These changes combine to significantly reduce the allocation incurred as part of setting up TLS:<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Net;<br \/>\nusing System.Net.Security;<br \/>\nusing System.Net.Sockets;<br \/>\nusing System.Runtime.InteropServices;<br \/>\nusing System.Security.Authentication;<br \/>\nusing System.Security.Cryptography.X509Certificates;<br \/>\nusing System.Security.Cryptography;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private NetworkStream _client, _server;<br \/>\n    private readonly SslServerAuthenticationOptions _serverOptions = new SslServerAuthenticationOptions<br \/>\n    {<br \/>\n        ServerCertificateContext = SslStreamCertificateContext.Create(GetCertificate(), null),<br \/>\n        
EnabledSslProtocols = SslProtocols.Tls13,<br \/>\n    };<br \/>\n    private readonly SslClientAuthenticationOptions _clientOptions = new SslClientAuthenticationOptions<br \/>\n    {<br \/>\n        TargetHost = &#8220;localhost&#8221;,<br \/>\n        RemoteCertificateValidationCallback = delegate { return true; },<br \/>\n        EnabledSslProtocols = SslProtocols.Tls13,<br \/>\n    };<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);<br \/>\n        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));<br \/>\n        listener.Listen(1);<\/p>\n<p>        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };<br \/>\n        client.Connect(listener.LocalEndPoint!);<\/p>\n<p>        Socket serverSocket = listener.Accept();<br \/>\n        serverSocket.NoDelay = true;<br \/>\n        _server = new NetworkStream(serverSocket, ownsSocket: true);<br \/>\n        _client = new NetworkStream(client, ownsSocket: true);<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup()<br \/>\n    {<br \/>\n        _client.Dispose();<br \/>\n        _server.Dispose();<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task Handshake()<br \/>\n    {<br \/>\n        using var client = new SslStream(_client, leaveInnerStreamOpen: true);<br \/>\n        using var server = new SslStream(_server, leaveInnerStreamOpen: true);<\/p>\n<p>        await Task.WhenAll(<br \/>\n            client.AuthenticateAsClientAsync(_clientOptions),<br \/>\n            server.AuthenticateAsServerAsync(_serverOptions));<br \/>\n    }<\/p>\n<p>    private static X509Certificate2 GetCertificate()<br \/>\n    {<br \/>\n        X509Certificate2 cert;<br \/>\n        using (RSA rsa = RSA.Create())<br \/>\n        {<br \/>\n            var certReq = new 
CertificateRequest(&#8220;CN=localhost&#8221;, rsa, HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);<br \/>\n            certReq.CertificateExtensions.Add(new X509BasicConstraintsExtension(false, false, 0, false));<br \/>\n            certReq.CertificateExtensions.Add(new X509EnhancedKeyUsageExtension(new OidCollection { new Oid(&#8220;1.3.6.1.5.5.7.3.1&#8221;) }, false));<br \/>\n            certReq.CertificateExtensions.Add(new X509KeyUsageExtension(X509KeyUsageFlags.DigitalSignature, false));<br \/>\n            cert = certReq.CreateSelfSigned(DateTimeOffset.UtcNow.AddMonths(-1), DateTimeOffset.UtcNow.AddMonths(1));<br \/>\n            if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))<br \/>\n            {<br \/>\n                cert = new X509Certificate2(cert.Export(X509ContentType.Pfx));<br \/>\n            }<br \/>\n        }<br \/>\n        return cert;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Handshake<br \/>\n.NET 8.0<br \/>\n2.652 ms<br \/>\n1.00<br \/>\n5.03 KB<br \/>\n1.00<\/p>\n<p>Handshake<br \/>\n.NET 9.0<br \/>\n2.581 ms<br \/>\n0.97<br \/>\n3.3 KB<br \/>\n0.66<\/p>\n<p>Of course, while driving down the costs of doing something is good, avoiding that thing altogether is even better. \u201cTLS resumption\u201d is a capability in the TLS protocol where, if a connection is closed and the same client later opens a new connection to the same server, the client may be able to effectively pick up where it left off with the previous TLS connection rather than starting a brand new one from scratch. 
Support for TLS resumption on Linux was added in .NET 7, but clients using client certificates weren\u2019t supported\u2026 now in .NET 9 thanks to <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102656\">dotnet\/runtime#102656<\/a>, even such clients can benefit from this significant time saver.<\/p>\n<p>TLS resumption is an optimization where information is stored to enable more efficient operation later. In some ways, it\u2019s not unlike pooling in that regard. We frequently talk about pooling as a way to optimize. Often our conversations are around avoiding allocations, where employing a pool is betting that you can be more efficient than the garbage collector. For small, cheap-to-create objects, that\u2019s often a bad bet. For larger objects, such as for larger arrays, it can be a good bet, which is why ArrayPool&lt;T&gt; exists and is used throughout the core libraries for getting temporary buffers (see this <a href=\"https:\/\/www.youtube.com\/watch?v=bw7ljmvbrr0\">Deep .NET discussion on ArrayPool<\/a> for more info). But there\u2019s a much more impactful class where pooling is useful, and that\u2019s where the thing being pooled is really expensive to create. Such cases are no longer about memory management; they\u2019re about amortizing the cost of that creation. And out of all of the pooling done throughout the core libraries, it\u2019s hard to imagine a more impactful case of that than the HTTP connection pool. The objects in this pool represent established connections to an HTTP server, and establishing such connections can be measured in milliseconds, or even seconds in certain environments. If such costs had to be paid every single time you were making an HTTP request, it would add huge latency throughout the system.
Instead, outgoing HTTP requests try to grab a connection from the connection pool, reusing that connection for the individual request\/response, and then putting the connection back into the pool when done.<\/p>\n<p>However, as with any pool, the pool itself has cost. In the case of the HTTP connection pool in a SocketsHttpHandler instance, the most important factor impacting performance is how quickly a connection can be rented and returned to the pool, especially when under load. That load aspect is important, because as a shared resource, access to this pool must be synchronized, in order to ensure the correctness of the system: it\u2019d be really bad, for example, if two requests went to rent a connection at the same time and ended up incorrectly being given the same connection to use, concurrently. \u201cReally bad\u201d in such a case could not only be corrupting data, but possibly even sending the wrong data to the wrong server. That obviously needs to be avoided. So, synchronization is used, but that synchronization creates a bottleneck, where under load lots of requests can end up being blocked just waiting to check whether a connection is even available. Over the years we\u2019ve whittled away at that cost, but it gets even lower in .NET 9, in particular for HTTP\/1.1 connections (we talk about \u201cthe\u201d pool, but in reality connections are only pooled together when they\u2019re interchangeable, so there are actually many groupings of connections, for example with HTTP\/1.1 connections separate from HTTP\/2 connections or HTTP\/3 connections, a separate pool for each endpoint, etc.). <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/99364\">dotnet\/runtime#99364<\/a> changes the synchronization mechanism from using a pure lock-based scheme to a more opportunistic concurrency scheme that employs a first-layer of lockless synchronization. 
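The shape of that scheme can be sketched as follows (a simplified illustration of the idea, not SocketsHttpHandler's actual pool; the ConnectionPool type here is hypothetical):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;

var pool = new ConnectionPool<object>(() => new object());
object conn = pool.Rent();  // lock-free TryPop when the pool has a connection
pool.Return(conn);          // lock-free Push on the way back

// Simplified illustration of a lockless first layer: the hot path never takes
// a lock, and the lock-based slow path only runs when the pool is empty.
sealed class ConnectionPool<T> where T : class
{
    private readonly ConcurrentStack<T> _connections = new();
    private readonly Func<T> _createConnection;

    public ConnectionPool(Func<T> createConnection) => _createConnection = createConnection;

    public T Rent()
    {
        // Hot path: lock-free as long as a pooled connection is available.
        if (_connections.TryPop(out T connection))
        {
            return connection;
        }

        // Slow path: the real pool's locking, connection limits, and waiting
        // live here; this sketch just creates a new connection.
        return _createConnection();
    }

    // Returning is a lock-free Push (ConcurrentStack allocates a node per Push).
    public void Return(T connection) => _connections.Push(connection);
}
```

Renting and returning on the hot path thus reduce to a single TryPop or Push, so contended requests no longer serialize on a shared lock just to discover whether a connection is available.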
There\u2019s still a lock, but for the hot path it\u2019s avoided as long as there are connections in the pool by using a ConcurrentStack&lt;T&gt;, such that renting is a TryPop and returning is a Push. ConcurrentStack&lt;T&gt; itself uses a lock-free algorithm that\u2019s a lot more scalable than a lock. There is an interesting downside to ConcurrentStack&lt;T&gt;, which is that the algorithm it employs necessarily involves an allocation per pushed element, and for reasons related to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/ABA_problem\">\u201cABA\u201d problem<\/a>, those allocations can\u2019t be pooled. That means that every time a connection is returned to the pool now, there\u2019s a small allocation. However, for an HTTP request\/response, even though we\u2019ve significantly reduced it over the years, there\u2019s still a non-trivial amount of allocation that occurs over the lifetime of the operation, so one more tiny one doesn\u2019t break the bank, and it\u2019s worth it for the reduced synchronization overheads. We\u2019ve experimented with other data structures, such as ConcurrentQueue&lt;T&gt; (which is able to avoid allocation per Enqueue at steady state), but they\u2019ve had other downsides. I expect we\u2019ll continue to push on this in the future, but for now, what\u2019s there is a nice improvement.<\/p>\n<p>Of course, once you\u2019ve got the connection, there\u2019s all of the costs associated with actually making the request and handling the response, and those have been whittled away at as well.<\/p>\n<p><strong>Using vectorized helpers.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93511\">dotnet\/runtime#93511<\/a> replaces a scalar loop for writing out bytes from an HTTP\/1.1 request header, instead using Ascii.FromUtf16, which is vectorized.
In fact, that vectorization was further improved by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102735\">dotnet\/runtime#102735<\/a>, which sped up the 256-bit and 512-bit code paths by using better instructions, made possible by not having to handle edge cases that had already been weeded out.<br \/>\n<strong>Avoiding extra async state machines.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100205\">dotnet\/runtime#100205<\/a> avoids an extra layer of async state machine that was incurred by most of the HTTP\/1.1 response streams; a method was async only to accommodate some rare logging being enabled, so the async wrapper is now only employed when that logging is enabled.<br \/>\n<strong>More caching of very common data.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100177\">dotnet\/runtime#100177<\/a> removes some more allocation and reduces some overheads by computing and caching some bytes that need to be written out on every request.<br \/>\n<strong>Special-casing main use cases.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102859\">dotnet\/runtime#102859<\/a> and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103008\">dotnet\/runtime#103008<\/a> made JsonContent and StringContent cheaper by special-casing by far the most common media types used. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93759\">dotnet\/runtime#93759<\/a> further improved JsonContent by reducing the number of async frames on the hot path. 
And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102845\">dotnet\/runtime#102845<\/a> from <a href=\"https:\/\/github.com\/pedrobsaila\">@pedrobsaila<\/a> made TryAddWithoutValidation cheaper for multiple values by special-casing the most common case of the input being an IList&lt;string&gt;, which enables presizing arrays while also avoiding an enumerator allocation.<br \/>\n<strong>More ArrayPool.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103764\">dotnet\/runtime#103764<\/a> avoided a possibly large char[] allocation in the parsing of Alt-Svc headers by using ArrayPool rather than direct allocation.<\/p>\n<p>While this simple benchmark doesn\u2019t touch on all of these changes, it does highlight that the end-to-end performance of HTTP requests gets cheaper:<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Net.Sockets;<br \/>\nusing System.Net;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);<br \/>\n    private static readonly HttpMessageInvoker s_client = new(new SocketsHttpHandler());<br \/>\n    private static Uri? s_uri;<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task HttpGet()<br \/>\n    {<br \/>\n        var m = new HttpRequestMessage(HttpMethod.Get, s_uri) { Content = new StringContent(&#8220;Hello, there! 
How are you today?&#8221;) };<br \/>\n        using (HttpResponseMessage r = await s_client.SendAsync(m, default))<br \/>\n        using (Stream s = r.Content.ReadAsStream())<br \/>\n            await s.CopyToAsync(Stream.Null);<br \/>\n    }<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void CreateSocketServer()<br \/>\n    {<br \/>\n        s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));<br \/>\n        s_listener.Listen(int.MaxValue);<br \/>\n        var ep = (IPEndPoint)s_listener.LocalEndPoint!;<br \/>\n        s_uri = new Uri($&#8221;http:\/\/{ep.Address}:{ep.Port}\/&#8221;);<\/p>\n<p>        Task.Run(async () =&gt;<br \/>\n        {<br \/>\n            while (true)<br \/>\n            {<br \/>\n                Socket s = await s_listener.AcceptAsync();<br \/>\n                _ = Task.Run(() =&gt;<br \/>\n                {<br \/>\n                    using (var ns = new NetworkStream(s, true))<br \/>\n                    {<br \/>\n                        byte[] buffer = new byte[1024];<br \/>\n                        int totalRead = 0;<br \/>\n                        while (true)<br \/>\n                        {<br \/>\n                            int read = ns.Read(buffer, totalRead, buffer.Length - totalRead);<br \/>\n                            if (read == 0) return;<br \/>\n                            totalRead += read;<br \/>\n                            if (buffer.AsSpan(0, totalRead).IndexOf(&#8220;\r\n\r\n&#8221;u8) == -1)<br \/>\n                            {<br \/>\n                                if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);<br \/>\n                                continue;<br \/>\n                            }<\/p>\n<p>                            ns.Write(&#8220;HTTP\/1.1 200 OK\r\nDate: Sun, 05 Jul 2020 12:00:00 GMT \r\nServer: Example\r\nContent-Length: 5\r\n\r\nHello&#8221;u8);<\/p>\n<p>                            totalRead = 0;<br \/>\n                        }<br \/>\n                    }<br 
\/>\n                });<br \/>\n            }<br \/>\n        });<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>HttpGet<br \/>\n.NET 8.0<br \/>\n92.42 us<br \/>\n1.00<br \/>\n1.98 KB<br \/>\n1.00<\/p>\n<p>HttpGet<br \/>\n.NET 9.0<br \/>\n77.13 us<br \/>\n0.83<br \/>\n1.8 KB<br \/>\n0.91<\/p>\n<p>Related to HTTP, the WebUtility and HttpUtility types both got more efficient this release. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103737\">dotnet\/runtime#103737<\/a>, in particular, made a variety of changes that have a measurable impact on HtmlEncode and UrlEncode:<\/p>\n<p>HtmlEncode had a scalar loop looking for characters that need to be encoded. That loop can instead be vectorized by using SearchValues&lt;char&gt;.<br \/>\nUrlEncode also had a similar scalar loop as part of looking for the first non-safe character. SearchValues&lt;char&gt; can also solve this.<br \/>\nUrlEncode had a complicated scheme where it would UTF8-encode into a newly-allocated byte[], percent-encode in place in that array (thanks to the ability to reinterpret cast with spans), and then use the resulting chars to create a new string. 
Instead, string.Create can be used, with all of the work done in-place in the buffer generated for that operation.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Net;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    [Arguments(&#8220;&#8221;&#8221;<br \/>\n        How much wood could a woodchuck chuck<br \/>\n        If a woodchuck could chuck wood?<br \/>\n        A woodchuck would chuck as much wood<br \/>\n        As much wood as a woodchuck could chuck,<br \/>\n        If a woodchuck could chuck wood.<br \/>\n        &#8220;&#8221;&#8221;)]<br \/>\n    public string HtmlEncode(string input) =&gt; WebUtility.HtmlEncode(input);<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(&#8220;short_name.txt&#8221;)]<br \/>\n    public string UrlEncode(string input) =&gt; WebUtility.UrlEncode(input);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\ninput<br \/>\nMean<br \/>\nRatio<\/p>\n<p>HtmlEncode<br \/>\n.NET 8.0<br \/>\nHow (\u2026)ood. [181]<br \/>\n102.607 ns<br \/>\n1.00<\/p>\n<p>HtmlEncode<br \/>\n.NET 9.0<br \/>\nHow (\u2026)ood. [181]<br \/>\n10.188 ns<br \/>\n0.10<\/p>\n<p>UrlEncode<br \/>\n.NET 8.0<br \/>\nshort_name.txt<br \/>\n8.656 ns<br \/>\n1.00<\/p>\n<p>UrlEncode<br \/>\n.NET 9.0<br \/>\nshort_name.txt<br \/>\n2.463 ns<br \/>\n0.28<\/p>\n<p>HttpUtility also received some attention. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102805\">dotnet\/runtime#102805<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> updated UrlEncodeToBytes, using stack space instead of allocation for smaller inputs, and using SearchValues&lt;byte&gt; to optimize the search for invalid bytes. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102753\">dotnet\/runtime#102753<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> did the same for UrlDecodeToBytes. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102909\">dotnet\/runtime#102909<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> similarly reduced allocation in UrlPathEncode, but via the ArrayPool. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102917\">dotnet\/runtime#102917<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> optimized JavaScriptStringEncode, in particular by using SearchValues. And <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/102745\">dotnet\/runtime#102745<\/a> from <a href=\"https:\/\/github.com\/TrayanZapryanov\">@TrayanZapryanov<\/a> optimized ParseQueryString by using stackalloc instead of array allocation for smaller input lengths and by replacing string.Substrings with span slicing.<\/p>\n<p>There were also changes elsewhere in the networking stack that contribute to HTTP use cases. In <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98074\">dotnet\/runtime#98074<\/a>, for example, Uri gained new TryEscapeDataString and TryUnescapeDataString methods that store the output characters into a provided destination span rather than allocating new strings on each call. 
These methods were then used elsewhere in the networking stack, such as in FormUrlEncodedContent, to improve throughput and reduce allocation.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private KeyValuePair&lt;string, string&gt;[] _data =<br \/>\n    [<br \/>\n        new(&#8220;key1&#8221;, &#8220;value1&#8221;),<br \/>\n        new(&#8220;key2&#8221;, &#8220;value2&#8221;),<br \/>\n        new(&#8220;key3&#8221;, &#8220;value3&#8221;),<br \/>\n        new(&#8220;key4&#8221;, &#8220;value4&#8221;)<br \/>\n    ];<\/p>\n<p>    [Benchmark]<br \/>\n    public FormUrlEncodedContent Create() =&gt; new FormUrlEncodedContent(_data);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Create<br \/>\n.NET 8.0<br \/>\n311.5 ns<br \/>\n1.00<br \/>\n848 B<br \/>\n1.00<\/p>\n<p>Create<br \/>\n.NET 9.0<br \/>\n218.7 ns<br \/>\n0.70<br \/>\n384 B<br \/>\n0.45<\/p>\n<p>Beyond raw HTTP, there were also some new features for WebSocket in .NET 9, namely support for keep-alive pings and timeouts, though not many PRs focused solely on performance (though <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101953\">dotnet\/runtime#101953<\/a> from <a href=\"https:\/\/github.com\/PaulusParssinen\">@PaulusParssinen<\/a> did utilize some newer APIs in ManagedWebSocket in a way that may have removed a bit of fat). There was one notable improvement, however, in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104865\">dotnet\/runtime#104865<\/a>. 
The WebSocket <a href=\"https:\/\/www.rfc-editor.org\/rfc\/rfc6455\">RFC 6455<\/a> specification requires that when a data frame\u2019s opcode denotes its payload as text, that payload must be validated as UTF8-encoded bytes. The validation for that had been a hand-rolled scalar comparison loop. However, now that Utf8.IsValid exists (it was introduced in .NET 8), that accelerated method can be used here instead. It can\u2019t be used in all situations, which is probably why it wasn\u2019t immediately employed when the method was added in the first place. WebSocket payloads may be split across data frames, so it\u2019s possible that the frame being validated is actually the continuation of some previously-received data, and it\u2019s possible that this frame is not the end of the payload, either. But, we know those two pieces of information up-front: if it\u2019s a continuation from a previous frame, we would have already noted it as such, and if it\u2019s not complete, its end-of-message bit won\u2019t have been set. Thus, for the common case where the payload is complete, we can use the accelerated helper for UTF8 validation, and only fall back to the slower path for the corner cases. 
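<\/p>\n<p>The resulting fast-path\/slow-path split looks roughly like this (a sketch of the idea only, not the actual ManagedWebSocket code; the stateful incremental fallback is elided):<\/p>

```csharp
using System.Text;
using System.Text.Unicode;

byte[] complete = Encoding.UTF8.GetBytes("Shall I compare thee to a summer's day?");
Console.WriteLine(ValidateTextPayload(complete, isContinuation: false, endOfMessage: true)); // True

byte[] invalid = [0xFF, 0xFE, 0xFD]; // bytes that can never appear in UTF8
Console.WriteLine(ValidateTextPayload(invalid, isContinuation: false, endOfMessage: true)); // False

// Sketch: a complete, non-continuation text frame can be validated in one
// shot with the vectorized Utf8.IsValid; only partial or continuation
// payloads need the slower, stateful validation that carries partial
// sequences across frames.
static bool ValidateTextPayload(ReadOnlySpan<byte> payload, bool isContinuation, bool endOfMessage)
{
    if (!isContinuation && endOfMessage)
        return Utf8.IsValid(payload); // vectorized one-shot validation

    return ValidateIncrementally(payload, isContinuation, endOfMessage);
}

// Placeholder for the scalar, stateful path (details omitted in this sketch).
static bool ValidateIncrementally(ReadOnlySpan<byte> payload, bool isContinuation, bool endOfMessage) =>
    throw new NotImplementedException();
```

<p>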
And this matters because even with the networking costs involved, that UTF8 validation shows up.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Net;<br \/>\nusing System.Net.Sockets;<br \/>\nusing System.Net.WebSockets;<br \/>\nusing System.Text;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private WebSocket _client, _server;<br \/>\n    private Memory&lt;byte&gt; _buffer = Encoding.UTF8.GetBytes(&#8220;&#8221;&#8221;<br \/>\n        Shall I compare thee to a summer\u2019s day?<br \/>\n        Thou art more lovely and more temperate:<br \/>\n        Rough winds do shake the darling buds of May,<br \/>\n        And summer\u2019s lease hath all too short a date;<br \/>\n        Sometime too hot the eye of heaven shines,<br \/>\n        And often is his gold complexion dimm&#8217;d;<br \/>\n        And every fair from fair sometime declines,<br \/>\n        By chance or nature\u2019s changing course untrimm&#8217;d;<br \/>\n        But thy eternal summer shall not fade,<br \/>\n        Nor lose possession of that fair thou ow\u2019st;<br \/>\n        Nor shall death brag thou wander\u2019st in his shade,<br \/>\n        When in eternal lines to time thou grow\u2019st:<br \/>\n        So long as men can breathe or eyes can see,<br \/>\n        So long lives this, and this gives life to thee.<br \/>\n        &#8220;&#8221;&#8221;);<br \/>\n    private Memory&lt;byte&gt; _tmp = new byte[1024];<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);<br \/>\n        
listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));<br \/>\n        listener.Listen();<\/p>\n<p>        var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);<br \/>\n        client.Connect(listener.LocalEndPoint!);<br \/>\n        Socket server = listener.Accept();<\/p>\n<p>        _client = WebSocket.CreateFromStream(new NetworkStream(client, ownsSocket: true), new WebSocketCreationOptions { IsServer = false, });<br \/>\n        _server = WebSocket.CreateFromStream(new NetworkStream(server, ownsSocket: true), new WebSocketCreationOptions { IsServer = true });<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup()<br \/>\n    {<br \/>\n        _client.Dispose();<br \/>\n        _server.Dispose();<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public async Task SendReceive()<br \/>\n    {<br \/>\n        await _client.SendAsync(_buffer, WebSocketMessageType.Text, true, default);<br \/>\n        while (!(await _server.ReceiveAsync(_tmp, default)).EndOfMessage) ;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>SendReceive<br \/>\n.NET 8.0<br \/>\n4.093 us<br \/>\n1.00<\/p>\n<p>SendReceive<br \/>\n.NET 9.0<br \/>\n3.438 us<br \/>\n0.84<\/p>\n<p>There are of course a variety of reasons that performance could have improved, e.g. maybe WebSockets is just exercising a code path that benefits from one of the other optimizations already discussed. How do we know it\u2019s connected to the validation? Let\u2019s profile. And since we already have a benchmark written, we can just use it. There\u2019s another very handy nuget package, Microsoft.VisualStudio.DiagnosticsHub.BenchmarkDotNetDiagnosers, which contains additional \u201cdiagnosers\u201d for BenchmarkDotNet. Diagnosers are one of the main extensibility points within BenchmarkDotNet, enabling developers to perform additional tracking and analyses over benchmarks. 
You\u2019ve already seen me use some, including the built-in [MemoryDiagnoser(false)] and [DisassemblyDiagnoser]; there are other built-in ones we haven\u2019t used in this post but that are helpful in various situations, like [ThreadingDiagnoser] and [ExceptionDiagnoser], but diagnosers can come from anywhere, and the aforementioned nuget package provides several more. The purpose of those diagnosers is to collect and export performance traces that Visual Studio\u2019s performance tools can then consume. In my case, I want to collect a CPU trace to understand where CPU consumption is going, so I added a [CPUUsageDiagnoser] attribute to my Tests class:<\/p>\n<p>[CPUUsageDiagnoser]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<\/p>\n<p>and then re-ran (on Windows). That\u2019s it. While the test is running, you\u2019ll see the same output as you\u2019re used to seeing, plus a little more. For example, at the end of the benchmarking, I now also see this:<\/p>\n<p>\/\/ * Diagnostic Output &#8211; VSDiagnosticsDiagnoser *<br \/>\nCollection result moved to &#8216;BenchmarkDotNet_Tests_20240804_081400.diagsession&#8217;.<br \/>\nSession : {a1671047-d6da-4a56-9c71-eadef6c1dd00}<br \/>\n  Stopped<br \/>\nExported diagsession file: d:\Benchmarks\BenchmarkDotNet_Tests_20240804_081400.diagsession.<\/p>\n<p>I then simply opened that .diagsession file, just typing its name at the command-line, since that file extension is by default associated with Visual Studio, but you could also File-&gt;Open from within Visual Studio itself. That results in a view like the following:<\/p>\n<p>Notice this single trace covers both the .NET 8 test execution and the .NET 9 test execution, and each is represented by a different entry in the Benchmarks table (but both are on the same execution timeline). 
I can then double-click one of the tests to narrow the timeline down to just the relevant portion of activity, and then switch over to the CPU Usage tab. When I do, here\u2019s what I see for .NET 8 for the top impacting methods:<\/p>\n<p>and for .NET 9:<\/p>\n<p>Notice in the first trace that TryValidateUtf8 is taking up almost 8% of the CPU time, but it doesn\u2019t show up in the second trace at all, instead being replaced by Utf8Utility.GetPointerToFirstInvalidByte, which is the implementation of Utf8.IsValid and accounts for only half a percent. That ~8% correlates with the ~10% reduction we saw in benchmark execution time. Neat.<\/p>\n<h2>JSON<\/h2>\n<p>System.Text.Json hit the scene in .NET Core 3.0, and every release since, it\u2019s gotten more capable and more efficient. .NET 9 is no exception. In addition to new features like support for exporting JSON schema, deep semantic equality comparison of JsonElements, the ability to respect nullable reference type annotations, support for ordering JsonObject properties, new contract metadata APIs, and more, performance has also been a significant focus.<\/p>\n<p>One improvement comes from the integration of JsonSerializer with System.IO.Pipelines. Much of the .NET stack moves bytes around via Stream; ASP.NET, however, is internally implemented with System.IO.Pipelines. There are built-in bidirectional adapters between streams and pipes, but in some cases those adapters add some overhead. As JSON is so critical to modern services, it\u2019s important that JsonSerializer be able to work equally well with both streams and pipes. As such, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/101461\">dotnet\/runtime#101461<\/a> adds new JsonSerializer.SerializeAsync overloads that target PipeWriter in addition to the existing overloads that target Stream. That way, whether you have a Stream or a PipeWriter, JsonSerializer will natively work with either without requiring any indirection to adapt between them. 
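<\/p>\n<p>Using the new overload is then just a matter of handing SerializeAsync the PipeWriter (a sketch; the Pipe here stands in for something like an ASP.NET Core response pipe, and outside of ASP.NET Core the System.IO.Pipelines package may need to be referenced):<\/p>

```csharp
using System.IO.Pipelines;
using System.Text;
using System.Text.Json;

// Serialize straight into a PipeWriter: no Stream adapter in between.
var pipe = new Pipe();
await JsonSerializer.SerializeAsync(pipe.Writer, new { Name = "Elsa", Age = 21 });
await pipe.Writer.CompleteAsync();

// Read back what was written to the pipe, just to show the output.
ReadResult result = await pipe.Reader.ReadAsync();
string json = Encoding.UTF8.GetString(result.Buffer);
pipe.Reader.AdvanceTo(result.Buffer.End);
Console.WriteLine(json); // {"Name":"Elsa","Age":21}
```

<p>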
Just use whichever you already have.<\/p>\n<p>JsonSerializer\u2018s handling of enums was also improved by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105032\">dotnet\/runtime#105032<\/a>. In addition to adding support for the new [JsonEnumMemberName] attribute, it also moved to an allocation-free parsing solution for enums, utilizing the GetAlternateLookup support added to Dictionary&lt;TKey, TValue&gt; and ConcurrentDictionary&lt;TKey, TValue&gt; to enable a cache of enum information queryable via a ReadOnlySpan&lt;char&gt;.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.Json;<br \/>\nusing System.Reflection;<br \/>\nusing System.Text.Json.Serialization;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private static readonly JsonSerializerOptions s_options = new()<br \/>\n    {<br \/>\n        Converters = { new JsonStringEnumConverter() },<br \/>\n        DictionaryKeyPolicy = JsonNamingPolicy.KebabCaseLower,<br \/>\n    };<\/p>\n<p>    [Params(BindingFlags.Default, BindingFlags.NonPublic | BindingFlags.Instance)]<br \/>\n    public BindingFlags _value;<\/p>\n<p>    private byte[] _jsonValue;<br \/>\n    private Utf8JsonWriter _writer = new(Stream.Null);<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup() =&gt; _jsonValue = JsonSerializer.SerializeToUtf8Bytes(_value, s_options);<\/p>\n<p>    [Benchmark]<br \/>\n    public void Serialize()<br \/>\n    {<br \/>\n        _writer.Reset();<br \/>\n        JsonSerializer.Serialize(_writer, _value, s_options);<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public BindingFlags Deserialize() =&gt;<br 
\/>\n        JsonSerializer.Deserialize&lt;BindingFlags&gt;(_jsonValue, s_options);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\n_value<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>Serialize<br \/>\n.NET 8.0<br \/>\nDefault<br \/>\n38.67 ns<br \/>\n1.00<br \/>\n24 B<br \/>\n1.00<\/p>\n<p>Serialize<br \/>\n.NET 9.0<br \/>\nDefault<br \/>\n27.23 ns<br \/>\n0.70<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Deserialize<br \/>\n.NET 8.0<br \/>\nDefault<br \/>\n73.86 ns<br \/>\n1.00<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>Deserialize<br \/>\n.NET 9.0<br \/>\nDefault<br \/>\n70.48 ns<br \/>\n0.95<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>Serialize<br \/>\n.NET 8.0<br \/>\nInstance, NonPublic<br \/>\n37.60 ns<br \/>\n1.00<br \/>\n24 B<br \/>\n1.00<\/p>\n<p>Serialize<br \/>\n.NET 9.0<br \/>\nInstance, NonPublic<br \/>\n26.82 ns<br \/>\n0.71<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Deserialize<br \/>\n.NET 8.0<br \/>\nInstance, NonPublic<br \/>\n97.54 ns<br \/>\n1.00<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>Deserialize<br \/>\n.NET 9.0<br \/>\nInstance, NonPublic<br \/>\n70.72 ns<br \/>\n0.73<br \/>\n\u2013<br \/>\nNA<\/p>\n<p>JsonSerializer relies on lots of other functionality from System.Text.Json, which has also improved. Here\u2019s a sampling:<\/p>\n<p><strong>Direct use of UTF8.<\/strong> JsonProperty.WriteTo would always use writer.WritePropertyName(Name) to output the property name. However, that Name property might end up allocating a new string if the JsonProperty wasn\u2019t already caching one. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/90074\">dotnet\/runtime#90074<\/a> from <a href=\"https:\/\/github.com\/karakasa\">@karakasa<\/a> tweaked the implementation to write out the string if it already had one, or else to directly write out a name based on the UTF8 bytes it would have used to create that string.<br \/>\n<strong>Avoiding unnecessary intermediate state.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/97687\">dotnet\/runtime#97687<\/a> from <a href=\"https:\/\/github.com\/habbes\">@habbes<\/a> is one of those lovely PRs that\u2019s a pure win. The primary change here is to a Base64EncodeAndWrite method that\u2019s Base64-encoding a source ReadOnlySpan&lt;byte&gt; to a destination Span&lt;byte&gt;. The implementation was either stackalloc\u2018ing a buffer or renting a buffer, then encoding into that temporary, and then copying the data into a buffer that is guaranteed to be large enough. Why wasn\u2019t it just encoding directly into that destination buffer rather than going through a temporary? Unclear. But thanks to this PR, that intermediate overhead was simply removed. Similarly, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92284\">dotnet\/runtime#92284<\/a> removed some unnecessary intermediate state from JsonNode.GetPath. JsonNode.GetPath was doing a lot of allocation, creating a List&lt;string&gt; of all of the path segments which were then combined in reverse order into a StringBuilder. 
This changes the implementation to extract the path segments in the reverse order in the first place, then building up the resulting path in stack space or an array rented from the ArrayPool.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.Json.Nodes;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private JsonNode _json = JsonNode.Parse(&#8220;&#8221;&#8221;<br \/>\n        {<br \/>\n            &#8220;staff&#8221;: {<br \/>\n                &#8220;Elsa&#8221;: {<br \/>\n                    &#8220;age&#8221;: 21,<br \/>\n                    &#8220;position&#8221;: &#8220;queen&#8221;<br \/>\n                }<br \/>\n            }<br \/>\n        }<br \/>\n        &#8220;&#8221;&#8221;)[&#8220;staff&#8221;][&#8220;Elsa&#8221;][&#8220;position&#8221;];<\/p>\n<p>    [Benchmark]<br \/>\n    public string GetPath() =&gt; _json.GetPath();<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\n_value<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>GetPath<br \/>\n.NET 8.0<br \/>\nDefault<br \/>\n176.68 ns<br \/>\n1.00<br \/>\n472 B<br \/>\n1.00<\/p>\n<p>GetPath<br \/>\n.NET 9.0<br \/>\nDefault<br \/>\n27.23 ns<br \/>\n0.30<br \/>\n64 B<br \/>\n0.14<\/p>\n<p><strong>Using existing caches.<\/strong> JsonNode.ToString and JsonNode.ToJsonString were allocating a new PooledByteBufferWriter and Utf8JsonWriter, but the internal Utf8JsonWriterCache type already provides support for using cached versions of these same objects. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/92358\">dotnet\/runtime#92358<\/a> just updated these JsonNode methods to utilize the existing cache.<br \/>\n<strong>Pre-sizing collections.<\/strong> JsonObject has a constructor that accepts an enumerable of properties to add to the object. For a lot of properties, as it\u2019s adding properties, the backing store may need to keep growing and growing, incurring the overhead of allocation and copies. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/96486\">dotnet\/runtime#96486<\/a> from <a href=\"https:\/\/github.com\/olo-ntaylor\">@olo-ntaylor<\/a> tests to see whether a count can be retrieved from the enumerable, and if it can, it uses that count to pre-size the dictionary.<br \/>\n<strong>Allow fast paths to be fast.<\/strong> JsonValue has a niche feature that enables it to wrap an arbitrary .NET object. As JsonValue derives from JsonNode, JsonNode needs to take that capability into account. The current way it does so makes some common operations much more expensive than they\u2019d need to be. 
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103733\">dotnet\/runtime#103733<\/a> refactors the implementation to optimize for the common cases.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.Json;<br \/>\nusing System.Text.Json.Nodes;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private JsonNode[] _nodes = [42, &#8220;I am a string&#8221;, false, DateTimeOffset.Now];<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(JsonValueKind.String)]<br \/>\n    public int Count(JsonValueKind kind)<br \/>\n    {<br \/>\n        var count = 0;<br \/>\n        foreach (var node in _nodes)<br \/>\n        {<br \/>\n            if (node.GetValueKind() == kind)<br \/>\n            {<br \/>\n                count++;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return count;<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nkind<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Count<br \/>\n.NET 8.0<br \/>\nString<br \/>\n729.26 ns<br \/>\n1.00<\/p>\n<p>Count<br \/>\n.NET 9.0<br \/>\nString<br \/>\n12.14 ns<br \/>\n0.02<\/p>\n<p><strong>Deduplicating accesses.<\/strong> JsonValue.CreateFromElement accesses JsonElement.ValueKind to determine how to process the data, e.g.<br \/>\nif (element.ValueKind is JsonValueKind.Null) { &#8230; }<br \/>\nelse if (element.ValueKind is JsonValueKind.Object or JsonValueKind.Array) { &#8230; }<br \/>\nelse { &#8230; }<\/p>\n<p>If ValueKind were a simple field access, that\u2019d be fine. But it\u2019s a bit more complicated than that, involving a large switch to determine what kind to return. 
Rather than possibly reading from it twice, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104108\">dotnet\/runtime#104108<\/a> from <a href=\"https:\/\/github.com\/andrewjsaid\">@andrewjsaid<\/a> just makes a small tweak to only access the property once. No point in doing that work twice.<\/p>\n<p><strong>Spans over existing data.<\/strong> The JsonElement.GetRawText method is useful for extracting the original input backing the JsonElement, but the data is stored as UTF8 bytes and GetRawText returns a string, so every call allocates and transcodes to produce the result. From <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104595\">dotnet\/runtime#104595<\/a>, the new JsonMarshal.GetRawUtf8Value simply returns a span over the original data, no encoding, no allocation.<br \/>\n\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Runtime.InteropServices;<br \/>\nusing System.Text.Json;<br \/>\nusing System.Text.Json.Nodes;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private JsonElement _json = JsonSerializer.Deserialize(&#8220;&#8221;&#8221;<br \/>\n        {<br \/>\n            &#8220;staff&#8221;: {<br \/>\n                &#8220;Elsa&#8221;: {<br \/>\n                    &#8220;age&#8221;: 21,<br \/>\n                    &#8220;position&#8221;: &#8220;queen&#8221;<br \/>\n                }<br \/>\n            }<br \/>\n        }<br \/>\n        &#8220;&#8221;&#8221;);<\/p>\n<p>    [Benchmark(Baseline = true)]<br \/>\n    public string GetRawText() =&gt; _json.GetRawText();<\/p>\n<p>    [Benchmark]<br \/>\n    public ReadOnlySpan&lt;byte&gt; TryGetRawText() =&gt; 
JsonMarshal.GetRawUtf8Value(_json);<br \/>\n}<\/p>\n<p>Method<br \/>\nMean<br \/>\nRatio<br \/>\nAllocated<br \/>\nAlloc Ratio<\/p>\n<p>GetRawText<br \/>\n51.627 ns<br \/>\n1.00<br \/>\n192 B<br \/>\n1.00<\/p>\n<p>TryGetRawText<br \/>\n7.998 ns<br \/>\n0.15<br \/>\n\u2013<br \/>\n0.00<\/p>\n<p>Note that the new method is on the new JsonMarshal class because it\u2019s an API with safety concerns (in general, APIs on the Unsafe class or in the System.Runtime.InteropServices namespace are considered \u201cunsafe\u201d). The concern here is that the JsonElement might be backed by an array rented from the ArrayPool, if the JsonElement came from a JsonDocument. The ReadOnlySpan&lt;byte&gt; you get back is simply pointing into that array. If after getting the span, the JsonDocument is disposed, it\u2019ll return that array back to the pool, and now the span is referencing an array that someone else might rent. If they do and write into that array, the span will now contain whatever was written there, effectively yielding corrupted data. 
Try this:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0<\/p>\n<p>using System.Runtime.InteropServices;<br \/>\nusing System.Text.Json;<br \/>\nusing System.Text;<\/p>\n<p>ReadOnlySpan&lt;byte&gt; elsaUtf8;<br \/>\nusing (JsonDocument elsaJson = JsonDocument.Parse(&#8220;&#8221;&#8221;<br \/>\n        {<br \/>\n          &#8220;staff&#8221;: {<br \/>\n            &#8220;Elsa&#8221;: {<br \/>\n              &#8220;age&#8221;: 21,<br \/>\n              &#8220;position&#8221;: &#8220;queen&#8221;<br \/>\n            }<br \/>\n          }<br \/>\n        }<br \/>\n        &#8220;&#8221;&#8221;))<br \/>\n{<br \/>\n    elsaUtf8 = JsonMarshal.GetRawUtf8Value(elsaJson.RootElement);<br \/>\n}<\/p>\n<p>using (JsonDocument annaJson = JsonDocument.Parse(&#8220;&#8221;&#8221;<br \/>\n        {<br \/>\n          &#8220;staff&#8221;: {<br \/>\n            &#8220;Anna&#8221;: {<br \/>\n              &#8220;age&#8221;: 18,<br \/>\n              &#8220;position&#8221;: &#8220;princess&#8221;<br \/>\n            }<br \/>\n          }<br \/>\n        }<br \/>\n        &#8220;&#8221;&#8221;))<br \/>\n{<br \/>\n    Console.WriteLine(Encoding.UTF8.GetString(elsaUtf8)); \/\/ uh oh!<br \/>\n}<\/p>\n<p>When I run that, it prints out the information about \u201cAnna,\u201d even though I retrieved the raw text from the \u201cElsa\u201d JsonElement. Oops! As with anything in C# or .NET that\u2019s \u201cunsafe,\u201d you just need to make sure you hold it correctly.<\/p>\n<p>One last improvement I want to call out. The feature itself is not actually about performance, but the workarounds I\u2019ve seen folks employ for the lack of this capability do have a significant performance impact, and so having the feature built-in will be a net performance win. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104328\">dotnet\/runtime#104328<\/a> adds support to both Utf8JsonReader and JsonSerializer for parsing out multiple top-level JSON objects from an input. 
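On the reader side, this surfaces as an option on JsonReaderOptions; a quick sketch (AllowMultipleValues is the .NET 9 option name; the payload here is made up):

```csharp
using System.Text.Json;

// Three top-level JSON objects in a single payload.
ReadOnlySpan<byte> utf8 = """{"id":1} {"id":2} {"id":3}"""u8;

// AllowMultipleValues lets the reader continue past the end of the first
// top-level value instead of throwing.
var reader = new Utf8JsonReader(utf8, new JsonReaderOptions { AllowMultipleValues = true });
int topLevelObjects = 0;
while (reader.Read())
{
    if (reader.TokenType == JsonTokenType.StartObject && reader.CurrentDepth == 0)
    {
        topLevelObjects++; // each top-level object starts here
    }
}
```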
Previously if any data was found after a JSON object in the input, that would be considered erroneous and fail to parse, and that means that if a particular data source served up multiple JSON objects one after the other, the data would need to be pre-parsed in order to feed only the relevant portions to System.Text.Json. This is particularly relevant with services that stream data, as some of them use such a format.<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Text.Json;<br \/>\nusing System.Text.Json.Nodes;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private MemoryStream _source = new MemoryStream(&#8220;&#8221;&#8221;<br \/>\n        {<br \/>\n          &#8220;name&#8221;: &#8220;Alice&#8221;,<br \/>\n          &#8220;age&#8221;: 30,<br \/>\n          &#8220;city&#8221;: &#8220;New York&#8221;<br \/>\n        }<\/p>\n<p>        {<br \/>\n          &#8220;name&#8221;: &#8220;Bob&#8221;,<br \/>\n          &#8220;age&#8221;: 25,<br \/>\n          &#8220;city&#8221;: &#8220;Los Angeles&#8221;<br \/>\n        }<\/p>\n<p>        {<br \/>\n          &#8220;name&#8221;: &#8220;Charlie&#8221;,<br \/>\n          &#8220;age&#8221;: 35,<br \/>\n          &#8220;city&#8221;: &#8220;Chicago&#8221;<br \/>\n        }<br \/>\n        &#8220;&#8221;&#8221;u8.ToArray());<\/p>\n<p>    [Benchmark]<br \/>\n    [Arguments(&#8220;Dave&#8221;)]<br \/>\n    public async Task&lt;Person?&gt; FindAsync(string name)<br \/>\n    {<br \/>\n        _source.Position = 0;<\/p>\n<p>        await foreach (var p in JsonSerializer.DeserializeAsyncEnumerable&lt;Person&gt;(_source, topLevelValues: true))<br \/>\n        {<br \/>\n   
         if (p?.Name == name)<br \/>\n            {<br \/>\n                return p;<br \/>\n            }<br \/>\n        }<\/p>\n<p>        return null;<br \/>\n    }<\/p>\n<p>    public class Person<br \/>\n    {<br \/>\n        public string Name { get; set; }<br \/>\n        public int Age { get; set; }<br \/>\n        public string City { get; set; }<br \/>\n    }<br \/>\n}<\/p>\n<h2>Diagnostics<\/h2>\n<p>Being able to observe one\u2019s application in production is critical to the operation of modern services. System.Diagnostics.Metrics.Meter is .NET\u2019s recommended type for emitting metrics, and several improvements have gone into making it more efficient in .NET 9.<\/p>\n<p>Counter and UpDownCounter are often used for hot-path tracking of metrics like number of active or queued requests. In production environments, these instruments are frequently bombarded from multiple threads concurrently, which both means they need to be thread-safe but also that they need to be able to scale well. The thread-safety had been achieved by using a lock around updates (which were simply reading a value, adding to it, and storing it back), but under heavy load that could result in significant contention on the lock. To address this, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/91566\">dotnet\/runtime#91566<\/a> changed the implementation in a few ways. First, rather than using a lock to protect the state:<\/p>\n<p>lock (this)<br \/>\n{<br \/>\n    _delta += value;<br \/>\n}<\/p>\n<p>it used an interlocked operation to perform the addition atomically. 
Here _delta is a double, and there\u2019s no Interlocked.Add that works with double values, so instead the standard approach of using a loop around an Interlocked.CompareExchange was employed.<\/p>\n<p>double currentValue;<br \/>\ndo<br \/>\n{<br \/>\n    currentValue = _delta;<br \/>\n}<br \/>\nwhile (Interlocked.CompareExchange(ref _delta, currentValue + value, currentValue) != currentValue);<\/p>\n<p>That helps, but while this does reduce the overhead and improve scalability, it still represents a bottleneck under heavy contention. To address that, the change also split the single _delta into an array of values, one per core, and a thread picks one of them to update, typically the value associated with the core on which it\u2019s currently running. That way, contention is significantly reduced: updates are distributed across N values instead of 1, and because threads prefer the value for the core they\u2019re currently on, and there\u2019s only ever one thread executing on a specific core at a given moment, the chances for conflicts drop considerably. There is still some contention, both because a thread isn\u2019t guaranteed to use the associated value (e.g.
the thread could migrate between the time it checks what core it\u2019s on and does the access) and because we actually cap the size of the array (so that it doesn\u2019t consume too much memory), but it still makes the system much more scalable.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Diagnostics.Metrics;<br \/>\nusing System.Diagnostics.Tracing;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private MetricsEventListener _listener = new MetricsEventListener();<br \/>\n    private Meter _meter = new Meter(&#8220;Example&#8221;);<br \/>\n    private Counter&lt;int&gt; _counter;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup() =&gt; _counter = _meter.CreateCounter&lt;int&gt;(&#8220;counter&#8221;);<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup()<br \/>\n    {<br \/>\n        _listener.Dispose();<br \/>\n        _meter.Dispose();<br \/>\n    }<\/p>\n<p>    [Benchmark]<br \/>\n    public void Counter_Parallel()<br \/>\n    {<br \/>\n        Parallel.For(0, 1_000_000, i =&gt;<br \/>\n        {<br \/>\n            _counter.Add(1);<br \/>\n            _counter.Add(1);<br \/>\n        });<br \/>\n    }<\/p>\n<p>    private sealed class MetricsEventListener : EventListener<br \/>\n    {<br \/>\n        protected override void OnEventSourceCreated(EventSource eventSource)<br \/>\n        {<br \/>\n            if (eventSource.Name == &#8220;System.Diagnostics.Metrics&#8221;)<br \/>\n            {<br \/>\n                EnableEvents(eventSource, EventLevel.LogAlways, EventKeywords.All, new Dictionary&lt;string, string?&gt;() { { &#8220;Metrics&#8221;, 
&#8220;Example\\upDownCounter;Example\\counter&#8221; } });<br \/>\n            }<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>Counter_Parallel<br \/>\n.NET 8.0<br \/>\n137.90 ms<br \/>\n1.00<\/p>\n<p>Counter_Parallel<br \/>\n.NET 9.0<br \/>\n30.65 ms<br \/>\n0.22<\/p>\n<p>There\u2019s another interesting aspect of the improvement worth mentioning, and that\u2019s the padding employed in the array. Going from a single double _delta to an array of deltas, you might imagine we\u2019d end up with:<\/p>\n<p>private readonly double[] _deltas;<\/p>\n<p>but if you look at the code, it\u2019s instead:<\/p>\n<p>private readonly PaddedDouble[] _deltas;<\/p>\n<p>where PaddedDouble is defined as:<\/p>\n<p>[StructLayout(LayoutKind.Explicit, Size = 64)]<br \/>\nprivate struct PaddedDouble<br \/>\n{<br \/>\n    [FieldOffset(0)]<br \/>\n    public double Value;<br \/>\n}<\/p>\n<p>This effectively increases the size of each value from 8 bytes to 64 bytes, where only the first 8 bytes of each value is used and the other 56 bytes are padding. That\u2019s odd, right? 
Normally we\u2019d jump at an opportunity to shrink 64 bytes down to 8 bytes in order to reduce allocation and memory consumption, but here we\u2019re purposefully going in the other direction.<\/p>\n<p>The reason for that is \u201cfalse sharing.\u201d Consider this benchmark, which I\u2019ve shamelessly borrowed from a conversation Scott Hanselman and I recently recorded for the <a href=\"https:\/\/www.youtube.com\/playlist?list=PLdo4fOcmZ0oX8eqDkSw4hH9cSehrGgdr1\">Deep .NET series<\/a> but which hasn\u2019t yet posted online:<\/p>\n<p>\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _values = new int[32];<\/p>\n<p>    [Params(1, 31)]<br \/>\n    public int Index { get; set; }<\/p>\n<p>    [Benchmark]<br \/>\n    public void ParallelIncrement()<br \/>\n    {<br \/>\n        Parallel.Invoke(<br \/>\n            () =&gt; IncrementLoop(ref _values[0]),<br \/>\n            () =&gt; IncrementLoop(ref _values[Index]));<\/p>\n<p>        static void IncrementLoop(ref int value)<br \/>\n        {<br \/>\n            for (int i = 0; i &lt; 100_000_000; i++)<br \/>\n            {<br \/>\n                Interlocked.Increment(ref value);<br \/>\n            }<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>When I run that, I get results like this:<\/p>\n<p>Method<br \/>\nIndex<br \/>\nMean<\/p>\n<p>ParallelIncrement<br \/>\n1<br \/>\n1,779.9 ms<\/p>\n<p>ParallelIncrement<br \/>\n31<br \/>\n432.3 ms<\/p>\n<p>In this benchmark, one thread is incrementing _values[0] and the other thread is incrementing either _values[1] or _values[31]. 
That index is the only difference, yet the one accessing _values[31] is several times faster than the one accessing _values[1]. That\u2019s because there\u2019s contention here even if it\u2019s not obvious in the code. The contention comes from the fact that the hardware works with memory in groups of bytes called a \u201ccache line.\u201d Most hardware has cache lines of 64 bytes. In order to update a particular memory location, the hardware will acquire the whole cache line. If another core wants to update that same cache line, it\u2019ll need to acquire it. That back and forth results in a lot of overhead. It doesn\u2019t matter if one core is touching the first of those 64 bytes and another core is touching the last; from the hardware\u2019s perspective there\u2019s still sharing happening. \u201cFalse sharing.\u201d Thus, the Counter fix uses padding around the double values to space them out and minimize the sharing that limits scalability.<\/p>\n<p>As an aside, there are some additional BenchmarkDotNet diagnosers that can help to highlight the effects of false sharing. ETW on Windows enables collecting various CPU performance counters, such as for branch misses or instructions retired, and BenchmarkDotNet has a [HardwareCounters] diagnoser that\u2019s able to collect this ETW data. One such counter is for cache misses, which often reflect false sharing issues.
If you\u2019re on Windows, you can try grabbing the separate BenchmarkDotNet.Diagnostics.Windows nuget package and using it as in this benchmark:<\/p>\n<p>\/\/ This benchmark only works on Windows.<br \/>\n\/\/ Add a &lt;PackageReference Include=&#8221;BenchmarkDotNet.Diagnostics.Windows&#8221; Version=&#8221;0.14.0&#8243; \/&gt; to the csproj.<br \/>\n\/\/ dotnet run -c Release -f net9.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Diagnosers;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HardwareCounters(HardwareCounter.InstructionRetired, HardwareCounter.CacheMisses)]<br \/>\n[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private int[] _values = new int[32];<\/p>\n<p>    [Params(1, 31)]<br \/>\n    public int Index { get; set; }<\/p>\n<p>    [Benchmark]<br \/>\n    public void ParallelIncrement()<br \/>\n    {<br \/>\n        Parallel.Invoke(<br \/>\n            () =&gt; IncrementLoop(ref _values[0]),<br \/>\n            () =&gt; IncrementLoop(ref _values[Index]));<\/p>\n<p>        static void IncrementLoop(ref int value)<br \/>\n        {<br \/>\n            for (int i = 0; i &lt; 100_000_000; i++)<br \/>\n            {<br \/>\n                Interlocked.Increment(ref value);<br \/>\n            }<br \/>\n        }<br \/>\n    }<br \/>\n}<\/p>\n<p>Here I\u2019ve asked for both instructions retired, which reflects how many instructions were fully executed (this in and of itself can be a useful metric when analyzing performance, as it\u2019s not as prone to variation as wall-clock measurements), and cache misses, which reflects how many times data wasn\u2019t available in the CPU\u2019s cache.<\/p>\n<p>Method<br \/>\nIndex<br \/>\nMean<br \/>\nInstructionRetired\/Op<br
\/>\nCacheMisses\/Op<\/p>\n<p>ParallelIncrement<br \/>\n1<br \/>\n1,846.2 ms<br \/>\n804,300,000<br \/>\n177,889<\/p>\n<p>ParallelIncrement<br \/>\n31<br \/>\n442.5 ms<br \/>\n824,333,333<br \/>\n52,429<\/p>\n<p>In the two benchmarks, we can see that the number of instructions executed is almost the same between when false sharing occurred (Index == 1) and didn\u2019t (Index == 31), but the number of cache misses is more than three times larger in the false sharing case, and reasonably well correlated with the time increase. When one core performs a write, that invalidates the corresponding cache line in the other core\u2019s cache, such that the other core then needs to reload the cache line, resulting in cache misses. But I digress\u2026<\/p>\n<p>Another nice improvement comes in <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/105011\">dotnet\/runtime#105011<\/a> from <a href=\"https:\/\/github.com\/stevejgordon\">@stevejgordon<\/a>, adding a new constructor to Measurement. Often when creating Measurements, you\u2019re also tagging them with additional key\/value pairs of information, for which the TagList type exists. TagList implements IList&lt;KeyValuePair&lt;string, object?&gt;&gt;, and Measurement has a constructor that takes an IEnumerable&lt;KeyValuePair&lt;string, object?&gt;&gt;, so you can pass a TagList to a Measurement and it \u201cjust works\u201d\u2026 albeit slower than it could. If you had code like:<\/p>\n<p>measurements.Add(new Measurement&lt;long&gt;(<br \/>\n    snapshotV4.LastAckCount,<br \/>\n    new TagList { tcpVersionFourTag, new(NetworkStateKey, &#8220;last_ack&#8221;) }));<\/p>\n<p>that would end up boxing the TagList struct as an enumerable, and then enumerating through it via the interface, which also entails an enumerator allocation. The new constructor this PR adds takes a TagList, avoiding those overheads. 
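To make that boxing concrete, here's a minimal illustration with hypothetical types (not the actual Measurement or TagList code):

```csharp
using System.Collections;
using System.Collections.Generic;

// Passing a struct through an interface-typed parameter boxes the struct,
// and enumerating it via the interface allocates an enumerator as well.
public struct Tags : IEnumerable<KeyValuePair<string, object?>>
{
    public IEnumerator<KeyValuePair<string, object?>> GetEnumerator()
    {
        yield return new("region", "eu"); // iterator => enumerator allocation
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

public static class Sink
{
    // Interface-typed parameter: a Tags argument is boxed at the call site.
    public static void Record(IEnumerable<KeyValuePair<string, object?>> tags) { }

    // Struct-typed overload: no box. Overload resolution prefers the identity
    // conversion, which is the same idea as the new constructor taking a
    // TagList directly.
    public static void Record(in Tags tags) { }
}
```

Calling `Sink.Record(new Tags())` binds to the struct-typed overload, so adding such an overload removes the allocations without changing call sites.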
TagList is also a large struct, as common usage has it living only on the stack and so as an optimization it stores some of the contained key\/value pairs directly in fields on the struct rather than always forcing an array allocation. The net result is much less overhead in constructing these measurements.<\/p>\n<p>TagList itself was also improved by <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104132\">dotnet\/runtime#104132<\/a>, which re-implemented the type for .NET 8+ on top of [InlineArray]. TagList is effectively a list of key\/value pairs, but in order to avoid always allocating a backing store, it stores some of those pairs inline in itself. This previously was done with dedicated fields for each pair, and then code that directly accessed each field. Now, an [InlineArray] is used, cleaning up the code and enabling access via spans.<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Diagnostics;<br \/>\nusing System.Diagnostics.Metrics;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Counter&lt;long&gt; _counter;<br \/>\n    private Meter _meter;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        this._meter = new Meter(&#8220;TestMeter&#8221;);<br \/>\n        this._counter = this._meter.CreateCounter&lt;long&gt;(&#8220;counter&#8221;);<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup() =&gt; this._meter.Dispose();<\/p>\n<p>    [Benchmark]<br \/>\n    public void CounterAdd()<br \/>\n    {<br \/>\n        this._counter?.Add(100, new TagList<br \/>\n        {<br \/>\n            { &#8220;Name1&#8221;, &#8220;Val1&#8221; },<br \/>\n            { &#8220;Name2&#8221;, &#8220;Val2&#8221; },<br \/>\n            { 
&#8220;Name3&#8221;, &#8220;Val3&#8221; },<br \/>\n            { &#8220;Name4&#8221;, &#8220;Val4&#8221; },<br \/>\n            { &#8220;Name5&#8221;, &#8220;Val5&#8221; },<br \/>\n            { &#8220;Name6&#8221;, &#8220;Val6&#8221; },<br \/>\n            { &#8220;Name7&#8221;, &#8220;Val7&#8221; },<br \/>\n        });<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>CounterAdd<br \/>\n.NET 8.0<br \/>\n31.88 ns<br \/>\n1.00<\/p>\n<p>CounterAdd<br \/>\n.NET 9.0<br \/>\n13.93 ns<br \/>\n0.44<\/p>\n<h2>Peanut Butter<\/h2>\n<p>Throughout this post, I\u2019ve tried to group improvements by topic area in order to create a more fluid and interesting discussion. However, over the course of a year, with as vibrant a community as exists for .NET, and with the breadth of functionality that exists across the platform, there are invariably a large number of one-off PRs that improve this or that by a little. It\u2019s often challenging to imagine any one of these significantly \u201cmoving the needle,\u201d but altogether, such changes reduce the \u201cpeanut butter\u201d of performance overhead spread thinly across the libraries. In no particular order, here\u2019s a non-comprehensive look at some of these:<\/p>\n<p><strong>StreamWriter.Null.<\/strong> StreamWriter exposes a static Null field. It stores a StreamWriter instance that\u2019s intended to be a \u201cbit bucket,\u201d a sink you can write to that just ignores all of the data, ala \/dev\/null on Unix, Stream.Null, and so on. Unfortunately, the way it was implemented had two problems, one of which I\u2019m incredibly surprised took us this long to discover (as it\u2019s been this way for as long as .NET has existed). It was implemented as new StreamWriter(Stream.Null, &#8230;). 
All of the state tracking done in StreamWriter is not thread-safe, yet here this instance is exposed from a public static member, which means it should be thread-safe, and if multiple threads hammered that StreamWriter instance, it could result in really strange exceptions occurring, like arithmetic overflow. Performance-wise, it\u2019s also problematic, because even though actual writes to the underlying Stream are ignored, all of the work actually done by StreamWriter is still done, even though it\u2019s useless. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/98473\">dotnet\/runtime#98473<\/a> fixes both of those problems by creating an internal NullStreamWriter : StreamWriter type that overrides everything to be nops, and then Null is initialized to an instance of that.<br \/>\n\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221; &#8211;runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    [Benchmark]<br \/>\n    public void WriteLine() =&gt; StreamWriter.Null.WriteLine(&#8220;Hello, world!&#8221;);<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<br \/>\nRatio<\/p>\n<p>WriteLine<br \/>\n.NET 8.0<br \/>\n7.5164 ns<br \/>\n1.00<\/p>\n<p>WriteLine<br \/>\n.NET 9.0<br \/>\n0.0283 ns<br \/>\n0.004<\/p>\n<p><strong>NonCryptographicHashAlgorithm.Append{Async}.<\/strong> NonCryptographicHashAlgorithm is the base class in System.IO.Hashing for types like XxHash3 and Crc32.
One nice feature it provides is the ability to append an entire Stream\u2018s contents to it in a single call, e.g.<\/p>\n<p>XxHash3 hash = new();<br \/>\nhash.Append(someStream);<\/p>\n<p>The implementation of Append was fairly straightforward: rent a buffer from the ArrayPool and then in a loop repeatedly Stream.Read (or Stream.ReadAsync in the case of AppendAsync) into that buffer and Append that filled portion of the buffer. This has a couple of performance downsides, however. First, the buffer being rented was 4096 bytes. That\u2019s not tiny, but using a larger buffer can reduce the number of calls to the source stream being appended, which in turn can reduce any I\/O performed by that Stream. Second, many streams have optimized implementations for pushing all of their contents to a sink like this: CopyTo. MemoryStream.CopyTo, for example, will just perform a single write of its entire internal buffer to the Stream passed to its CopyTo. But even if a Stream doesn\u2019t override CopyTo, the base CopyTo implementation already provides such a copying loop, and it does so by default using a much larger rented buffer. As such, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103669\">dotnet\/runtime#103669<\/a> changes the implementation of Append to allocate a small temporary Stream object that wraps this NonCryptographicHashAlgorithm instance, and any calls to Write are just translated to be calls to Append. 
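The wrapper idea can be sketched roughly like this (a hypothetical type, not the actual internal implementation):

```csharp
using System;
using System.IO;
using System.IO.Hashing;

// A minimal write-only Stream that forwards every write into the hash.
// Handing it to source.CopyTo lets the source stream pick buffer sizes
// and use whatever fast path it has (e.g. MemoryStream's single write).
internal sealed class HashAppendStream : Stream
{
    private readonly NonCryptographicHashAlgorithm _hash;

    public HashAppendStream(NonCryptographicHashAlgorithm hash) => _hash = hash;

    public override void Write(byte[] buffer, int offset, int count) =>
        _hash.Append(buffer.AsSpan(offset, count));

    public override void Write(ReadOnlySpan<byte> buffer) => _hash.Append(buffer);

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override void Flush() { }

    // Members that don't apply to a write-only sink.
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```

With this shape, appending a whole stream to a hash reduces to something like `source.CopyTo(new HashAppendStream(hash))`, trading one small allocation for the source's optimized copy loop.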
This is a neat example where sometimes we actually choose to pay for a small, short-lived allocation in exchange for significant throughput benefits.<\/p>\n<p>\/\/ dotnet run -c Release -f net8.0 &#8211;filter &#8220;*&#8221;<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Configs;<br \/>\nusing BenchmarkDotNet.Environments;<br \/>\nusing BenchmarkDotNet.Jobs;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.IO.Hashing;<\/p>\n<p>var config = DefaultConfig.Instance<br \/>\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core80).WithNuGet(&#8220;System.IO.Hashing&#8221;, &#8220;8.0.0&#8221;).AsBaseline())<br \/>\n    .AddJob(Job.Default.WithRuntime(CoreRuntime.Core90).WithNuGet(&#8220;System.IO.Hashing&#8221;, &#8220;9.0.0-rc.1.24431.7&#8221;));<br \/>\nBenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args, config);<\/p>\n<p>[HideColumns(&#8220;Job&#8221;, &#8220;Error&#8221;, &#8220;StdDev&#8221;, &#8220;Median&#8221;, &#8220;RatioSD&#8221;, &#8220;NuGetReferences&#8221;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Stream _stream;<br \/>\n    private byte[] _bytes;<\/p>\n<p>    [GlobalSetup]<br \/>\n    public void Setup()<br \/>\n    {<br \/>\n        _bytes = new byte[1024 * 1024];<br \/>\n        new Random(42).NextBytes(_bytes);<\/p>\n<p>        string path = Path.GetRandomFileName();<br \/>\n        File.WriteAllBytes(path, _bytes);<br \/>\n        _stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 0, FileOptions.DeleteOnClose);<br \/>\n    }<\/p>\n<p>    [GlobalCleanup]<br \/>\n    public void Cleanup() =&gt; _stream.Dispose();<\/p>\n<p>    [Benchmark]<br \/>\n    public ulong Hash()<br \/>\n    {<br \/>\n        _stream.Position = 0;<br \/>\n        var hash = new XxHash3();<br \/>\n        hash.Append(_stream);<br \/>\n        return hash.GetCurrentHashAsUInt64();<br \/>\n    }<br \/>\n}<\/p>\n<p>Method<br \/>\nRuntime<br \/>\nMean<\/p>\n<p>Hash<br \/>\n.NET 8.0<br 
\/>\n91.60 us<\/p>\n<p>Hash<br \/>\n.NET 9.0<br \/>\n61.26 us<\/p>\n<p><strong>Unnecessary virtual.<\/strong> virtual methods have overhead. First, they\u2019re more expensive to invoke than non-virtual methods because it requires several indirections to find the actual target method to invoke (the actual target may differ based on the concrete type being used). And second, without a technology like dynamic PGO, virtual methods won\u2019t be inlined, because the compiler can\u2019t statically see which target should be inlined (and even if dynamic PGO makes such inlining possible for the most common type, there\u2019s still a check required to ensure it\u2019s ok to follow that path). As such, if things don\u2019t <em>need<\/em> to be virtual, it\u2019s better performance-wise for them to not be. And if such things are internal, unless they\u2019re actively being overridden by something, there\u2019s no reason to keep them virtual. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104453\">dotnet\/runtime#104453<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a>, <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104456\">dotnet\/runtime#104456<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a>, and <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104483\">dotnet\/runtime#104483<\/a> from <a href=\"https:\/\/github.com\/xtqqczze\">@xtqqczze<\/a> all address exactly such cases, removing virtual from a smattering of internal members that weren\u2019t being overridden. It might only save a few instructions here and there, but there\u2019s effectively no downside to such a change (other than some minimal code churn), a pure win.<br \/>\n<strong>ReadOnlySpan vs Span.<\/strong> We as developers like to protect ourselves from ourselves, for example making fields readonly to avoid accidentally changing them. 
Such changes can also have performance benefits, for example the JIT can better optimize static fields that are readonly than those that aren\u2019t. The same set of principles applies to Span&lt;T&gt; and ReadOnlySpan&lt;T&gt;. If a method doesn\u2019t need to mutate the contents of a collection being passed in, it\u2019s less accident-prone to use a ReadOnlySpan&lt;T&gt; rather than a Span&lt;T&gt;. It also signals to the caller that they don\u2019t need to be concerned about the data changing out from under them. Interestingly, here, too, there\u2019s both a correctness and a performance benefit to using ReadOnlySpan&lt;T&gt; instead of Span&lt;T&gt;. The implementations of these two types are almost word-for-word identical, the critical difference being whether the indexer returns a ref T or a ref readonly T. There is one additional line in Span&lt;T&gt;, however, that doesn\u2019t exist in ReadOnlySpan&lt;T&gt;. Span&lt;T&gt;\u2019s constructor has this one extra check:<br \/>\nif (!typeof(T).IsValueType &amp;&amp; array.GetType() != typeof(T[]))<br \/>\n    ThrowHelper.ThrowArrayTypeMismatchException();<\/p>\n<p>This check exists because of array covariance. Let\u2019s say you have this:<\/p>\n<p>Base[] array = new Derived[3];<\/p>\n<p>class Base { }<br \/>\nclass Derived : Base { }<\/p>\n<p>That code compiles and runs successfully, because .NET supports array covariance, meaning an array of a derived type can be used as an array of the base type. But there\u2019s an important catch here. Let\u2019s augment the example slightly:<\/p>\n<p>Base[] array = new Derived[3];<br \/>\narray[0] = new AlsoDerived(); \/\/ uh oh!<\/p>\n<p>class Base { }<br \/>\nclass Derived : Base { }<br \/>\nclass AlsoDerived : Base { }<\/p>\n<p>This will compile successfully, but at run-time it\u2019ll fail with an ArrayTypeMismatchException.
That\u2019s because it\u2019s trying to store an AlsoDerived instance into a Derived[], and there\u2019s no relationship between the two types that should permit that. The check required to enforce that comes at a cost, every single time you try to write into an array (except in cases where the compiler can prove it\u2019s safe and elide the costs). When Span&lt;T&gt; was introduced, the decision was made to hoist that check up to the span\u2019s constructor; that way, once you get a valid span, no such checking needs to be performed on every write, only once on construction. That\u2019s what that additional line of code is doing, checking to ensure that the specified T is the same as the provided array\u2019s element type. That means code like this will also throw an ArrayTypeMismatchException:<\/p>\n<p>Span&lt;Object&gt; span = new string[2]; \/\/ uh oh<\/p>\n<p>But that also means if you use Span&lt;T&gt; in situations where you could have used ReadOnlySpan&lt;T&gt;, there\u2019s a good chance you\u2019re unnecessarily incurring that check, which means you\u2019re both possibly going to hit unexpected exceptions depending on what arrays are passed in, and you\u2019re incurring a bit of peanut butter cost. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/104864\">dotnet\/runtime#104864<\/a> replaced a bunch of Span&lt;T&gt;s with ReadOnlySpan&lt;T&gt;s to reduce the chances we\u2019d incur such overhead, while also just improving the maintainability of the code.<\/p>\n<p><strong>readonly and const.<\/strong> In the same vein, changing fields that could be const to be so, changing non-readonly fields that could be readonly to be so, and removing unnecessary property setters is all goodness for maintainability while also having the chance of improving performance. Making fields const avoids unnecessary memory accesses while also allowing the JIT to better employ constant propagation. 
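For instance (an illustrative sketch, not code from the PR):

```csharp
public static class Block
{
    public const int Size = 64; // baked into call sites at compile time

    // Because Size is a compile-time constant, it's embedded directly in the
    // generated code as an immediate and the division can be strength-reduced;
    // a mutable static field would instead require a memory load on every call.
    public static int Count(int length) => (length + Size - 1) / Size;
}
```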
And making static fields readonly enables the JIT to treat them as if they were const in tier 1 compilation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100728\">dotnet\/runtime#100728<\/a> updates hundreds of occurrences.<br \/>\n<strong>MemoryCache.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103992\">dotnet\/runtime#103992<\/a> from <a href=\"https:\/\/github.com\/ADNewsom09\">@ADNewsom09<\/a> addresses an inefficiency in Microsoft.Extensions.Caching.Memory. If multiple concurrent operations end up triggering the cache\u2019s compaction operation, many of the involved threads can all end up duplicating each other\u2019s work. The fix is to simply have only one of the threads do the compaction operation.<br \/>\n<strong>BinaryReader.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/80331\">dotnet\/runtime#80331<\/a> from <a href=\"https:\/\/github.com\/teo-tsirpanis\">@teo-tsirpanis<\/a> made the allocations in BinaryReader that are relevant only to reading text happen lazily, only when such reading occurs. If the reader is never used for reading text, the application won\u2019t need to pay for the allocation.<br \/>\n<strong>ArrayBufferWriter.<\/strong> <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/88009\">dotnet\/runtime#88009<\/a> from <a href=\"https:\/\/github.com\/AlexRadch\">@AlexRadch<\/a> adds a new ResetWrittenCount method to ArrayBufferWriter. ArrayBufferWriter.Clear already exists, but in addition to setting the written count to 0, it also clears the underlying buffer. In many situations, that clearing is unnecessary overhead, so ResetWrittenCount allows it to be avoided. (There was an interesting debate about whether such a new method is even necessary, and whether Clear could just be changed to remove the zeroing. 
But concerns about stale data finding their way into consuming code as corrupted data led to the new method being added instead.)<br \/>\n<strong>Span-based File methods.<\/strong> The static File class provides simple helpers for interacting with files, e.g. File.WriteAllText. Historically, these methods have worked with strings and arrays. That means, though, that if someone instead has a span, they either can\u2019t use these simple helpers or they need to pay to create a string or an array from the span. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/103308\">dotnet\/runtime#103308<\/a> adds new span-based overloads so that developers don\u2019t need to choose here between simplicity and performance.<br \/>\n<strong>string concat vs Append.<\/strong> string concatenation inside of a loop is a well-known no-no, as in the extreme it can lead to significant O(N^2) costs. Such a string concatenation was occurring, however, inside of MailAddressCollection, where an encoded version of every address in the collection was being appended onto a string using string concatenation. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/95760\">dotnet\/runtime#95760<\/a> from <a href=\"https:\/\/github.com\/YohDeadfall\">@YohDeadfall<\/a> changed that to use a builder instead.<br \/>\n<strong>Closures.<\/strong> The config source generator was introduced in .NET 8 to significantly improve the performance of configuration binding, while also making it friendlier to Native AOT. It achieved both. However, it can be improved further. There\u2019s an unanticipated extra allocation that occurs on success paths that\u2019s only relevant to failure paths, because of how the code is being generated. For a call site like this:<br \/>\npublic static void M(IConfiguration configuration, C1 c) =&gt; configuration.Bind(c);<\/p>\n<p>the source generator would emit a method like this:<\/p>\n<p>public static void BindCore(IConfiguration configuration, ref C1 obj, BinderOptions? 
binderOptions)<br \/>\n{<br \/>\n    ValidateConfigurationKeys(typeof(C1), s_configKeys_C1, configuration, binderOptions);<br \/>\n    if (configuration[&#8220;Value&#8221;] is string value15)<br \/>\n        obj.Value = ParseInt(value15, () =&gt; configuration.GetSection(&#8220;Value&#8221;).Path);<br \/>\n}<\/p>\n<p>That lambda being passed to the ParseInt helper is accessing configuration, which is defined outside of the lambda as a parameter. To get that data into the lambda, the compiler allocates a \u201cdisplay class\u201d to store the information, with the body of the lambda translated into a method on that display class. That display class gets allocated at the beginning of the scope that contains the data, which in this case means it\u2019s allocated at the beginning of the BindCore method. That means it\u2019s allocated regardless of whether the if block is true, and even if ParseInt is called, the delegate passed to it is only ever invoked when there\u2019s a failure. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/100257\">dotnet\/runtime#100257<\/a> from <a href=\"https:\/\/github.com\/pedrobsaila\">@pedrobsaila<\/a> reworks the source generator code so that this allocation isn\u2019t incurred.<\/p>\n<p><strong>Stream.Read\/Write Span Overrides.<\/strong> Streams that don\u2019t override the span-based Read\/Write methods end up utilizing the base implementations, which allocate. There are a ton of Stream implementations in dotnet\/runtime, and we\u2019ve overridden such methods almost everywhere, but now and again we find one that slipped through. 
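As a sketch of the pattern involved (this is a hypothetical illustrative type, not the actual StreamOnSqlBytes fix), a stream that overrides the span-based Read\/Write in addition to the array-based overloads lets span-wielding callers bypass the base implementations entirely:

```csharp
using System;
using System.IO;

// Hypothetical minimal in-memory stream. Overriding the span-based
// Read/Write means callers passing spans hit this code directly instead
// of Stream's base implementations, which route through a temporary
// array and an extra copy.
sealed class InMemoryByteStream : Stream
{
    private byte[] _buffer = Array.Empty<byte>();
    private int _position;

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => _buffer.Length;
    public override long Position { get => _position; set => throw new NotSupportedException(); }

    // Array-based overloads delegate to the span-based ones.
    public override int Read(byte[] buffer, int offset, int count) =>
        Read(buffer.AsSpan(offset, count));

    public override int Read(Span<byte> buffer) // span-based override
    {
        int n = Math.Min(buffer.Length, _buffer.Length - _position);
        _buffer.AsSpan(_position, n).CopyTo(buffer);
        _position += n;
        return n;
    }

    public override void Write(byte[] buffer, int offset, int count) =>
        Write(buffer.AsSpan(offset, count));

    public override void Write(ReadOnlySpan<byte> buffer) // span-based override
    {
        int oldLen = _buffer.Length;
        Array.Resize(ref _buffer, oldLen + buffer.Length);
        buffer.CopyTo(_buffer.AsSpan(oldLen));
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}

class Program
{
    static void Main()
    {
        var s = new InMemoryByteStream();
        s.Write(stackalloc byte[] { 1, 2, 3 }); // no byte[] materialized by the caller
        Span<byte> dest = stackalloc byte[3];
        Console.WriteLine(s.Read(dest)); // number of bytes read
    }
}
```

Without the span overrides, calls like these still work, but each goes through the base Stream implementation, which stages the data in a temporary array and copies, which is exactly the overhead such overrides remove.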
<a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/86674\">dotnet\/runtime#86674<\/a> from <a href=\"https:\/\/github.com\/hrrrrustic\">@hrrrrustic<\/a> fixed one such case on the StreamOnSqlBytes type.<br \/>\n<strong>Globalization Arrays.<\/strong> Every NumberFormatInfo object defaults its NumberGroupSizes, CurrentGroupSizes, and PercentGroupSizes to each be new instances of new int[] { 3 } (even if subsequent initialization overwrites them). And yet these arrays are never handed out to consumers: the properties that expose them make defensive copies. Which means all of these can just refer to the same shared singleton array. The same is true for NativeDigits, which is initialized first to a new array of the numbers 0 through 9. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/93117\">dotnet\/runtime#93117<\/a> addresses all of these by creating and using such singletons.<br \/>\n<strong>ColorTranslator.ToWin32.<\/strong> There\u2019s a <a href=\"https:\/\/learn.microsoft.com\/dotnet\/standard\/design-guidelines\/property\">.NET design guideline<\/a> that says properties should be like smarter fields. The resulting expectation is that they should be cheap, like just accessing a field or doing a very simple calculation over a field. Unfortunately, we don\u2019t always follow our own guidance, and there exist some properties that really, really look like they should be trivial but are actually sometimes not. System.Drawing.Color is a good example. A very reasonable mental model for Color (which according to the docs \u201cRepresents an ARGB (alpha, red, green, blue) color\u201d) is that it\u2019d just be four byte values, one for each channel, either in their own fields or packed together into an int. Unfortunately, it\u2019s not quite as simple as that. 
Color <em>can<\/em> be that, such as if it\u2019s constructed using Color.FromArgb(uint), but it can also be used to represent a \u201cknown color,\u201d as is evidenced by the SystemColors type having a bunch of static properties (e.g. SystemColors.Control) that return a color for the underlying OS. And even there, you might think \u201coh, ok, well those properties must be what Stephen is referring to, they must call out to the OS to get the color, they probably do so and then use FromArgb.\u201d And again, that\u2019s a very intuitive mental model, and again it\u2019s not what actually happens. Those properties actually are cheap; all they do is construct a Color with an enum value corresponding to the system color. Then where is the actual OS color value retrieved, you ask? As part of the R, G, B, and A properties on Color! That means if you access each of these properties, as ColorTranslator was doing in a variety of its methods, you\u2019re making three or four times as many P\/Invokes as you\u2019d otherwise need to. <a href=\"https:\/\/github.com\/dotnet\/runtime\/pull\/106042\">dotnet\/runtime#106042<\/a> fixes this for ColorTranslator, but it serves as a good reminder why such guidelines exist. 
(This benchmark is Windows-specific as SystemColors doesn\u2019t currently rely on OS information for Linux or macOS.)<br \/>\n\/\/ Windows-specific (it works on Linux and macOS, but doesn&#8217;t demonstrate the same thing.)<br \/>\n\/\/ dotnet run -c Release -f net8.0 --filter &quot;*&quot; --runtimes net8.0 net9.0<\/p>\n<p>using BenchmarkDotNet.Attributes;<br \/>\nusing BenchmarkDotNet.Running;<br \/>\nusing System.Drawing;<\/p>\n<p>BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);<\/p>\n<p>[MemoryDiagnoser(false)]<br \/>\n[HideColumns(&quot;Job&quot;, &quot;Error&quot;, &quot;StdDev&quot;, &quot;Median&quot;, &quot;RatioSD&quot;)]<br \/>\npublic class Tests<br \/>\n{<br \/>\n    private Color _color = SystemColors.Control;<\/p>\n<p>    [Benchmark]<br \/>\n    public int ColorToWin32() =&gt; ColorTranslator.ToWin32(_color);<br \/>\n}<\/p>\n<p>Method | Runtime | Mean | Ratio<br \/>\nColorToWin32 | .NET 8.0 | 11.263 ns | 1.00<br \/>\nColorToWin32 | .NET 9.0 | 4.711 ns | 0.42<\/p>\n<h2>What\u2019s Next?<\/h2>\n<p>Maybe one more poem? An acrostic this time:<\/p>\n<p>Driving innovation with unmatched speed,<br \/>\nOpening doors to what developers need.<br \/>\nTurbocharged perf, breaking the mold,<br \/>\nNew benchmarks surpassed, metrics so bold.<br \/>\nEmpowering coders, dreams take flight,<br \/>\nTransforming visions with .NET might.<\/p>\n<p>Navigating challenges with precision and flair,<br \/>\nInspiring creativity, improvements everywhere.<br \/>\nNurturing growth, pushing limits high,<br \/>\nElevating success, reaching for the sky.<\/p>\n<p>Several hundred pages later and still not a poet. Oh well.<\/p>\n<p>I\u2019m asked from time to time why I invest in writing these \u201cPerformance Improvements in .NET\u201d posts. There\u2019s no one answer. 
In no particular order:<\/p>\n<p><strong>Personal learning.<\/strong> I pay close attention throughout the year to all of the various performance improvements happening in the release, sometimes from a distance, sometimes as the one making the changes. Writing this post serves as a forcing-function for me to revisit them all and really internalize the changes that were made and their relevance to the broader picture. It\u2019s a learning opportunity for me.<br \/>\n<strong>Testing.<\/strong> As one of the developers on the team recently said to me, \u201cI like this time of the year when you give our optimizations a stress-test and uncover inefficiencies.\u201d Every year when I\u2019m going through the improvements, just the act of re-validating the improvements often highlights regressions, cases that were missed, or further opportunities that can be addressed in the future. Again, it\u2019s a forcing function to do more testing and with a fresh set of eyes.<br \/>\n<strong>Thanks.<\/strong> Many of the performance improvements in each release aren\u2019t from the folks working on the .NET team or even at Microsoft. They\u2019re from passionate and talented individuals throughout the global .NET ecosystem, and I like to highlight their contributions. That\u2019s why throughout the post you see me calling out when PRs are from folks not employed by Microsoft as full-time employees. In this post, that accounts for ~20% of all the cited PRs. Amazing. Heartfelt thanks to everyone who\u2019s worked to make .NET better for everyone.<br \/>\n<strong>Excitement.<\/strong> Developers often have conflicting opinions about the speed at which .NET is advancing, some really appreciating the frequent introduction of new features, others concerned that they can\u2019t keep up with all of the newness. But the one thing everyone seems to agree on is the love of \u201cfree perf,\u201d and that\u2019s a lot of what these posts talk about. 
.NET gets faster and faster every release, and it\u2019s exciting to see a tour through the highlights collected all in one place.<br \/>\n<strong>Education.<\/strong> There are multiple forms of performance improvements covered throughout the post. Some of the improvements you get completely for free just by upgrading the runtime; the implementations in the runtime are better, and so when you run on them, your code just gets better, too. Some of the improvements you get completely for free by upgrading the runtime <em>and<\/em> recompiling; the C# compiler itself generates better code, often taking advantage of newer surface area exposed in the runtime. And other improvements are new features that, in addition to being utilized by the runtime and compiler, you can utilize directly to make your code even faster. Educating about those capabilities and why and where you\u2019d want to utilize them is important to me. But beyond the new features, the techniques employed in making all of the rest of the optimizations throughout the runtime are often more broadly applicable. By learning how these optimizations are applied in the runtime, you can extrapolate and apply similar techniques to your own code, making it that much faster.<\/p>\n<p>If you\u2019ve read this far, I hope you indeed have learned something and are excited about the .NET 9 release. As is likely obvious from my enthusiastic ramblings and awkward poetry, I\u2019m incredibly excited about .NET, everything that\u2019s been achieved in .NET 9, and the future of the platform. If you\u2019re already using .NET 8, upgrading to .NET 9 should be a breeze (the <a href=\"https:\/\/dotnet.microsoft.com\/download\/dotnet\/9.0\">.NET 9 Release Candidate<\/a> is available for download), and I\u2019d love it if you\u2019d do so and share with us any successes you achieve or issues you face along the way. We\u2019d love to learn from you. 
And if you have ideas about how to further improve the performance of .NET for .NET 10, please join us in <a href=\"https:\/\/github.com\/dotnet\/runtime\">dotnet\/runtime<\/a>.<\/p>\n<p>Happy coding!<\/p>\n<p>The post <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\/performance-improvements-in-net-9\/\">Performance Improvements in .NET 9<\/a> appeared first on <a href=\"https:\/\/devblogs.microsoft.com\/dotnet\">.NET Blog<\/a>.<\/p>