Sorting Networks Part 2: From 27/28 to a Source Generator for Sizes 2–64

May 2026 — Jonathan Peppers

In Part 1, I took a CS paper about depth-13 sorting networks, implemented it in C# with GitHub Copilot, added SIMD vectorization, and got up to 41x faster than Array.Sort — all in 3 days with 59 PRs. Part 1 ended with an open question:

Could this go in the BCL? Sorting networks need a custom network per array length — you’d need networks for every size, not just 27/28. A C# Source Generator inside a NuGet package may be a better fit for this narrower use case.

So that’s exactly what I did. The 27/28-only hardcoded sorter is now a Roslyn source generator that emits optimized sort methods for any size from 2 to 64, for any type. It’s published as a NuGet package — SortingNetworks.SourceGen — and the initial release is 0.1.0 on GitHub.

GitHub repo: jonathanpeppers/SortingNetworks

What Changed Since Part 1

Part 1’s implementation had a scripts/generate_unrolled.cs script that produced .cs files checked into the repo — a static code generator. It only worked for sizes 27 and 28, using the networks from the paper.

The new approach is a Roslyn incremental source generator. You add one NuGet package, decorate a partial class with attributes, and the generator produces optimized sort methods at compile time — no checked-in generated code, no scripts to run.

From PR #63 to PR #119, here’s what was built:

Usage

dotnet add package SortingNetworks.SourceGen
using SortingNetworks;

[SortingNetwork(27, typeof(int))]
[SortingNetwork(28, typeof(int))]
[SortingNetwork(10, typeof(double))]
partial class MySorter { }

// Sort a span of exactly 27 ints — uses sorting network
Span<int> data = [27, 26, 25, /* ... */ 3, 2, 1];
MySorter.Sort(data);

// Array overload works too
int[] array = [3, 1, 4, 1, 5, 9, 2, 6, /* ... 27 elements */];
MySorter.Sort(array);

// IComparer<T> for custom ordering
MySorter.Sort(data, Comparer<int>.Create((a, b) => b.CompareTo(a)));

That’s it. The source generator does the rest at compile time: picks the optimal sorting network for each declared size, emits unrolled scalar compare-and-swap code, and — where profitable — emits SIMD-vectorized paths with runtime hardware detection.

Custom Types

Any value type implementing IComparable<T> gets unrolled .CompareTo() methods — the JIT devirtualizes and inlines these on value types:

public record struct Point(int X, int Y) : IComparable<Point>
{
    public int CompareTo(Point other)
    {
        int cmp = X.CompareTo(other.X);
        return cmp != 0 ? cmp : Y.CompareTo(other.Y);
    }
}

[SortingNetwork(27, typeof(Point))]
partial class MySorter { }

Span<Point> points = /* ... */;
MySorter.Sort(points); // unrolled CompareTo

For types that don’t implement IComparable<T>, the IComparer<T> overload is generated — you pass a comparer at runtime.

Handling Unsupported Sizes

Define an OnFallback method to handle sizes outside your declared networks:

[SortingNetwork(27, typeof(int))]
partial class MySorter
{
    static void OnFallback(Span<int> span)
    {
        span.Sort(); // delegate to BCL for other sizes
    }
}

MySorter.Sort(data);       // size 27 → sorting network
MySorter.Sort(otherData);  // any other size → OnFallback

Without OnFallback, unsupported sizes throw ArgumentException.

The Architecture Challenge

Part 1’s code generator was a script that wrote .cs files to disk. That works fine for a single repo with two sizes, but it doesn’t scale:

A Roslyn source generator solves all of this. The generator runs inside the compiler, reads your [SortingNetwork] attributes, and emits code as part of the build. IntelliSense works immediately. No scripts. No generated files to check in.

The generator has these main components:

Component Purpose
SortingNetworkGenerator.cs Roslyn IIncrementalGenerator entry point — reads attributes, dispatches to emitters
NetworkDatabase.cs Database of best-known sorting networks for sizes 2–64
BatcherNetworkBuilder.cs Builds Batcher’s odd-even merge sort networks algorithmically
ScalarEmitter.cs Emits unrolled compare-and-swap code for all types
SimdX86Emitter.cs Emits AVX2 and AVX-512 SIMD paths
SimdArmEmitter.cs Emits ARM64 AdvSimd/NEON paths

The NetworkDatabase holds the actual comparator sequences. For sizes 2–22, networks are generated at compile time via Batcher’s algorithm. For sizes 23–32, optimal/best-known networks are embedded (including the depth-13 networks from the paper for 27/28). For sizes 33–64, best-known networks from Dobbelaere’s SorterHunter project are used.

Expanding to All Sizes

Part 1 only supported sizes 27 and 28. Getting to 2–64 required several things:

Sizes 2–22 — Batcher’s odd-even merge sort: These small sizes use a well-known constructive algorithm. Batcher’s method isn’t optimal (it uses more comparators than necessary), but it produces valid networks with reasonable depth for small inputs. The BatcherNetworkBuilder generates these programmatically — no hardcoded data needed.

Sizes 23–26 and 29–32 — Best-known optimal networks (PR #74): These fill the gap between the Batcher range and the paper’s sizes. They come from research into optimal sorting networks and are embedded directly in NetworkDatabase.cs.

Sizes 33–64 — SorterHunter (PR #100): Bert Dobbelaere’s SorterHunter project maintains a database of best-known sorting networks. These are the current best results from the research community.

SIMD for Every Size

The Part 1 SIMD paths were hardcoded for 27/28 elements. Making the SIMD emitters work for arbitrary sizes 2–64 was a significant effort.

x86: AVX2 and AVX-512

The SimdX86Emitter generates hardware-specific SIMD code based on element type and size:

ARM64: AdvSimd/NEON

The SimdArmEmitter generates NEON paths using Vector128 registers:

A critical optimization on ARM64 was TBL1 vs TBL4 (PR #101). When all elements of a shuffled vector come from a single source register, the generator emits Vector128.Shuffle (TBL1) instead of VectorTableLookup (TBL4). On Ampere Altra/Neoverse cores, TBL4 has significantly higher latency than TBL1 — this optimization was the difference between being slower and 1.6x faster than Array.Sort for char on those cores.

Benchmark Results

All benchmarks run on .NET 10 with BenchmarkDotNet, zero allocations confirmed via [MemoryDiagnoser].

x86 AVX2 (Intel Core i9-9900K)

Type Array.Sort (27) GeneratedSort (27) Speedup
byte 1,303 ns 38 ns 34x
sbyte 1,479 ns 39 ns 38x
float 1,598 ns 66 ns 24x
double 1,639 ns 100 ns 16x
short 1,407 ns 100 ns 14x
long 1,465 ns 100 ns 15x
int 105 ns 54 ns 1.9x
string 1,076 ns 514 ns 2.1x
decimal 2,376 ns 794 ns 3.0x

x86 AVX-512 VBMI (AMD EPYC 9V74)

With AVX-512 VBMI, all byte/sbyte sizes fit in a single Vector512:

Type Size Array.Sort GeneratedSort Speedup
byte 27 1,250 ns 53 ns 24x
byte 32 1,516 ns 54 ns 28x
byte 34 1,759 ns 64 ns 27x
sbyte 38 2,160 ns 68 ns 32x

ARM64 (Apple M1, AdvSimd/NEON)

Type Array.Sort (27) GeneratedSort (27) Speedup
byte 1,132 ns 31 ns 37x
sbyte 1,206 ns 32 ns 38x
short 1,124 ns 48 ns 23x
float 1,363 ns 74 ns 18x
double 1,461 ns 110 ns 13x

Sizes 33–64 (ARM64, SIMD-accelerated)

Type Size Array.Sort GeneratedSort Speedup
byte 34 2,831 ns 85 ns 33x
sbyte 38 3,340 ns 94 ns 36x
short 40 3,439 ns 164 ns 21x
float 36 3,269 ns 220 ns 15x

Custom value types

Type Array.Sort (27) GeneratedSort (27) Speedup
Record (2× int) 3,839 ns 1,129 ns 3.4x

The Journey: PRs #63–119

Part 1 ended at PR #59 with 59 PRs in 3 days. The source generator work added another 21 PRs over the following two weeks, bringing the total to 80 merged PRs.

Phase 1 — Source Generator Foundation (Apr 25)

Phase 2 — Expand to All Sizes (Apr 28 – May 2)

Phase 3 — Type System & API (May 2–3)

Phase 4 — Ship It (May 3–5)

Lessons Learned

Source generators have sharp edges

Incremental source generators need careful equality implementations. PR #114 fixed a bug where the generator wasn’t caching properly — the Equatable model types needed correct Equals/GetHashCode implementations or the generator would re-run on every keystroke, killing IDE performance.

Network selection matters more than you’d think

Batcher’s odd-even merge sort is elegant and easy to generate algorithmically, but it’s not optimal — it uses more comparators than necessary. For sizes 23+, switching to best-known networks from research databases (the arXiv paper, SorterHunter) saved layers and comparators, which translates directly to fewer SIMD instructions or fewer branches in the scalar path.

SIMD register strategy varies by type width

The generator can’t use a one-size-fits-all SIMD approach. byte fits 64 elements in one Vector512, but double fits only 8 in a Vector512 — meaning 27 doubles need four registers and cross-vector shuffles. The emitters have per-type logic to pick the right register count and shuffle strategy.

ARM64 is not one architecture

The TBL1 optimization story (PR #101) was a reminder that ARM64 performance varies wildly between implementations. Apple Silicon handles TBL4 (4-register table lookup) efficiently, but Ampere Neoverse does not. Without the TBL1 optimization for intra-register shuffles, the SIMD path was actually slower than Array.Sort on Ampere for char.

What’s Next

The source generator is published and usable today. Areas for future work: