Sorting Networks Part 2: From 27/28 to a Source Generator for Sizes 2–64
May 2026 — Jonathan Peppers
In Part 1, I took a CS paper about depth-13 sorting networks, implemented it in C# with GitHub Copilot, added SIMD vectorization, and got up to 41x faster than Array.Sort — all in 3 days with 59 PRs. Part 1 ended with an open question:
Could this go in the BCL? Sorting networks need a custom network per array length — you’d need networks for every size, not just 27/28. A C# Source Generator inside a NuGet package may be a better fit for this narrower use case.
So that’s exactly what I did. The 27/28-only hardcoded sorter is now a Roslyn source generator that emits optimized sort methods for any size from 2 to 64, for any type. It’s published as a NuGet package — SortingNetworks.SourceGen — and the initial release is 0.1.0 on GitHub.
GitHub repo: jonathanpeppers/SortingNetworks
What Changed Since Part 1
Part 1’s implementation had a scripts/generate_unrolled.cs script that produced .cs files checked into the repo — a static code generator. It only worked for sizes 27 and 28, using the networks from the paper.
The new approach is a Roslyn incremental source generator. You add one NuGet package, decorate a partial class with attributes, and the generator produces optimized sort methods at compile time — no checked-in generated code, no scripts to run.
From PR #63 to PR #119, here’s what was built:
- Source generator — Roslyn
IIncrementalGeneratorthat reads[SortingNetwork]attributes and emits sort methods - Sizes 2–64 — Using Batcher’s odd-even merge networks (2–22), best-known optimal networks (23–32), and Dobbelaere’s SorterHunter networks (33–64)
- All types — Every primitive,
decimal,string, customIComparable<T>value types, and arbitrary types viaIComparer<T> - Full SIMD coverage — x86 AVX2, AVX-512 (including VBMI), ARM64 AdvSimd/NEON — all generated per-type, per-size
- NuGet package — Published as
SortingNetworks.SourceGenwith public API analyzers and proper versioning
Usage
dotnet add package SortingNetworks.SourceGen
using SortingNetworks;
[SortingNetwork(27, typeof(int))]
[SortingNetwork(28, typeof(int))]
[SortingNetwork(10, typeof(double))]
partial class MySorter { }
// Sort a span of exactly 27 ints — uses sorting network
Span<int> data = [27, 26, 25, /* ... */ 3, 2, 1];
MySorter.Sort(data);
// Array overload works too
int[] array = [3, 1, 4, 1, 5, 9, 2, 6, /* ... 27 elements */];
MySorter.Sort(array);
// IComparer<T> for custom ordering
MySorter.Sort(data, Comparer<int>.Create((a, b) => b.CompareTo(a)));
That’s it. The source generator does the rest at compile time: picks the optimal sorting network for each declared size, emits unrolled scalar compare-and-swap code, and — where profitable — emits SIMD-vectorized paths with runtime hardware detection.
Custom Types
Any value type implementing IComparable<T> gets unrolled .CompareTo() methods — the JIT devirtualizes and inlines these on value types:
public record struct Point(int X, int Y) : IComparable<Point>
{
public int CompareTo(Point other)
{
int cmp = X.CompareTo(other.X);
return cmp != 0 ? cmp : Y.CompareTo(other.Y);
}
}
[SortingNetwork(27, typeof(Point))]
partial class MySorter { }
Span<Point> points = /* ... */;
MySorter.Sort(points); // unrolled CompareTo
For types that don’t implement IComparable<T>, the IComparer<T> overload is generated — you pass a comparer at runtime.
Handling Unsupported Sizes
Define an OnFallback method to handle sizes outside your declared networks:
[SortingNetwork(27, typeof(int))]
partial class MySorter
{
static void OnFallback(Span<int> span)
{
span.Sort(); // delegate to BCL for other sizes
}
}
MySorter.Sort(data); // size 27 → sorting network
MySorter.Sort(otherData); // any other size → OnFallback
Without OnFallback, unsupported sizes throw ArgumentException.
The Architecture Challenge
Part 1’s code generator was a script that wrote .cs files to disk. That works fine for a single repo with two sizes, but it doesn’t scale:
- Users would need to run a script and check in generated code
- Adding a new size means re-running the script
- No IDE integration — no IntelliSense on generated methods until you build
A Roslyn source generator solves all of this. The generator runs inside the compiler, reads your [SortingNetwork] attributes, and emits code as part of the build. IntelliSense works immediately. No scripts. No generated files to check in.
The generator has these main components:
| Component | Purpose |
|---|---|
SortingNetworkGenerator.cs |
Roslyn IIncrementalGenerator entry point — reads attributes, dispatches to emitters |
NetworkDatabase.cs |
Database of best-known sorting networks for sizes 2–64 |
BatcherNetworkBuilder.cs |
Builds Batcher’s odd-even merge sort networks algorithmically |
ScalarEmitter.cs |
Emits unrolled compare-and-swap code for all types |
SimdX86Emitter.cs |
Emits AVX2 and AVX-512 SIMD paths |
SimdArmEmitter.cs |
Emits ARM64 AdvSimd/NEON paths |
The NetworkDatabase holds the actual comparator sequences. For sizes 2–22, networks are generated at compile time via Batcher’s algorithm. For sizes 23–32, optimal/best-known networks are embedded (including the depth-13 networks from the paper for 27/28). For sizes 33–64, best-known networks from Dobbelaere’s SorterHunter project are used.
Expanding to All Sizes
Part 1 only supported sizes 27 and 28. Getting to 2–64 required several things:
Sizes 2–22 — Batcher’s odd-even merge sort: These small sizes use a well-known constructive algorithm. Batcher’s method isn’t optimal (it uses more comparators than necessary), but it produces valid networks with reasonable depth for small inputs. The BatcherNetworkBuilder generates these programmatically — no hardcoded data needed.
Sizes 23–26 and 29–32 — Best-known optimal networks (PR #74): These fill the gap between the Batcher range and the paper’s sizes. They come from research into optimal sorting networks and are embedded directly in NetworkDatabase.cs.
Sizes 33–64 — SorterHunter (PR #100): Bert Dobbelaere’s SorterHunter project maintains a database of best-known sorting networks. These are the current best results from the research community.
SIMD for Every Size
The Part 1 SIMD paths were hardcoded for 27/28 elements. Making the SIMD emitters work for arbitrary sizes 2–64 was a significant effort.
x86: AVX2 and AVX-512
The SimdX86Emitter generates hardware-specific SIMD code based on element type and size:
- byte/sbyte: AVX-512 VBMI uses a single
Vector512<byte>withPermuteVar64x8for all sizes up to 64. On CPUs without VBMI, AVX2 falls back toVector256<byte>for sizes up to 32. - short/ushort/char: AVX-512BW uses
PermuteVar32x16(single register for sizes ≤32) orPermuteVar32x16x2(two registers for sizes 33–64). - int/uint/float: AVX2 uses four
Vector256registers withPermuteVar8x32cross-vector shuffles. AVX-512F uses twoVector512registers withPermuteVar16x32x2. - long/ulong/double: AVX-512F uses four
Vector512registers. AVX2 fallback uses sevenVector256<double>withPermute4x64.
ARM64: AdvSimd/NEON
The SimdArmEmitter generates NEON paths using Vector128 registers:
- byte/sbyte: Up to four
Vector128registers with TBL4 lookups (sizes up to 64) - short/ushort/char: Up to eight registers with multi-stage TBL/TBX chains (sizes up to 64)
- int/uint/float: Up to eight registers with two-stage TBL/TBX (sizes up to 32; scalar fallback above 32)
A critical optimization on ARM64 was TBL1 vs TBL4 (PR #101). When all elements of a shuffled vector come from a single source register, the generator emits Vector128.Shuffle (TBL1) instead of VectorTableLookup (TBL4). On Ampere Altra/Neoverse cores, TBL4 has significantly higher latency than TBL1 — this optimization was the difference between being slower and 1.6x faster than Array.Sort for char on those cores.
Benchmark Results
All benchmarks run on .NET 10 with BenchmarkDotNet, zero allocations confirmed via [MemoryDiagnoser].
x86 AVX2 (Intel Core i9-9900K)
| Type | Array.Sort (27) | GeneratedSort (27) | Speedup |
|---|---|---|---|
| byte | 1,303 ns | 38 ns | 34x |
| sbyte | 1,479 ns | 39 ns | 38x |
| float | 1,598 ns | 66 ns | 24x |
| double | 1,639 ns | 100 ns | 16x |
| short | 1,407 ns | 100 ns | 14x |
| long | 1,465 ns | 100 ns | 15x |
| int | 105 ns | 54 ns | 1.9x |
| string | 1,076 ns | 514 ns | 2.1x |
| decimal | 2,376 ns | 794 ns | 3.0x |
x86 AVX-512 VBMI (AMD EPYC 9V74)
With AVX-512 VBMI, all byte/sbyte sizes fit in a single Vector512:
| Type | Size | Array.Sort | GeneratedSort | Speedup |
|---|---|---|---|---|
| byte | 27 | 1,250 ns | 53 ns | 24x |
| byte | 32 | 1,516 ns | 54 ns | 28x |
| byte | 34 | 1,759 ns | 64 ns | 27x |
| sbyte | 38 | 2,160 ns | 68 ns | 32x |
ARM64 (Apple M1, AdvSimd/NEON)
| Type | Array.Sort (27) | GeneratedSort (27) | Speedup |
|---|---|---|---|
| byte | 1,132 ns | 31 ns | 37x |
| sbyte | 1,206 ns | 32 ns | 38x |
| short | 1,124 ns | 48 ns | 23x |
| float | 1,363 ns | 74 ns | 18x |
| double | 1,461 ns | 110 ns | 13x |
Sizes 33–64 (ARM64, SIMD-accelerated)
| Type | Size | Array.Sort | GeneratedSort | Speedup |
|---|---|---|---|---|
| byte | 34 | 2,831 ns | 85 ns | 33x |
| sbyte | 38 | 3,340 ns | 94 ns | 36x |
| short | 40 | 3,439 ns | 164 ns | 21x |
| float | 36 | 3,269 ns | 220 ns | 15x |
Custom value types
| Type | Array.Sort (27) | GeneratedSort (27) | Speedup |
|---|---|---|---|
| Record (2× int) | 3,839 ns | 1,129 ns | 3.4x |
The Journey: PRs #63–119
Part 1 ended at PR #59 with 59 PRs in 3 days. The source generator work added another 21 PRs over the following two weeks, bringing the total to 80 merged PRs.
Phase 1 — Source Generator Foundation (Apr 25)
- PR #63: Initial Roslyn source generator —
[SortingNetwork]attribute, scalar emitter for all primitives - PR #70: Array overloads (
T[]in addition toSpan<T>) - PR #71: ARM SIMD emitter for the source generator
- PR #73: AVX2 fallback for 16-bit types
- PR #75: AVX2 path for
double
Phase 2 — Expand to All Sizes (Apr 28 – May 2)
- PR #74: Add optimal networks for sizes 23–26 and 29–32
- PR #100: Add best-known networks for sizes 33–64 from SorterHunter
- PR #88: Expand SIMD emitters to support sizes 33–64
- PR #89: String type support in the source generator
- PR #81:
IComparer<T>overloads for custom ordering
Phase 3 — Type System & API (May 2–3)
- PR #98: Support arbitrary types via
IComparer<T>(not justIComparable<T>) - PR #99:
OnFallbackAPI for configurable behavior on unsupported sizes - PR #103: Fix
nint/nuintscalar perf by delegating to fixed-size types - PR #113: Add
decimalas a first-class type - PR #107: AVX-512 VBMI as primary path for all byte/sbyte sizes
Phase 4 — Ship It (May 3–5)
- PR #90: Remove the old non-generator sorter, restructure for NuGet
- PR #97: Add sample project to verify end-to-end
- PR #109: PublicApi analyzer for API surface tracking
- PR #114: Fix incremental generator caching for IDE performance
- PR #116: Rename NuGet package to
SortingNetworks.SourceGen - PR #118: Target
net8.0as minimum TFM
Lessons Learned
Source generators have sharp edges
Incremental source generators need careful equality implementations. PR #114 fixed a bug where the generator wasn’t caching properly — the Equatable model types needed correct Equals/GetHashCode implementations or the generator would re-run on every keystroke, killing IDE performance.
Network selection matters more than you’d think
Batcher’s odd-even merge sort is elegant and easy to generate algorithmically, but it’s not optimal — it uses more comparators than necessary. For sizes 23+, switching to best-known networks from research databases (the arXiv paper, SorterHunter) saved layers and comparators, which translates directly to fewer SIMD instructions or fewer branches in the scalar path.
SIMD register strategy varies by type width
The generator can’t use a one-size-fits-all SIMD approach. byte fits 64 elements in one Vector512, but double fits only 8 in a Vector512 — meaning 27 doubles need four registers and cross-vector shuffles. The emitters have per-type logic to pick the right register count and shuffle strategy.
ARM64 is not one architecture
The TBL1 optimization story (PR #101) was a reminder that ARM64 performance varies wildly between implementations. Apple Silicon handles TBL4 (4-register table lookup) efficiently, but Ampere Neoverse does not. Without the TBL1 optimization for intra-register shuffles, the SIMD path was actually slower than Array.Sort on Ampere for char.
What’s Next
The source generator is published and usable today. Areas for future work:
- More sizes — The current limit is 64. Larger networks exist but may not be practical for sorting network approaches.
- Descending sort — Currently requires
IComparer<T>overload; a dedicated attribute parameter could emit more optimal code. - Runtime BCL integration — If .NET ever wants built-in sorting network support, the generator’s architecture could inform that design.