Sorting Networks: From Paper to Performance with AI
April 2026 β Jonathan Peppers
Can AI help a developer move faster from research paper to working, optimized code? I set out to answer that question by finding a recent CS paper, implementing its algorithm in C# with GitHub Copilot, and benchmarking it against .NETβs built-in sort. The result: up to 41x faster for specialized sizes β built in 3 days with 59 PRs.
GitHub repo: jonathanpeppers/SortingNetworks
Why This Talk
Most developers havenβt seen what AI can do today. The tooling has moved fast β itβs worth showing real examples, not just hype. I have access to a lot of tokens, so let me show you whatβs possible with a real project, start to finish. My entire workflow has changed β I donβt use an IDE anymore. This project was built entirely with AI-assisted tools.
The Premise
- Find a Paper β Discover a recent CS paper with a new algorithm (November 2025)
- Use AI to Implement β Use GitHub Copilot to implement the algorithm in C# / .NET
- Benchmark It β Compare against the existing Base Class Library (BCL) sort
Prior Art: Paper β Implementation
This pattern isnβt new. John Carmack did it in 1993 when building the Doom engine β he found an academic paper on Binary Space Partitioning (BSP) trees by Fuchs, Kedem & Naylor (SIGGRAPH 1980) and implemented it to subdivide game levels so the engine only rendered visible surfaces. That breakthrough made Doomβs sprawling levels run in real-time on early β90s PCs.
The pattern: Find a Paper β Implement the Algorithm β Ship Something Faster
- Carmack (1993): BSP paper (1980) β Doom engine β real-time 3D on consumer PCs
- This project (2026): Sorting network paper (2025) β SIMD sort β up to 41x faster
GitHub Copilot Products Used
- Copilot in VS Code β AI-powered code completion and chat directly in the editor
- Copilot CLI β AI assistant in the terminal for code generation and refactoring
- Copilot Code Review β AI-powered pull request reviews that catch bugs and suggest improvements
- Copilot Coding Agent β Autonomous agent that implements features from GitHub Issues
The Paper
βDepth-13 Sorting Networks for 28 Channelsβ Chengu Wang β arXiv:2511.04107 β Published November 6, 2025
Key Finding: Improved depth upper bounds for 27 and 28 channel sorting networks from 14 to 13 layers, using reflectional symmetry + SAT solver.
Timeline:
- Nov 2025 β Paper published
- Apr 17, 2026 β Project started
- Apr 18, 2026 β SIMD optimized
How Sorting Networks Work
A sorting network is a fixed sequence of compare-and-swap operations. Unlike quicksort or mergesort, the sequence of comparisons is determined entirely by the input length β not by the data values.
Key properties:
- Data-oblivious β Comparisons depend on size, not values
- Branch-free β Eliminates input-dependent control flow
- Parallelizable β Independent pairs in the same layer run at once
- Fixed depth β Fewer layers = lower latency
Each compare-and-swap is simple:
if (a > b) {
swap(a, b)
}
This paper: 28 channels, 13 layers deep, ~185 comparators (previous best: 14 layers).
A Simple Example: Sort 3 Elements
A depth-3 network with 3 comparators β the full 27-element version uses the same pattern at scale:
// Sort 3 elements with a sorting network
static void Sort3(ref int e0, ref int e1, ref int e2)
{
// Layer 1
if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
// Layer 2
if (e1 > e2) { int t = e1; e1 = e2; e2 = t; }
// Layer 3
if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
}
Walkthrough with [5, 1, 3]:
| Step | Action | Result |
|---|---|---|
| Layer 1 | Compare e0, e1 β swap! | [1, 5, 3] |
| Layer 2 | Compare e1, e2 β swap! | [1, 3, 5] |
| Layer 3 | Compare e0, e1 β no swap | [1, 3, 5] β Sorted! |
Same pattern at scale: 27 elements Γ 13 layers Γ ~185 comparators β all code-generated.
C# Scalar Implementation
Every compare-and-swap is unrolled into straight-line code β no loops, no branches:
// Sort 27 elements with a depth-13 sorting network
int e0 = first;
int e1 = Unsafe.Add(ref first, 1);
// ... load e2 through e26 ...
// Layer 1 comparators:
if (e1 > e26) { int t = e1; e1 = e26; e26 = t; }
if (e2 > e25) { int t = e2; e2 = e25; e25 = t; }
if (e3 > e24) { int t = e3; e3 = e24; e24 = t; }
// ... 13 layers, ~185 total comparators ...
Key optimizations:
- No bounds checks β
Unsafe.Addskips array bounds validation - Type-specific
>β NoIComparablevirtual dispatch - Code-generated β
generate_unrolled.csproduces all the methods
First Attempt: The Generic Approach
The first implementation (PR #1) used a generic loop with IComparable<T>.CompareTo() β it was slower!
// Generic loop β overhead of CompareTo + virtual dispatch
private static void ApplyNetwork<T>(Span<T> span, int[] network)
where T : IComparable<T> {
for (int i = 0; i < network.Length; i += 2) {
ref T a = ref Unsafe.Add(ref first, network[i]);
if (a.CompareTo(b) > 0) { /* swap */ }
}
}
What went wrong:
IComparable<T>.CompareTo()β virtual dispatch on every comparison- Loop + array indexing β
network[i]indirection on every step - Generic constraints β prevents JIT from using native
>operator
SIMD: Hardware Intrinsics
The real performance breakthrough came from replacing scalar if/swap with vectorized min/max/blend β processing an entire layer at once:
- Shuffle β Rearrange elements to pair up comparators for this layer
- Min / Max β Compare ALL pairs in the layer simultaneously
- Select β Pick min or max for each position using a bitmask
// Simplified AVX2 example for byte (all 27 elements in one Vector256)
var shuffled = Vector256.Shuffle(vec, layerPermutation);
var mins = Vector256.Min(vec, shuffled);
var maxs = Vector256.Max(vec, shuffled);
vec = Vector256.ConditionalSelect(layerMask, maxs, mins);
Register packing by type:
- byte: All 27 in one
Vector256 - int: 4Γ
Vector256(8 each) - double: 7Γ
Vector256(4 each)
Trust, but Verify
AI can write code that looks fast. Only rigorous benchmarking tells you if it actually is.
[MemoryDiagnoser]
[SimpleJob(warmupCount: 5, iterationCount: 15)]
public class IntSortingBenchmarks {
[Params(27, 28)]
public int Length { get; set; }
[Benchmark(Baseline = true)]
public void ArraySort() => ...;
[Benchmark]
public void NetworkSort() => ...;
}
BenchmarkDotNet provided:
- Statistical rigor β Mean, error, StdDev, ratios with confidence
- Memory diagnostics β
[MemoryDiagnoser]confirms zero allocations - Batched ops β 1000 ops/invoke to reduce timer overhead
Remember PR #1? The generic IComparable loop looked correct but was slower than Array.Sort. BenchmarkDotNet caught this immediately. Without measurement, we would have shipped slower code.
CI: GitHub Actions
Every push and PR runs build, test, and benchmarks across 4 platforms β covering x86, ARM64, and all SIMD paths:
| Platform | Architecture | SIMD |
|---|---|---|
| ubuntu-latest | x64 | AVX2 + AVX-512 |
| ubuntu-24.04-arm | ARM64 | AdvSimd / NEON |
| windows-latest | x64 | AVX2 |
| macos-latest | ARM64 | AdvSimd / NEON |
A summarize job merges all benchmark results into the $GITHUB_STEP_SUMMARY β viewable directly in the Actions UI. Copilot Code Review runs on every PR.
Results
x86 (Intel Core i9-9900K, AVX2)
27 elements, .NET 10:
| Type | Array.Sort | NetworkSort | Speedup |
|---|---|---|---|
| sbyte | 1,452 ns | 41 ns | 35x |
| byte | 1,301 ns | 39 ns | 33x |
| float | 1,636 ns | 76 ns | 22x |
| double | 1,648 ns | 93 ns | 18x |
| short | 1,402 ns | 101 ns | 14x |
| long | 1,404 ns | 104 ns | 14x |
| uint | 107 ns | 54 ns | 2.0x |
| string | 979 ns | 527 ns | 1.9x |
| int | 101 ns | 57 ns | 1.8x |
| ulong | 118 ns | 100 ns | ~1.2x |
| char | 93 ns | 97 ns | ~1x |
charandulong: .NET 10 already has SIMD-optimizedArray.Sortfor these types.
ARM64 (Apple M1, AdvSimd / NEON)
27 elements, .NET 10:
| Type | Array.Sort | NetworkSort | Speedup |
|---|---|---|---|
| byte | 1,200 ns | 29 ns | 41x |
| sbyte | 1,230 ns | 30 ns | 41x |
| short | 1,264 ns | 54 ns | 23x |
| ushort | 1,254 ns | 54 ns | 23x |
| float | 1,363 ns | 78 ns | 17x |
| double | 1,523 ns | 111 ns | 14x |
| long | 1,254 ns | 106 ns | 12x |
| char | 98 ns | 50 ns | 2.0x |
| uint | 108 ns | 75 ns | 1.4x |
| int | 98 ns | 78 ns | 1.3x |
41x faster for byte on Apple M1 with NEON intrinsics.
The Journey: 59 PRs in 3 Days
Day 1 β Foundation
- PR #1: Generic sorting network (slower than BCL!)
- PR #2: Code-gen unrolled methods
- PR #3: Value locals for register allocation
- PR #5: Int-specific, focus on 27/28 only
Day 1 β Optimization
- PR #8: Specialize for all 13 primitive types
- PR #12: String overload with ordinal comparison
- PR #13: Port generator from Python to C#
- PR #14: AVX2 SIMD for byte/sbyte (33x!)
Day 2+ β SIMD Everywhere
- PR #17: ARM64 AdvSimd/NEON
- PR #20: AVX-512 for 16-bit types
- PR #35-39: AVX2/AVX-512 for int, float, double
- PR #58: ARM64 TBL1 + early-exit optimizations
All code-generated by scripts/generate_unrolled.cs β change the generator, regenerate everything.
Conclusions
- AI accelerates the research-to-code pipeline β From reading a paper to a working, optimized library in days, not weeks
- Copilot handled the tedious parts β Code generation scripts, test coverage, SIMD boilerplate, CI/CD setup
- Domain knowledge still matters β Understanding SIMD intrinsics, .NET JIT behavior, and benchmark methodology was essential
- The results speak for themselves β Up to 41x faster than
Array.Sortfor specialized sizes
Whatβs Next?
Could this go in the BCL? Sorting networks need a custom network per array length β youβd need networks for every size, not just 27/28. A C# Source Generator inside a NuGet package may be a better fit for this narrower use case.