Sorting Networks: From Paper to Performance with AI

April 2026 β€” Jonathan Peppers

Can AI help a developer move faster from research paper to working, optimized code? I set out to answer that question by finding a recent CS paper, implementing its algorithm in C# with GitHub Copilot, and benchmarking it against .NET’s built-in sort. The result: up to 41x faster for specialized sizes β€” built in 3 days with 59 PRs.

GitHub repo: jonathanpeppers/SortingNetworks

Why This Talk

Most developers haven’t seen what AI can do today. The tooling has moved fast β€” it’s worth showing real examples, not just hype. I have access to a lot of tokens, so let me show you what’s possible with a real project, start to finish. My entire workflow has changed β€” I don’t use an IDE anymore. This project was built entirely with AI-assisted tools.

The Premise

  1. Find a Paper β€” Discover a recent CS paper with a new algorithm (November 2025)
  2. Use AI to Implement β€” Use GitHub Copilot to implement the algorithm in C# / .NET
  3. Benchmark It β€” Compare against the existing Base Class Library (BCL) sort

Prior Art: Paper β†’ Implementation

This pattern isn’t new. John Carmack did it in 1993 when building the Doom engine β€” he found an academic paper on Binary Space Partitioning (BSP) trees by Fuchs, Kedem & Naylor (SIGGRAPH 1980) and implemented it to subdivide game levels so the engine only rendered visible surfaces. That breakthrough made Doom’s sprawling levels run in real-time on early ’90s PCs.

The pattern: Find a Paper β†’ Implement the Algorithm β†’ Ship Something Faster

GitHub Copilot Products Used

The Paper

β€œDepth-13 Sorting Networks for 28 Channels” Chengu Wang β€” arXiv:2511.04107 β€” Published November 6, 2025

Key Finding: Improved depth upper bounds for 27 and 28 channel sorting networks from 14 to 13 layers, using reflectional symmetry + SAT solver.

Timeline:

How Sorting Networks Work

A sorting network is a fixed sequence of compare-and-swap operations. Unlike quicksort or mergesort, the sequence of comparisons is determined entirely by the input length β€” not by the data values.

Key properties:

Each compare-and-swap is simple:

if (a > b) {
    swap(a, b)
}

This paper: 28 channels, 13 layers deep, ~185 comparators (previous best: 14 layers).

A Simple Example: Sort 3 Elements

A depth-3 network with 3 comparators β€” the full 27-element version uses the same pattern at scale:

// Sort 3 elements with a sorting network
static void Sort3(ref int e0, ref int e1, ref int e2)
{
    // Layer 1
    if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }

    // Layer 2
    if (e1 > e2) { int t = e1; e1 = e2; e2 = t; }

    // Layer 3
    if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
}

Walkthrough with [5, 1, 3]:

Step Action Result
Layer 1 Compare e0, e1 β†’ swap! [1, 5, 3]
Layer 2 Compare e1, e2 β†’ swap! [1, 3, 5]
Layer 3 Compare e0, e1 β†’ no swap [1, 3, 5] βœ“ Sorted!

Same pattern at scale: 27 elements Γ— 13 layers Γ— ~185 comparators β€” all code-generated.

C# Scalar Implementation

Every compare-and-swap is unrolled into straight-line code β€” no loops, no branches:

// Sort 27 elements with a depth-13 sorting network
int e0 = first;
int e1 = Unsafe.Add(ref first, 1);
// ... load e2 through e26 ...

// Layer 1 comparators:
if (e1  > e26) { int t = e1;  e1  = e26; e26 = t; }
if (e2  > e25) { int t = e2;  e2  = e25; e25 = t; }
if (e3  > e24) { int t = e3;  e3  = e24; e24 = t; }
// ... 13 layers, ~185 total comparators ...

Key optimizations:

First Attempt: The Generic Approach

The first implementation (PR #1) used a generic loop with IComparable<T>.CompareTo() β€” it was slower!

// Generic loop β€” overhead of CompareTo + virtual dispatch
private static void ApplyNetwork<T>(Span<T> span, int[] network)
    where T : IComparable<T> {
  for (int i = 0; i < network.Length; i += 2) {
    ref T a = ref Unsafe.Add(ref first, network[i]);
    if (a.CompareTo(b) > 0) { /* swap */ }
  }
}

What went wrong:

SIMD: Hardware Intrinsics

The real performance breakthrough came from replacing scalar if/swap with vectorized min/max/blend β€” processing an entire layer at once:

  1. Shuffle β€” Rearrange elements to pair up comparators for this layer
  2. Min / Max β€” Compare ALL pairs in the layer simultaneously
  3. Select β€” Pick min or max for each position using a bitmask
// Simplified AVX2 example for byte (all 27 elements in one Vector256)
var shuffled = Vector256.Shuffle(vec, layerPermutation);
var mins = Vector256.Min(vec, shuffled);
var maxs = Vector256.Max(vec, shuffled);
vec = Vector256.ConditionalSelect(layerMask, maxs, mins);

Register packing by type:

Trust, but Verify

AI can write code that looks fast. Only rigorous benchmarking tells you if it actually is.

[MemoryDiagnoser]
[SimpleJob(warmupCount: 5, iterationCount: 15)]
public class IntSortingBenchmarks {
  [Params(27, 28)]
  public int Length { get; set; }

  [Benchmark(Baseline = true)]
  public void ArraySort() => ...;

  [Benchmark]
  public void NetworkSort() => ...;
}

BenchmarkDotNet provided:

Remember PR #1? The generic IComparable loop looked correct but was slower than Array.Sort. BenchmarkDotNet caught this immediately. Without measurement, we would have shipped slower code.

CI: GitHub Actions

Every push and PR runs build, test, and benchmarks across 4 platforms β€” covering x86, ARM64, and all SIMD paths:

Platform Architecture SIMD
ubuntu-latest x64 AVX2 + AVX-512
ubuntu-24.04-arm ARM64 AdvSimd / NEON
windows-latest x64 AVX2
macos-latest ARM64 AdvSimd / NEON

A summarize job merges all benchmark results into the $GITHUB_STEP_SUMMARY β€” viewable directly in the Actions UI. Copilot Code Review runs on every PR.

Results

x86 (Intel Core i9-9900K, AVX2)

27 elements, .NET 10:

Type Array.Sort NetworkSort Speedup
sbyte 1,452 ns 41 ns 35x
byte 1,301 ns 39 ns 33x
float 1,636 ns 76 ns 22x
double 1,648 ns 93 ns 18x
short 1,402 ns 101 ns 14x
long 1,404 ns 104 ns 14x
uint 107 ns 54 ns 2.0x
string 979 ns 527 ns 1.9x
int 101 ns 57 ns 1.8x
ulong 118 ns 100 ns ~1.2x
char 93 ns 97 ns ~1x

char and ulong: .NET 10 already has SIMD-optimized Array.Sort for these types.

ARM64 (Apple M1, AdvSimd / NEON)

27 elements, .NET 10:

Type Array.Sort NetworkSort Speedup
byte 1,200 ns 29 ns 41x
sbyte 1,230 ns 30 ns 41x
short 1,264 ns 54 ns 23x
ushort 1,254 ns 54 ns 23x
float 1,363 ns 78 ns 17x
double 1,523 ns 111 ns 14x
long 1,254 ns 106 ns 12x
char 98 ns 50 ns 2.0x
uint 108 ns 75 ns 1.4x
int 98 ns 78 ns 1.3x

41x faster for byte on Apple M1 with NEON intrinsics.

The Journey: 59 PRs in 3 Days

Day 1 β€” Foundation

Day 1 β€” Optimization

Day 2+ β€” SIMD Everywhere

All code-generated by scripts/generate_unrolled.cs β€” change the generator, regenerate everything.

Conclusions

What’s Next?

Could this go in the BCL? Sorting networks need a custom network per array length β€” you’d need networks for every size, not just 27/28. A C# Source Generator inside a NuGet package may be a better fit for this narrower use case.