modernizing rsync for the multi-gigabit era
the problem with rsync in 2026
rsync's rolling checksum, a variant of Adler-32, was designed in 1996 for networks where bandwidth was the bottleneck. today, with 10GbE links and NVMe storage, the bottleneck has moved: checksum computation on the CPU is what limits throughput.
the adler-32 ceiling
Adler-32 is fast, but the rolling scan is inherently serial: each window's sums are derived from the previous window's, so a single file scan can't be split across cores without changing the algorithm. on one core, you can checksum about 3GB/s. that's slower than a modern NVMe drive.
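to make the serial dependency concrete, here is a minimal sketch (not rsync's actual code) of an Adler-32-family rolling checksum; the names Rolling, roll, and digest are illustrative:

```rust
// a sketch of an rsync-style rolling checksum from the Adler-32 family,
// showing the serial dependency: each window's state is derived from the
// previous window's state.
struct Rolling {
    a: u32, // sum of bytes in the window
    b: u32, // position-weighted sum of bytes in the window
    len: u32,
}

impl Rolling {
    fn new(window: &[u8]) -> Self {
        let (mut a, mut b) = (0u32, 0u32);
        let n = window.len() as u32;
        for (i, &byte) in window.iter().enumerate() {
            a = a.wrapping_add(byte as u32);
            b = b.wrapping_add((n - i as u32).wrapping_mul(byte as u32));
        }
        Rolling { a, b, len: n }
    }

    // slide the window one byte in O(1) -- but only given the previous
    // state, which is why one scan cannot be parallelized.
    fn roll(&mut self, outgoing: u8, incoming: u8) {
        self.a = self.a.wrapping_sub(outgoing as u32).wrapping_add(incoming as u32);
        self.b = self
            .b
            .wrapping_sub(self.len.wrapping_mul(outgoing as u32))
            .wrapping_add(self.a);
    }

    fn digest(&self) -> u32 {
        ((self.b & 0xffff) << 16) | (self.a & 0xffff)
    }
}
```

rolling one byte forward must produce the same digest as recomputing from scratch over the shifted window; that O(1) update is the whole point of the design, and also its serial trap.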
content-defined chunking
the fix is content-defined chunking (CDC). instead of fixed-size blocks, CDC splits data at boundaries determined by the data itself: a rolling hash is computed over a sliding window, and a boundary is declared wherever the hash matches a bit pattern (for example, its low bits are all zero).
```rust
struct CdcChunker {
    hasher: RollingHash, // rolling hash over a sliding window
    min_size: usize,     // never cut before this many bytes
    max_size: usize,     // force a cut at this many bytes
    mask: u64,           // boundary wherever (hash & mask) == 0
}
```
because boundaries are content-defined, inserting data at the beginning of a file doesn't shift all subsequent chunk boundaries. this dramatically improves deduplication for modified files.
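a sketch of the cut-point search, assuming a Gear-style rolling hash standing in for the RollingHash above (the multiplier constant and the helper name find_boundary are illustrative, not from any real rsync code):

```rust
// find the next content-defined cut point in `data`.
// a boundary is declared where (hash & mask) == 0, subject to min/max size.
fn find_boundary(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> usize {
    if data.len() <= min_size {
        return data.len();
    }
    let end = data.len().min(max_size);
    let mut hash: u64 = 0;
    for i in 0..end {
        // Gear-style update: one multiply-add per byte. a real implementation
        // uses a table of 256 random u64 constants; deriving the constant from
        // the byte value here is illustrative only.
        let g = (data[i] as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15);
        hash = (hash << 1).wrapping_add(g);
        if i >= min_size && (hash & mask) == 0 {
            return i + 1; // cut just after this byte
        }
    }
    end // no boundary found before max_size: force a cut
}
```

because the hash's left-shift ages old bytes out of the state, the cut decision depends only on a small trailing window of data, which is exactly what makes boundaries survive an insertion earlier in the file.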
blake3 for verification
replace MD5 chunk verification with BLAKE3. it's faster than MD5 on modern hardware (with SIMD paths such as AVX2), parallelizable across cores thanks to its tree structure, and, unlike MD5, cryptographically secure — MD5 collisions have been practical for years.