I've ported the PCLMUL implementation to ARMv8. It looks to be a roughly
80% reduction in user CPU time on an EC2 T4g instance:
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m2.044s
sys 0m1.691s
# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.217s
user 0m2.022s
sys 0m1.770s
# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.135s
user 0m0.353s
sys 0m1.819s
# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m0.346s
sys 0m1.819s
Benchmark scripts (I used the crc_sum_stream() function so the hash output
is different, but I have verified it against the pclmul script functions
locally):
$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536
real 0m19.300s
user 0m19.299s
sys 0m0.001s
$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536
real 0m5.001s
user 0m4.997s
sys 0m0.003s
For hash validation:
$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576
real 0m7.540s
user 0m7.538s
sys 0m0.001s
$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576
real 0m3.018s
user 0m3.018s
sys 0m0.000s
-O3 does most of the optimisation work for us. There may be more savings
to find, but this is already a good improvement.
Some questions:
- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the
hwcaps interface seems to be the way to test this [1] [2]
- ARM is a much more diverse ecosystem than x86_64, so it's possible that
some platforms (e.g. phones) would see a slowdown. Is this something we
want to give maintainers a flag to disable?
- ARMv8 also has a CRC32() opcode. A quick test showed it wasn't super
efficient, but it's possible that interleaving it with the folding
approach might add extra speedups. This is an exercise for the reader.
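
To make the first two questions concrete, here is a minimal sketch of what a
runtime check through the hwcaps interface could look like on Linux. It is
not the code in the patch. The function name cksum_pmull_supported() and the
CKSUM_DISABLE_PMULL macro are hypothetical names chosen for illustration;
getauxval(), AT_HWCAP and HWCAP_PMULL are the existing glibc/kernel
interfaces. On anything other than aarch64 Linux the sketch simply reports
no support.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
# include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
# include <asm/hwcap.h>  /* HWCAP_PMULL */
#endif

/* Return true if the 64-bit polynomial multiply (pmull) instructions
   can be used.  CKSUM_DISABLE_PMULL is a hypothetical build-time
   opt-out, e.g. for platforms where the folded path is slower.  */
bool
cksum_pmull_supported (void)
{
#if defined(CKSUM_DISABLE_PMULL)
  /* The packager asked for the generic implementation.  */
  return false;
#elif defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
  /* The kernel reports CPU features as a bitmask in the auxiliary
     vector; the "pmull" flag shown by lscpu maps to HWCAP_PMULL.  */
  return (getauxval (AT_HWCAP) & HWCAP_PMULL) != 0;
#else
  /* Other OS or architecture: assume no support.  */
  return false;
#endif
}
```

A caller would run this check once and fall back to the existing
table-driven loop when it returns false. Building with
-DCKSUM_DISABLE_PMULL forces that fallback regardless of what the CPU
reports.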