Re: [PATCH] cksum: Use ARMv8 SIMD extensions


From: Pádraig Brady
Subject: Re: [PATCH] cksum: Use ARMv8 SIMD extensions
Date: Thu, 28 Nov 2024 22:10:35 +0000
User-agent: Mozilla Thunderbird Beta

On 28/11/2024 19:59, Sam Russell wrote:
I've ported the PCLMUL code to support ARMv8; it looks to be an 80% reduction
in user time compared with the current code on an EC2 T4g instance:

$ lscpu
Architecture:             aarch64
   CPU op-mode(s):         32-bit, 64-bit
   Byte Order:             Little Endian
CPU(s):                   2
   On-line CPU(s) list:    0,1
Vendor ID:                ARM
   Model name:             Neoverse-N1
     Model:                1
     Thread(s) per core:   1
     Core(s) per socket:   2
     Socket(s):            1
     Stepping:             r3p1
     BogoMIPS:             243.75
     Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.136s
user    0m2.044s
sys     0m1.691s

# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.217s
user    0m2.022s
sys     0m1.770s

# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.135s
user    0m0.353s
sys     0m1.819s

# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.136s
user    0m0.346s
sys     0m1.819s
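
As a rough illustration of what the pmull and pmull2 opcodes compute, here is a portable C sketch of a 64x64 -> 128-bit carry-less (polynomial) multiply. PMULL multiplies the low 64-bit lanes of its two source registers and PMULL2 the high lanes; each yields a 128-bit product like the one below. The names clmul128 and clmul64 are hypothetical and are not taken from the patch, which would use the hardware instructions rather than a loop.

```c
#include <stdint.h>

/* Hypothetical result type: the 128-bit product as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} clmul128;

/* Carry-less multiply: like schoolbook multiplication, but partial
   products are combined with XOR instead of addition, so no carries
   propagate between bit positions. */
clmul128 clmul64(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i != 0)                /* avoid an undefined 64-bit shift */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

For example, clmul64(3, 3) gives 5 rather than 9, because (x + 1)^2 = x^2 + 1 when coefficients are taken modulo 2.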

Benchmark scripts (I used the crc_sum_stream() function, so the hash output
is different, but I have verified it against the pclmul functions locally):

$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536

real    0m19.300s
user    0m19.299s
sys     0m0.001s

$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536

real    0m5.001s
user    0m4.997s
sys     0m0.003s

For hash validation:

$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576

real    0m7.540s
user    0m7.538s
sys     0m0.001s

$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576

real    0m3.018s
user    0m3.018s
sys     0m0.000s
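
For an independent cross-check of any accelerated path, a bit-at-a-time reference of the CRC that cksum computes can be handy. The sketch below assumes the POSIX definition: polynomial 0x04C11DB7, most significant bit first, initial value 0, the length bytes appended least significant first, and a final complement. The function names are hypothetical, and this does not reproduce the benchmark's crc_sum_stream() output shown above.

```c
#include <stddef.h>
#include <stdint.h>

/* Feed len bytes into an MSB-first CRC-32 with polynomial 0x04C11DB7. */
uint32_t crc32_msb_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : crc << 1;
    }
    return crc;
}

/* POSIX cksum value: CRC of the data, then of the length encoded as
   the minimal number of bytes (least significant byte first), then
   complemented. */
uint32_t posix_cksum(const unsigned char *buf, size_t len)
{
    uint32_t crc = crc32_msb_update(0, buf, len);

    for (size_t n = len; n != 0; n >>= 8) {
        unsigned char b = (unsigned char)(n & 0xFF);
        crc = crc32_msb_update(crc, &b, 1);
    }
    return ~crc;
}
```

For empty input this returns 4294967295, matching what cksum prints for /dev/null. It is far too slow for real use, but it is simple enough to trust as a reference.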

-O3 does most of the optimisation work for us. There may be more savings to
be had, but this is still a good improvement.

Some questions:
- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the
hwcaps interface seems to be the way to test this [1] [2].
- ARM is a much more diverse ecosystem than x86_64, so it's possible that some
platforms (e.g. phones) would see a slowdown. Is this something we want to
give maintainers a flag to disable?
- ARMv8 also has a CRC32 opcode. A quick test showed it wasn't super
efficient, but it's possible that interleaving it with the folding
approach might add extra speedups. This is an exercise for the reader.
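
On the first question, a minimal sketch of a runtime check through the Linux hwcaps interface might look like the following. It assumes Linux on aarch64, where getauxval(AT_HWCAP) and the HWCAP_PMULL bit are available; on any other platform it simply reports no support. The helper name pmull_supported is hypothetical and is not taken from the patch.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
# include <sys/auxv.h>   /* getauxval, AT_HWCAP */
# include <asm/hwcap.h>  /* HWCAP_PMULL */
#endif

/* Hypothetical helper: true if the kernel reports the PMULL feature
   (the "pmull" flag in the lscpu output). */
bool pmull_supported(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
    return false;   /* assumption: treat unknown platforms as unsupported */
#endif
}
```

A caller could evaluate this once at startup and fall back to the existing table-driven code when it returns false.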

Cool. I'll try this out on some of the arm64 machines at:
https://portal.cfarm.net/machines/list/

Note builders can disable this already with:
./configure utils_cv_vmull_intrinsic_exists=no

thanks!
Pádraig


