coreutils

Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes


From: Pádraig Brady
Subject: Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes
Date: Sun, 24 Nov 2024 14:17:50 +0000
User-agent: Mozilla Thunderbird Beta

On 24/11/2024 11:19, Sam Russell wrote:
> The current implementation reads 64kB blocks and uses lookup tables for the
> final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced
> this with the smaller folds and Barrett reduction from the Intel paper.
> Benchmarking is hard as there's a lot of variance, but it appears to give
> a noticeable improvement for a 4GB ISO (fastest user time is 0.215s with
> the patch, compared with 0.451s without, on an AMD Ryzen 5 5600).
>
> Future work is to remove this final reduction from the loop completely, as
> we're reading in multiples of 32 bytes and can use the 4-fold method
> exclusively until we get to the end of the file stream.
>
> Open to any feedback, especially as I've probably violated the code style
> somewhere along the line.
>
> Copyright: this is all my own work and I have completed the GNU copyright
> paperwork. The algorithm is based on the Intel paper that the rest of the
> implementation is also based on.
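
For anyone following along who hasn't met the term, here is a minimal Python sketch of Barrett reduction over GF(2), the final step the description above refers to. It is not the patch's code: the real thing uses PCLMUL intrinsics in C, and `clmul`, `poly_divmod` and `barrett_reduce` are hypothetical helper names. The generator 0x104C11DB7 is the standard CRC-32 polynomial used by POSIX cksum, which is an assumption here rather than something quoted from the thread.

```python
# Illustrative only: plain-integer model of what PCLMUL does in hardware.
POLY = 0x104C11DB7  # CRC-32 generator including the x^32 term (assumed)


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication of a and b."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_divmod(a: int, p: int) -> tuple[int, int]:
    """Schoolbook GF(2) polynomial division: returns (quotient, remainder)."""
    quotient = 0
    p_bits = p.bit_length()
    while a.bit_length() >= p_bits:
        shift = a.bit_length() - p_bits
        quotient ^= 1 << shift
        a ^= p << shift
    return quotient, a


# Precomputed once: mu = floor(x^64 / P(x)).
MU, _ = poly_divmod(1 << 64, POLY)


def barrett_reduce(a: int) -> int:
    """Reduce a polynomial of degree < 64 modulo POLY with two multiplies."""
    assert 0 <= a < (1 << 64)
    t1 = clmul(a >> 32, MU)      # estimate of the quotient, scaled by x^32
    t2 = clmul(t1 >> 32, POLY)   # quotient * P(x)
    return (a ^ t2) & 0xFFFFFFFF  # the high 32 bits cancel, low 32 remain
```

The point of the trick is that the division by P(x) is replaced by two carry-less multiplications with constants, which is why it maps well onto PCLMUL and needs no table lookups.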


I see a slight perf regression on an i7-5600U CPU @ 2.60GHz:

# truncate -s4G file

# time taskset -c 0 chrt -f 99 src/cksum file
4215202376 4294967296 file
real    0m3.023s
...
real    0m3.005s
...
real    0m3.018s


$ patch -p1 < ~/0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch
$ ./make --opt

# time taskset -c 0 chrt -f 99 src/cksum file
4215202376 4294967296 file
real    0m3.108s
...
real    0m3.092s
...
real    0m3.143s
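
To put a number on the difference, the timings above can be compared with a few lines of Python. This is only a rough sketch: three runs per build is a small sample, and the variable and function names are made up for illustration.

```python
from statistics import mean

# Wall-clock "real" times in seconds, copied from the runs above.
before = [3.023, 3.005, 3.018]  # unpatched build
after = [3.108, 3.092, 3.143]   # patched build


def relative_change(base: float, new: float) -> float:
    """Percentage change from base to new (positive means slower)."""
    return (new - base) / base * 100.0


mean_change = relative_change(mean(before), mean(after))
best_change = relative_change(min(before), min(after))
print(f"mean: {mean_change:+.1f}%  best-of-3: {best_change:+.1f}%")
```

Both the mean and the best-of-3 comparison come out at roughly a 3% slowdown.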


Now that's a small enough regression on older hardware (about 3% here)
that a 2x improvement on newer hardware is worth doing.
However, it's a bit surprising and warrants more testing.

cheers,
Pádraig


