Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes
From: Pádraig Brady
Subject: Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes
Date: Sun, 24 Nov 2024 14:17:50 +0000
User-agent: Mozilla Thunderbird Beta
On 24/11/2024 11:19, Sam Russell wrote:
The current implementation reads 64kB blocks and uses lookup tables for the
final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced
this with the smaller folds and Barrett reduction from the Intel paper.
Benchmarking is hard as there's a lot of variance, but it appears to give
a noticeable improvement for a 4GB ISO (fastest time is 0.215s user,
compared with the previous fastest of 0.451s, on an AMD Ryzen 5 5600).
Future work is to remove this final reduction from the loop completely as
we're reading in multiples of 32 bytes and we can use the 4-fold method
exclusively until we get to the end of the file stream.
Open to any feedback, especially as I've probably violated the code style
somewhere along the line.
Copyright: this is all my own work and I have completed the GNU copyright
paperwork. The algorithm is based on the Intel paper that the rest of the
implementation is also based on.
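For readers who want a baseline to check an optimised tail path against, here is a minimal sketch of the POSIX cksum CRC using one table lookup per byte. It is not the coreutils code and not the patch: the function names (`cksum_init_table`, `cksum_update`, `cksum_final`) are hypothetical, and it uses a single 256-entry table rather than the sliced tables or the pclmul folds discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Generator polynomial used by POSIX cksum (MSB-first, no reflection). */
#define CKSUM_POLY 0x04C11DB7u

static uint32_t crc_table[256];

/* Fill the 256-entry table: entry i is the remainder for the single byte i. */
static void cksum_init_table(void)
{
  for (uint32_t i = 0; i < 256; i++)
    {
      uint32_t r = i << 24;
      for (int bit = 0; bit < 8; bit++)
        r = (r & 0x80000000u) ? (r << 1) ^ CKSUM_POLY : r << 1;
      crc_table[i] = r;
    }
}

/* Process a buffer with one table lookup per input byte. */
static uint32_t cksum_update(uint32_t crc, const unsigned char *buf, size_t len)
{
  while (len--)
    crc = (crc << 8) ^ crc_table[((crc >> 24) ^ *buf++) & 0xFF];
  return crc;
}

/* POSIX feeds in the total length, least significant byte first,
   then complements the result. */
static uint32_t cksum_final(uint32_t crc, uint64_t total_len)
{
  for (; total_len != 0; total_len >>= 8)
    crc = (crc << 8) ^ crc_table[((crc >> 24) ^ total_len) & 0xFF];
  return ~crc;
}
```

Call `cksum_init_table()` once, start with a CRC of 0, pass each chunk through `cksum_update()`, and finish with `cksum_final()` and the total byte count. Comparing its output against the optimised path on buffers whose length is not a multiple of 32 is one way to exercise the final-bytes code.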
I see a slight perf regression on an i7-5600U CPU @ 2.60GHz:
# truncate -s4G file
# time taskset -c 0 chrt -f 99 src/cksum file
4215202376 4294967296 file
real 0m3.023s
...
real 0m3.005s
...
real 0m3.018s
$ patch -p1 < ~/0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch
$ ./make --opt
# time taskset -c 0 chrt -f 99 src/cksum file
4215202376 4294967296 file
real 0m3.108s
...
real 0m3.092s
...
real 0m3.143s
That's a small enough regression on older hardware that a 2x improvement
on newer hardware makes the change worth doing.
However, the regression is a bit surprising and warrants more testing.
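To put a number on "slight", the "real" times above can be reduced to a mean and a relative slowdown. This is only a rough sketch: the helper names (`mean_seconds`, `slowdown_percent`) are made up for illustration, and three samples per build are too few to say much about variance.

```c
#include <stddef.h>

/* Arithmetic mean of n wall-clock samples, in seconds. */
static double mean_seconds(const double *t, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += t[i];
  return sum / (double) n;
}

/* Slowdown of `after` relative to `before`, as a percentage. */
static double slowdown_percent(double before, double after)
{
  return (after / before - 1.0) * 100.0;
}

/* "real" times from the runs quoted above, in seconds. */
static const double before_patch[] = { 3.023, 3.005, 3.018 };
static const double after_patch[]  = { 3.108, 3.092, 3.143 };
```

With these samples the mean goes from about 3.015s to about 3.114s, so `slowdown_percent(mean_seconds(before_patch, 3), mean_seconds(after_patch, 3))` comes out at roughly 3.3%.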
cheers,
Pádraig