Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes
From: Sam Russell
Subject: Re: [PATCH] cksum: use pclmul instead of slice-by-32 for final bytes
Date: Sun, 24 Nov 2024 15:34:38 +0100

What do you get over 10 iterations? There's a ton of variance, and a proper
benchmarking tool would give a more accurate result. It's not the
order-of-magnitude speedup we saw going from slice-by-8 to pclmul, but I
would still expect it to be faster than the table lookup; perhaps it's a <10%
improvement (only 1/4096 of the calculations are affected, and those are
going to be on the order of <100x faster). There's also value in not needing
to load/generate the lookup table when doing the pclmul version of CRC.
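
To make the 1/4096 estimate concrete, here is a rough sketch using Amdahl's law. It is an illustration under stated assumptions, not a measurement: the function names and the `cost_ratio` parameter are hypothetical, and it assumes the tail costs the same per byte as the main loop unless `cost_ratio` says otherwise.

```python
def time_fraction(tail_bytes: int, block_bytes: int, cost_ratio: float = 1.0) -> float:
    """Share of per-block time spent on the tail.

    cost_ratio is an assumed per-byte cost of the tail path relative to
    the main loop (1.0 means "same cost per byte").
    """
    tail = tail_bytes * cost_ratio
    bulk = block_bytes - tail_bytes
    return tail / (tail + bulk)


def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-run speedup when `fraction` of the time
    becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


BLOCK_BYTES = 64 * 1024  # block size mentioned in the thread
TAIL_BYTES = 16          # typical tail size mentioned in the thread

fraction = time_fraction(TAIL_BYTES, BLOCK_BYTES)  # 16 / 65536 = 1/4096
for local in (10, 100):
    gain = overall_speedup(fraction, local) - 1.0
    print(f"tail {local}x faster -> whole run {gain:.4%} faster")
```

Under these assumptions the whole-run gain stays small even for a 100x faster tail; a larger `cost_ratio` (a slower table path) raises it, which is one reason to measure rather than estimate.
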
On Sun, Nov 24, 2024, 15:17 Pádraig Brady <P@draigbrady.com> wrote:
> On 24/11/2024 11:19, Sam Russell wrote:
> > The current implementation reads 64kB blocks and uses lookup tables for the
> > final 0-31 bytes (normally 16 bytes, meaning 16 lookups). I've replaced
> > this with the smaller folds and Barrett reduction from the Intel paper.
> > Benchmarking is hard as there's a lot of variance, but it appears to give
> > a noticeable improvement for a 4GB ISO (fastest time is 0.215s user
> > compared with fastest 0.451s on an AMD Ryzen 5 5600).
> >
> > Future work is to remove this final reduction from the loop completely as
> > we're reading in multiples of 32 bytes and we can use the 4-fold method
> > exclusively until we get to the end of the file stream.
> >
> > Open to any feedback, especially as I've probably violated the code style
> > somewhere along the line.
> >
> > Copyright: all my own work, and I have completed the GNU copyright
> > paperwork; the algorithm is based on the Intel paper that the rest of the
> > implementation is also based on.
>
>
> I see a slight perf regression on an i7-5600U CPU @ 2.60GHz:
>
> # truncate -s4G file
>
> # time taskset -c 0 chrt -f 99 src/cksum file
> 4215202376 4294967296 file
> real 0m3.023s
> ...
> real 0m3.005s
> ...
> real 0m3.018s
>
>
> $ patch -p1 < ~/0001-cksum-use-pclmul-instead-of-slice-by-32-for-final-by.patch
> $ ./make --opt
>
> # time taskset -c 0 chrt -f 99 src/cksum file
> 4215202376 4294967296 file
> real 0m3.108s
> ...
> real 0m3.092s
> ...
> real 0m3.143s
>
>
> Now that's a small enough regression on older hardware,
> that a 2x improvement on newer hardware is worth doing.
> However it's a bit surprising, and warrants more testing.
>
> cheers,
> Pádraig
>
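
For repeated runs, a small harness along these lines could stand in for a proper benchmarking tool. It is a minimal sketch: `time_command` and `summarize` are hypothetical helper names, it measures wall-clock time only, and the sample values are the three `real` times per build quoted above.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=10):
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report min, median and sample standard deviation of the timings."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# "real" times quoted in the thread (i7-5600U, 4G sparse file).
before = [3.023, 3.005, 3.018]
after = [3.108, 3.092, 3.143]

for label, samples in (("before", before), ("after", after)):
    print(label, summarize(samples))

# Fresh measurements would look like this (paths as used in the thread):
#   summarize(time_command(["src/cksum", "file"], runs=10))
```

Comparing the minimum and the median, rather than a single run, gives a steadier picture when the variance is high.
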