coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup


From: Sam Russell
Subject: Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date: Tue, 26 Nov 2024 08:35:51 +0100

> However I don't see any changes in CFLAGS or builtin_cpu_supports() checks
> between the first and this patch. Am I missing something?

CFLAGS stayed the same because the compiler output is fine (my PC here
doesn't have AVX512 but it has a recent gcc that can build AVX512
instructions). It's possible the builtin_cpu_supports checks I used the
first time were correct, but the latest patch is the result of checking
intel manuals and the sanity checks against different servers.

> Also I was wondering how parameterizable the new code is.

After having worked on the coreutils code I can point out a few differences
with the gnulib code:
- Data is read 65536 bytes at a time and the implementation for this is a
while loop - we'd need to change this to a different pattern (like a "get
more data if available" buffer refiller like nginx has) to be able to match
this with gnulib (where the CRC function just takes a char* and any IO is
done outside of the loop). The performance hit on this is low (~1%) for
PCLMUL for folding everything to a 32-bit XOR every 64kB, but for AVX512
the hit is going to be more in the 5-10% range, so I'm wondering whether
gnulib should take a different interface (like a CTX object like openssl)
so that you can hit this over multiple iterations in crc32b_sum_stream().
I'll port this all to gnulib at some point, you just already have PCLMUL in
coreutils so it's a much smaller change.
- The endianness means we insert the prior CRC value at a different offset,
and the AVX logical shift opcodes don't take a var, they take an immediate
so we'd need to use a branch to decide this rather than having another
const for CRC-shift.

You could try to force it to conform to some abstraction but I don't think
it's helpful here, it just makes the code a little harder to follow.

There's potentially an argument for parametrising the AVX/AVX2/AVX512
implementations to just inject different constants, but ultimately I'm not
scared of the extra boilerplate if it means each implementation is easy to
follow when reading it. DRY is a good goal and IMO the value of it is to
show the developer where the logic abstractions are rather than to save any
space in the pre-compiled code.

If we wanted to parametrise the ideal use case would be between CRC-32
(coreutils) and CRC-32C (iSCSI, SCTP, ext4 etc) - here we're changing out
the polynomial (and therefore constants), but the ordering is the same
- QEMU uses the CRC-32C normal polynomial of 0x1EDC6F41 (big endian) so
this could be parametrised against the coreutils 0x04C11DB7 polynomial
- Linux uses the CRC-32C reversed polynomial of 0x82F63B78 (little endian)
and could be parametrised against the gzip 0xEDB88320 reversed polynomial

In practice Linux has crc32_le() and crc32_be() as separate functions and I
also believe it's better to split this way. In other words, the code for
crc32 (coreutils implementation, big endian) and crc32b (coreutils name for
gzip, little endian) should be kept separate, but we could easily support
other polynomials (CRC-32C, CRC-32K, CRC-32K2, CRC-32Q) for either big
endian or small endian.

tl;dr it's possibly parameterizable but I'd recommend against it


On Tue, 26 Nov 2024 at 01:10, Pádraig Brady <P@draigbrady.com> wrote:

> On 25/11/2024 23:27, Sam Russell wrote:
> > The intrinsics guide is a nice find, I dug a bit deeper into the Intel®
> > Architecture Instruction Set Extensions and Future Features Programming
> > Reference [1] from March 2018 and it shows the 4 variants:
> >
> > VEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
> > CPUID feature flag:  VPCLMULQDQ
> >
> > EVEX.NDS.128.66.0F3A.WIG 44 /r /ib VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8
> > CPUID feature flag: AVX512VL, VPCLMULQDQ
> >
> > EVEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
> > CPUID feature flag: AVX512VL, VPCLMULQDQ
> >
> > EVEX.NDS.512.66.0F3A.WIG 44 /r /ib VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8
> > CPUID feature flag: AVX512F, VPCLMULQDQ
> >
> > So the VPCLMULQDQ opcode needs AVX512VL and VPCLMULQDQ to be encoded with
> > the EVEX prefix (and use xmm/ymm), or AVX512F and VPCLMULQDQ to use zmm,
> > but only VPCLMULQDQ to be encoded with the VEX prefix for avx256. The
> build
> > flags for the cksum_avx2 object are `-mpclmul -mavx -mavx2 -mvpclmulqdq`
> so
> > the lack of any avx512 support should ensure it compiles to VEX and not
> > EVEX.
>
> Thanks for all the investigation.
> However I don't see any changes in CFLAGS or builtin_cpu_supports() checks
> between the first and this patch. Am I missing something?
>
> Also I was wondering how parameterizable the new code is.
> I.e. would it be easy to parameterize to support -a crc32b?
>  From my previous notes on the gnulib list I summarized the differences as:
>
> cksum -a crc parameters:
> ------------------------
> Polynomial: 04C11DB7
> Initial Value: 00000000
> Final XOR Value: 00000000
> Reverse data: no
> Reverse crc (before xor): no
>
> cksum -a crc32b (gnulib crc32) equivalent parameters:
> ------------------------
> Polynomial: 04C11DB7
> Initial Value: FFFFFFFF
> Final XOR Value: FFFFFFFF
> Reverse data: yes
> Reverse crc (before xor): yes
>
> cheers,
> Pádraig
>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]