coreutils

Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup


From: Sam Russell
Subject: Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date: Tue, 26 Nov 2024 00:27:38 +0100

The intrinsics guide is a nice find. I dug a bit deeper into the Intel®
Architecture Instruction Set Extensions and Future Features Programming
Reference [1] from March 2018, which lists four variants:

VEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag:  VPCLMULQDQ

EVEX.NDS.128.66.0F3A.WIG 44 /r /ib VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8
CPUID feature flag: AVX512VL, VPCLMULQDQ

EVEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag: AVX512VL, VPCLMULQDQ

EVEX.NDS.512.66.0F3A.WIG 44 /r /ib VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8
CPUID feature flag: AVX512F, VPCLMULQDQ

So the VPCLMULQDQ opcode needs AVX512VL and VPCLMULQDQ when it is encoded
with the EVEX prefix on xmm/ymm registers, or AVX512F and VPCLMULQDQ when it
uses zmm registers, but only VPCLMULQDQ when it is encoded with the VEX
prefix on 256-bit ymm registers. The build flags for the cksum_avx2 object
are `-mpclmul -mavx -mavx2 -mvpclmulqdq`, so with no AVX512 flags enabled
the compiler should emit the VEX encoding and not EVEX.
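
To make the table above easier to check at a glance, here is a minimal
sketch that encodes the four listed variants as a lookup function. All of
the names (vpclmulqdq_required, ENC_VEX, FEAT_*) are made up for
illustration and do not come from the patch; a return value of 0 only means
"not one of the four forms listed in [1]".

```c
/* Hypothetical names, for illustration only (not from the patch).  */
enum vpclmul_encoding { ENC_VEX, ENC_EVEX };

enum
{
  FEAT_VPCLMULQDQ = 1 << 0,
  FEAT_AVX512VL   = 1 << 1,
  FEAT_AVX512F    = 1 << 2
};

/* Return the CPUID feature bits that the reference [1] lists for the
   given encoding and vector width, or 0 if that combination is not one
   of the four VPCLMULQDQ variants quoted above.  */
unsigned int
vpclmulqdq_required (enum vpclmul_encoding enc, int width_bits)
{
  if (enc == ENC_VEX)
    return width_bits == 256 ? FEAT_VPCLMULQDQ : 0;

  switch (width_bits)
    {
    case 128:
    case 256:
      return FEAT_AVX512VL | FEAT_VPCLMULQDQ;
    case 512:
      return FEAT_AVX512F | FEAT_VPCLMULQDQ;
    default:
      return 0;
    }
}
```

For example, vpclmulqdq_required (ENC_VEX, 256) yields only
FEAT_VPCLMULQDQ, which is the case the cksum_avx2 object relies on.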

I did some more tests on some EC2 instances. The T2.micro (Broadwell) does this:

$ ./cksum_bench_avx2 1024 1024
Hash: 6431C527, length: 1024
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   1
  On-line CPU(s) list:    0
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    CPU family:           6
    Model:                79
    Thread(s) per core:   1
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             1
    BogoMIPS:             4599.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
                          pge mca cmov pat pse36 clflush mmx fxsr sse sse2
                          ht syscall nx rdtscp lm constant_tsc rep_good nopl
                          xtopology cpuid tsc_known_freq pni pclmulqdq ssse3
                          fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                          tsc_deadline_timer aes xsave avx f16c rdrand
                          hypervisor lahf_lm abm pti fsgsbase bmi1 avx2 smep
                          bmi2 erms invpcid xsaveopt

The T3.micro (Skylake) does this:

$ ./cksum_bench_avx2 1024 1024
Illegal instruction (core dumped)
$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             4
    BogoMIPS:             4999.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
                          pge mca cmov pat pse36 clflush mmx fxsr sse sse2
                          ss ht syscall nx pdpe1gb rdtscp lm constant_tsc
                          rep_good nopl xtopology nonstop_tsc cpuid
                          tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid
                          sse4_1 sse4_2 x2apic movbe popcnt
                          tsc_deadline_timer aes xsave avx f16c rdrand
                          hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase
                          tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx
                          avx512f avx512dq rdseed adx smap clflushopt clwb
                          avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
                          xsaves ida arat pku ospke

So despite the Skylake being a "better" processor and having nearly all the
AVX512 extensions, it doesn't have the vpclmulqdq flag set and therefore
can't execute the VEX-encoded 256-bit vpclmulqdq opcode. The Broadwell, on
the other hand, does handle the vpclmulqdq opcode and works, although it
doesn't have vpclmulqdq set either, so in theory we shouldn't have tried it
there and should instead have fallen back on the AVX pclmul path for safety.
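
One thing to watch when reading the flag lists above: "pclmulqdq" is a
substring of "vpclmulqdq", so a plain substring search can report the wrong
answer. The following is a small sketch of a whole-token match over a
whitespace-separated flags string. The function name has_cpu_flag is
hypothetical, and it assumes the flags have already been joined into one
clean string (the wrapped lscpu output in an email can split tokens across
lines).

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if NAME appears in FLAGS as a complete whitespace-delimited
   token.  has_cpu_flag is an illustrative helper, not part of coreutils.  */
bool
has_cpu_flag (const char *flags, const char *name)
{
  size_t len = strlen (name);
  if (len == 0)
    return false;

  for (const char *p = flags; (p = strstr (p, name)) != NULL; p++)
    {
      /* The match must start at the beginning or after whitespace...  */
      bool starts = p == flags || isspace ((unsigned char) p[-1]);
      /* ...and must end at the end of the string or before whitespace.  */
      bool ends = p[len] == '\0' || isspace ((unsigned char) p[len]);
      if (starts && ends)
        return true;
    }
  return false;
}
```

With the Skylake flags above, has_cpu_flag (flags, "pclmulqdq") is true
while has_cpu_flag (flags, "vpclmulqdq") is false, matching the observed
illegal instruction.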

So the checks I propose are:

avx512:
VPCLMULQDQ and AVX512F (as per [1]) and AVX512BW (needed for the byteswap
operation)

avx2:
VPCLMULQDQ (we compile this with the VEX encoding afaik, so the AVX512VL
flag shouldn't be necessary on AVX512-capable processors, and this path is
targeted at AVX2-capable processors) and AVX2 (to confirm the rest of the
opcodes are there)

pclmul:
PCLMUL and AVX (existing check)
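
As a rough sketch of how these three checks could be ordered at runtime
(not the code in the attached patch), the selection can be written as a
pure function over a set of feature booleans. The struct, enum and function
names below are hypothetical. The detection helper assumes GCC or Clang on
x86, where __builtin_cpu_supports accepts these feature names, and is
compiled out elsewhere.

```c
#include <stdbool.h>

/* Illustrative types; the patch itself may structure this differently.  */
struct cpu_features
{
  bool pclmul, avx, avx2, avx512f, avx512bw, vpclmulqdq;
};

enum cksum_impl { CKSUM_GENERIC, CKSUM_PCLMUL, CKSUM_AVX2, CKSUM_AVX512 };

/* Apply the proposed checks, widest implementation first.  */
enum cksum_impl
select_cksum_impl (const struct cpu_features *f)
{
  if (f->vpclmulqdq && f->avx512f && f->avx512bw)
    return CKSUM_AVX512;
  if (f->vpclmulqdq && f->avx2)
    return CKSUM_AVX2;
  if (f->pclmul && f->avx)
    return CKSUM_PCLMUL;
  return CKSUM_GENERIC;
}

#if defined __GNUC__ && (defined __x86_64__ || defined __i386__)
/* Fill in the feature set from the running CPU (GCC/Clang on x86 only).  */
struct cpu_features
detect_cpu_features (void)
{
  struct cpu_features f;
  __builtin_cpu_init ();
  f.pclmul = __builtin_cpu_supports ("pclmul") != 0;
  f.avx = __builtin_cpu_supports ("avx") != 0;
  f.avx2 = __builtin_cpu_supports ("avx2") != 0;
  f.avx512f = __builtin_cpu_supports ("avx512f") != 0;
  f.avx512bw = __builtin_cpu_supports ("avx512bw") != 0;
  f.vpclmulqdq = __builtin_cpu_supports ("vpclmulqdq") != 0;
  return f;
}
#endif
```

Under these checks, both the Skylake and the Broadwell instances above
(avx2 and pclmulqdq set, vpclmulqdq not set) would select the pclmul path
rather than attempting the AVX2 one.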

The attached patch has these updated checks in place.

[1] <https://kib.kiev.ua/x86docs/Intel/ISAFuture/319433-033.pdf>

On Mon, 25 Nov 2024 at 23:42, Jeffrey Walton <noloader@gmail.com> wrote:

> On Mon, Nov 25, 2024 at 5:31 PM Sam Russell <sam.h.russell@gmail.com>
> wrote:
> >
> > Results thanks to Jeff
> >
> > srussell@icelake:~$ time ./cksum_bench_pclmul 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real    0m3.561s
> > user    0m3.535s
> > sys     0m0.026s
> > srussell@icelake:~$ time ./cksum_bench_avx2 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real    0m2.083s
> > user    0m2.047s
> > sys     0m0.036s
> > srussell@icelake:~$ time ./cksum_bench_avx512 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real    0m1.353s
> > user    0m1.320s
> > sys     0m0.033s
> >
> > Zero code change in the algorithm so we're effectively testing whether
> > I've calculated the constants correctly and whether I'm loading the
> > previous CRC into the correct part of the AVX register.
> >
> > Attached patch has Pádraig's feedback plus the new runtime check that
> > will enable the AVX2 version if avx512f is specified but the
> > avx512_supported() check has failed (because vpclmulqdq isn't set). I would
> > appreciate if anyone has a definitive answer on the correct way to test for
> > avx2+vpclmulqdq vs avx512+vpclmulqdq, and whether any chip exists that
> > supports a subset avx512 but also doesn't support vpclmulqdq on avx2...
>
> I don't believe you will encounter avx2+vpclmulqdq. According to the
> Intel Intrinsic Guide,[1] vpclmulqdq is AVX512. If you have AVX512,
> then AVX2 is a proper subset available to you. (You won't find AVX2
> plus a few AVX512 features. That combination will not show up on AVX2
> machines, like Skylake or Kaby Lake).
>
> According to the Intel Intrinsic Guide,[1] you should check for
> VPCLMULQDQ+AVX512VL _if_ you are using vpclmulqdq ymm, ymm, ymm, imm8
> form. You should check for VPCLMULQDQ alone _if_ you are using the
> vpclmulqdq zmm, zmm, zmm, imm8 form.
>
> [1] <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpclmulqdq>
>
> Jeff
>

Attachment: 0001-cksum-Use-AVX2-and-AVX512-for-speedup.patch
Description: Binary data

