Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
From: Sam Russell
Subject: Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date: Tue, 26 Nov 2024 00:27:38 +0100
The intrinsics guide is a nice find. I dug a bit deeper into the Intel®
Architecture Instruction Set Extensions and Future Features Programming
Reference [1] from March 2018, and it shows the 4 variants:
VEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag: VPCLMULQDQ
EVEX.NDS.128.66.0F3A.WIG 44 /r /ib VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8
CPUID feature flag: AVX512VL, VPCLMULQDQ
EVEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8
CPUID feature flag: AVX512VL, VPCLMULQDQ
EVEX.NDS.512.66.0F3A.WIG 44 /r /ib VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8
CPUID feature flag: AVX512F, VPCLMULQDQ
So the VPCLMULQDQ opcode needs AVX512VL and VPCLMULQDQ when it is encoded
with the EVEX prefix and uses xmm/ymm registers, or AVX512F and VPCLMULQDQ
when it uses zmm registers, but only VPCLMULQDQ when it is encoded with the
VEX prefix on 256-bit ymm registers. The build flags for the cksum_avx2
object are `-mpclmul -mavx -mavx2 -mvpclmulqdq`, so the lack of any avx512
flags should ensure it compiles to VEX and not EVEX.
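The four variants can be written down as a small lookup, which makes the rule easier to check against a given CPU. This is only a sketch, not code from the patch: `struct cpu_flags`, `enum encoding` and `vpclmulqdq_form_usable` are hypothetical names, and the VEX.128 form is left out because it is not one of the four variants quoted from [1].

```c
#include <stdbool.h>

/* Hypothetical feature record; the field names are illustrative only.  */
struct cpu_flags
{
  bool avx512f;
  bool avx512vl;
  bool vpclmulqdq;
};

enum encoding { ENC_VEX, ENC_EVEX };

/* Return true if VPCLMULQDQ may be used with encoding ENC on vectors of
   WIDTH bits, following the four variants listed in [1].  */
static bool
vpclmulqdq_form_usable (enum encoding enc, int width,
                        const struct cpu_flags *f)
{
  if (!f->vpclmulqdq)
    return false;               /* every listed variant needs this flag */

  if (enc == ENC_VEX)
    return width == 256;        /* VEX.256: VPCLMULQDQ alone */

  switch (width)
    {
    case 128:
    case 256:
      return f->avx512vl;       /* EVEX.128 and EVEX.256 */
    case 512:
      return f->avx512f;        /* EVEX.512 */
    default:
      return false;             /* not a width covered by the list */
    }
}
```

With the Skylake flags shown further down (avx512f and avx512vl set, vpclmulqdq clear) every form comes back as unusable, including the VEX.256 one that the cksum_avx2 object relies on.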
I did some more tests on some EC2 instances. The T2.micro does this:
$ ./cksum_bench_avx2 1024 1024
Hash: 6431C527, length: 1024
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU family: 6
Model: 79
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Stepping: 1
BogoMIPS: 4599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc
rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16
pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx
f16c rdrand hypervisor lahf_lm abm pti fsgsbase bmi1 avx2 smep bmi2 erms
invpcid xsaveopt
The T3.micro (skylake) does this:
$ ./cksum_bench_avx2 1024 1024
Illegal instruction (core dumped)
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
CPU family: 6
Model: 85
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 4
BogoMIPS: 4999.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni
pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx
avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl
xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
So despite the Skylake being a "better" processor and having nearly all the
avx512 extensions, it doesn't have vpclmulqdq set and therefore can't
execute the VEX-encoded vpclmulqdq opcode used by the AVX2 build. The
Broadwell, on the other hand, does handle the vpclmulqdq opcode and works,
although it doesn't have vpclmulqdq set either, so in theory we shouldn't
have tried it there and should instead have fallen back on the AVX pclmul
code for safety.
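One thing to watch when reading flag lists like the ones above: both machines report pclmulqdq, and "pclmulqdq" is a substring of "vpclmulqdq", so a plain substring search gives the wrong answer in one direction. A minimal sketch of a whole-token match follows; `has_flag` is a hypothetical helper, not something from the patch, and it assumes the flags are separated by whitespace as in the lscpu output.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Return true if NAME appears as a whole whitespace-separated token in
   FLAGS.  NAME is assumed to contain no whitespace itself.  */
static bool
has_flag (const char *flags, const char *name)
{
  size_t len = strlen (name);
  if (len == 0)
    return false;

  for (const char *p = flags; (p = strstr (p, name)) != NULL; p += len)
    {
      /* The match only counts if it is bounded by whitespace or by the
         start/end of the string on both sides.  */
      bool starts = p == flags || isspace ((unsigned char) p[-1]);
      bool ends = p[len] == '\0' || isspace ((unsigned char) p[len]);
      if (starts && ends)
        return true;
    }
  return false;
}
```

Given either flag list above, `has_flag (flags, "pclmulqdq")` is true and `has_flag (flags, "vpclmulqdq")` is false, which matches what lscpu reports for both instances.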
So the checks I propose are:
avx512:
VPCLMULQDQ and AVX512F (as per [1]) and AVX512BW (needed for the byteswap
operation)
avx2:
VPCLMULQDQ (we compile this with the VEX encoding afaik, so the AVX512VL
flag shouldn't be necessary on AVX512-capable processors, and this is
targeted at AVX2-capable processors) and AVX2 (to confirm the rest of the
opcodes are there)
pclmul:
PCLMUL and AVX (existing check)
Attached patch has these updated checks in place.
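For readers without the attachment, the three checks can be sketched as a single dispatch function. This is an illustration of the proposal and not the code in the patch: `struct cpu_features`, `enum cksum_impl`, `choose_impl` and `probe_cpu` are hypothetical names. The probe uses GCC's `__builtin_cpu_supports`, which is assumed to accept the "vpclmulqdq" feature name (reasonably recent GCC and Clang do); on other compilers or architectures it reports no features at all.

```c
#include <stdbool.h>

enum cksum_impl { IMPL_GENERIC, IMPL_PCLMUL, IMPL_AVX2, IMPL_AVX512 };

/* Hypothetical record of the CPU features the three checks look at.  */
struct cpu_features
{
  bool pclmul, avx, avx2, avx512f, avx512bw, vpclmulqdq;
};

/* Pick the widest implementation whose proposed check passes.  */
static enum cksum_impl
choose_impl (const struct cpu_features *f)
{
  if (f->vpclmulqdq && f->avx512f && f->avx512bw)
    return IMPL_AVX512;         /* zmm form plus the byteswap */
  if (f->vpclmulqdq && f->avx2)
    return IMPL_AVX2;           /* VEX-encoded ymm form */
  if (f->pclmul && f->avx)
    return IMPL_PCLMUL;         /* existing check */
  return IMPL_GENERIC;
}

/* Fill in the record for the running CPU.  */
static struct cpu_features
probe_cpu (void)
{
  struct cpu_features f = { 0 };
#if defined __GNUC__ && (defined __x86_64__ || defined __i386__)
  __builtin_cpu_init ();
  f.pclmul = __builtin_cpu_supports ("pclmul");
  f.avx = __builtin_cpu_supports ("avx");
  f.avx2 = __builtin_cpu_supports ("avx2");
  f.avx512f = __builtin_cpu_supports ("avx512f");
  f.avx512bw = __builtin_cpu_supports ("avx512bw");
  f.vpclmulqdq = __builtin_cpu_supports ("vpclmulqdq");
#endif
  return f;
}
```

With the flags reported above, both the T2.micro (Broadwell) and the T3.micro (Skylake) would select IMPL_PCLMUL, since neither has vpclmulqdq set. Keeping the decision separate from the probe also makes it possible to test the logic against flag sets from machines you don't have access to.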
[1] <https://kib.kiev.ua/x86docs/Intel/ISAFuture/319433-033.pdf>
On Mon, 25 Nov 2024 at 23:42, Jeffrey Walton <noloader@gmail.com> wrote:
> On Mon, Nov 25, 2024 at 5:31 PM Sam Russell <sam.h.russell@gmail.com>
> wrote:
> >
> > Results thanks to Jeff
> >
> > srussell@icelake:~$ time ./cksum_bench_pclmul 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real 0m3.561s
> > user 0m3.535s
> > sys 0m0.026s
> > srussell@icelake:~$ time ./cksum_bench_avx2 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real 0m2.083s
> > user 0m2.047s
> > sys 0m0.036s
> > srussell@icelake:~$ time ./cksum_bench_avx512 1048575 10000
> > Hash: 5B9DA0F4, length: 1048575
> >
> > real 0m1.353s
> > user 0m1.320s
> > sys 0m0.033s
> >
> > Zero code change in the algorithm so we're effectively testing whether
> > I've calculated the constants correctly and whether I'm loading the
> > previous CRC into the correct part of the AVX register.
> >
> > Attached patch has Pádraig's feedback plus the new runtime check that
> > will enable the AVX2 version if avx512f is specified but the
> > avx512_supported() check has failed (because vpclmulqdq isn't set). I would
> > appreciate if anyone has a definitive answer on the correct way to test for
> > avx2+vpclmulqdq vs avx512+vpclmulqdq, and whether any chip exists that
> > supports a subset avx512 but also doesn't support vpclmulqdq on avx2...
>
> I don't believe you will encounter avx2+vpclmulqdq. According to the
> Intel Intrinsic Guide,[1] vpclmulqdq is AVX512. If you have AVX512,
> then AVX2 is a proper subset available to you. (You won't find AVX2
> plus a few AVX512 features. That combination will not show up on AVX2
> machines, like Skylake or Kaby Lake).
>
> According to the Intel Intrinsic Guide,[1] you should check for
> VPCLMULQDQ+AVX512VL _if_ you are using vpclmulqdq ymm, ymm, ymm, imm8
> form. You should check for VPCLMULQDQ alone _if_ you are using the
> vpclmulqdq zmm, zmm, zmm, imm8 form.
>
> [1] <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=vpclmulqdq>
>
> Jeff
>
0001-cksum-Use-AVX2-and-AVX512-for-speedup.patch
Description: Binary data