Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup

From:	Sam Russell
Subject:	Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date:	Mon, 25 Nov 2024 19:05:56 +0100

Actually, looking over https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
shows this:

‘icelake-client’
Intel Ice Lake Client CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3,
SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL,
FSGSBASE, RDRND, F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED,
ADCX, PREFETCHW, AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, AVX512VL,
AVX512BW, AVX512DQ, AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI,
GFNI, VAES, AVX512VBMI2, VPCLMULQDQ, AVX512BITALG, RDPID and
AVX512VPOPCNTDQ instruction set support.

‘icelake-server’
Intel Ice Lake Server CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3,
SSSE3, SSE4.1, SSE4.2, POPCNT, CX16, SAHF, FXSR, AVX, XSAVE, PCLMUL,
FSGSBASE, RDRND, F16C, AVX2, BMI, BMI2, LZCNT, FMA, MOVBE, HLE, RDSEED,
ADCX, PREFETCHW, AES, CLFLUSHOPT, XSAVEC, XSAVES, SGX, AVX512F, AVX512VL,
AVX512BW, AVX512DQ, AVX512CD, PKU, AVX512VBMI, AVX512IFMA, SHA, AVX512VNNI,
GFNI, VAES, AVX512VBMI2, VPCLMULQDQ, AVX512BITALG, RDPID, AVX512VPOPCNTDQ,
PCONFIG, WBNOINVD and CLWB instruction set support.

VPCLMULQDQ does appear to be the correct feature to check, but
__builtin_cpu_supports ("vpclmulqdq") is returning true on this machine
(possibly as a misinterpretation of the avx2 implementation?)
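
One way to cross-check what the builtin reports is to read the CPUID bits
directly. The following is a minimal sketch, not the code in the patch: all
function names are made up for illustration, and it assumes GCC or Clang on
x86 with <cpuid.h> available. It relies on CPUID leaf 7, subleaf 0, where
EBX bit 16 is AVX512F and ECX bit 10 is VPCLMULQDQ, and on XCR0 bits 1, 2,
5, 6 and 7 to confirm that the OS saves the opmask and ZMM registers. The
512-bit EVEX form needs all of these.

```c
#include <stdbool.h>

#if defined (__x86_64__) || defined (__i386__)
# include <cpuid.h>   /* GCC/Clang wrappers around the CPUID instruction */

/* CPUID.(EAX=7,ECX=0):EBX bit 16 is AVX512F.  */
static bool
cpuid_has_avx512f (void)
{
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return (ebx >> 16) & 1;
}

/* CPUID.(EAX=7,ECX=0):ECX bit 10 is VPCLMULQDQ.  */
static bool
cpuid_has_vpclmulqdq (void)
{
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
    return false;
  return (ecx >> 10) & 1;
}

/* True if CPUID.1:ECX bit 27 (OSXSAVE) is set and XCR0 has bits 1, 2
   (SSE, AVX) and 5, 6, 7 (opmask, ZMM_Hi256, Hi16_ZMM) all set.  */
static bool
os_enables_zmm (void)
{
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx) || !((ecx >> 27) & 1))
    return false;
  unsigned int lo, hi;
  __asm__ volatile ("xgetbv" : "=a" (lo), "=d" (hi) : "c" (0));
  (void) hi;
  return (lo & 0xE6) == 0xE6;
}

/* What the compiler builtin reports, for comparison.  */
static bool
builtin_reports_evex_vpclmulqdq (void)
{
  __builtin_cpu_init ();
  return (__builtin_cpu_supports ("avx512f")
          && __builtin_cpu_supports ("vpclmulqdq"));
}
#else
/* Not x86: none of these features exist.  */
static bool cpuid_has_avx512f (void) { return false; }
static bool cpuid_has_vpclmulqdq (void) { return false; }
static bool os_enables_zmm (void) { return false; }
static bool builtin_reports_evex_vpclmulqdq (void) { return false; }
#endif

/* Hypothetical gate for a 512-bit VPCLMULQDQ code path.  */
static bool
evex_vpclmulqdq_usable (void)
{
  return (cpuid_has_avx512f () && cpuid_has_vpclmulqdq ()
          && os_enables_zmm ());
}
```

Printing evex_vpclmulqdq_usable () next to
builtin_reports_evex_vpclmulqdq () on the failing VM should show whether the
builtin disagrees with the raw CPUID bits.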

On Mon, 25 Nov 2024 at 18:59, Sam Russell <sam.h.russell@gmail.com> wrote:

> > Impressive. What CPU was that exactly.
>
> AMD Ryzen 5 5600 6-Core Processor
>
> > There is a copy/paste issue:
> > Also `make syntax-check` indicates some lines are > 80 chars.
> > This improvement should be added to NEWS.
>
> Thanks, will fix these
>
> > What compiler version are you using?
>
> $ gcc -v
> Using built-in specs.
> COLLECT_GCC=gcc
> COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper
> OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
> OFFLOAD_TARGET_DEFAULT=1
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu
> 13.2.0-23ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs
> --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
> --with-gcc-major-version-only --program-suffix=-13
> --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
> --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix
> --libdir=/usr/lib --enable-nls --enable-clocale=gnu
> --enable-libstdcxx-debug --enable-libstdcxx-time=yes
> --with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace
> --enable-gnu-unique-object --disable-vtable-verify --enable-plugin
> --enable-default-pie --with-system-zlib --enable-libphobos-checking=release
> --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
> --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64
> --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
> --enable-offload-targets=nvptx-none=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-gcn/usr
> --enable-offload-defaulted --without-cuda-driver --enable-checking=release
> --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
> Thread model: posix
> Supported LTO compression algorithms: zlib zstd
> gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
>
> > Can you show the output of `grep flags /proc/cpuinfo | head -n1` on the
> > VM.
>
> I only spun it up for a few minutes to verify the app worked and then
> closed it down, but this was the lscpu output when I ran it (avx512f
> should mean that vpclmullqlqdq is supported)
>
> $ lscpu
> Architecture:             x86_64
>   CPU op-mode(s):         32-bit, 64-bit
>   Address sizes:          46 bits physical, 48 bits virtual
>   Byte Order:             Little Endian
> CPU(s):                   2
>   On-line CPU(s) list:    0,1
> Vendor ID:                GenuineIntel
>   Model name:             Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
>     CPU family:           6
>     Model:                85
>     Thread(s) per core:   2
>     Core(s) per socket:   1
>     Socket(s):            1
>     Stepping:             7
>     BogoMIPS:             4999.99
>     Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
>                           pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss
>                           ht syscall nx pdpe1gb rdtscp lm constant_tsc
>                           rep_good nopl xtopology nonstop_tsc cpuid
>                           tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid
>                           sse4_1 sse4_2 x2apic movbe popcnt
>                           tsc_deadline_timer aes xsave avx f16c rdrand
>                           hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase
>                           tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx
>                           avx512f avx512dq rdseed adx smap clflushopt clwb
>                           avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
>                           xsaves ida arat pku ospke
> Virtualization features:
>   Hypervisor vendor:      KVM
>   Virtualization type:    full
> Caches (sum of all):
>   L1d:                    32 KiB (1 instance)
>   L1i:                    32 KiB (1 instance)
>   L2:                     1 MiB (1 instance)
>   L3:                     35.8 MiB (1 instance)
> NUMA:
>   NUMA node(s):           1
>   NUMA node0 CPU(s):      0,1
> Vulnerabilities:
>   Gather data sampling:   Unknown: Dependent on hypervisor status
>   Itlb multihit:          KVM: Mitigation: VMX unsupported
>   L1tf:                   Mitigation; PTE Inversion
>   Mds:                    Vulnerable: Clear CPU buffers attempted, no
>                           microcode; SMT Host state unknown
>   Meltdown:               Mitigation; PTI
>   Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no
>                           microcode; SMT Host state unknown
>   Reg file data sampling: Not affected
>   Retbleed:               Vulnerable
>   Spec rstack overflow:   Not affected
>   Spec store bypass:      Vulnerable
>   Spectre v1:             Mitigation; usercopy/swapgs barriers and __user
>                           pointer sanitization
>   Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB
>                           filling; PBRSB-eIBRS Not affected; BHI Retpoline
>   Srbds:                  Not affected
>   Tsx async abort:        Not affected
>
> It's an EC2 T3.micro instance, and they specifically market their T3
> servers as being skylake and AVX512 capable
>
> GDB output from the crash:
>
> Dump of assembler code from 0x555555555de9 to 0x555555555e07:
> => 0x0000555555555de9 <cksum_avx512+2944>: vpclmullqlqdq zmm0,zmm0,zmm1
>    0x0000555555555df0 <cksum_avx512+2951>: vmovdqa64 ZMMWORD PTR [rsp+0x400],zmm0
>    0x0000555555555df8 <cksum_avx512+2959>: vmovdqa64 zmm0,ZMMWORD PTR [rsp+0x200]
>    0x0000555555555e00 <cksum_avx512+2967>: vmovdqa64 zmm1,ZMMWORD PTR [rsp+0x340]
>
> This is the first vpclmulqdq opcode in the binary.
>
> On further searching it appears that VPCLMULQDQ was only introduced in Ice
> Lake. We already check for vpclmulqdq, but I'm assuming this only detects
> the AVX2 version; I'm not sure what flag we'd need to check to see if the
> AVX512 (EVEX) version is enabled.
>
> On Mon, 25 Nov 2024 at 18:37, Pádraig Brady <P@draigbrady.com> wrote:
>
>> On 25/11/2024 16:04, Sam Russell wrote:
>> > I've added a sample benchmarking program to measure the difference
>> > without hitting disk, looking like a 40% speedup
>> >
>> > $ time ./cksum_bench_pclmul 1048576 10000
>> > Hash: EFA0B24F, length: 1048576
>> >
>> > real    0m3.018s
>> > user    0m3.018s
>> > sys     0m0.000s
>> >
>> > $ time ./cksum_bench_avx2 1048576 10000
>> > Hash: EFA0B24F, length: 1048576
>> >
>> > real    0m1.824s
>> > user    0m1.804s
>> > sys     0m0.020s
>>
>> Impressive. What CPU was that exactly.
>>
>> > The code effectively replicates the existing pclmul code and has new
>> > constants generated for the larger folds. The main gotcha was that the
>> > previous CRC gets inserted at a weird offset due to endianness and byte
>> > swapping.
>>
>> There is a copy/paste issue:
>>
>> diff --git a/src/cksum.c b/src/cksum.c
>> index 65424fe88..3eab1fbd4 100644
>> --- a/src/cksum.c
>> +++ b/src/cksum.c
>> @@ -186,8 +186,8 @@ avx512_supported (void)
>>     if (cksum_debug)
>>       error (0, 0, "%s",
>>              (avx512_enabled
>> -            ? _("using avx2 hardware support")
>> -            : _("avx2 support not detected")));
>> +            ? _("using avx512 hardware support")
>> +            : _("avx512 support not detected")));
>>
>>     return avx512_enabled;
>>   }
>>
>>
>> Also `make syntax-check` indicates some lines are > 80 chars.
>>
>> This improvement should be added to NEWS.
>>
>> > I don't have a skylake processor so I spun up an AWS instance to test
>> > out the AVX512 version, it turns out there's a bug where virtualisation
>> > environments don't handle the AVX512 pclmul correctly despite the CPU
>> > supporting it. It might be worth us disabling this for now as it does
>> > get past the __builtin_cpu_supports() gate but then throws an illegal
>> > instruction halfway through the function. It would be nice if we could
>> > at least validate it for now though.
>> >
>> > AVX2 has been around over 10 years though so this seems to be a safer
>> > addition.
>>
>> Yes, we'd have to leave the avx512 code disabled by default
>> if we couldn't find a way around this issue.
>> It's a surprising issue TBH.
>> What compiler version are you using?
>> Can you show the output of `grep flags /proc/cpuinfo | head -n1` on the
>> VM.
>> There was a gcc bug in this area, but that was a while ago.
>> Unlikely, but maybe it resurfaced with avx512?
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100
>>
>> thanks!
>> Pádraig
>>
>
