Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup

coreutils


From:	Sam Russell
Subject:	Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date:	Mon, 25 Nov 2024 18:59:00 +0100

> Impressive. What CPU was that exactly.

AMD Ryzen 5 5600 6-Core Processor

> There is a copy/paste issue:
> Also `make syntax-check` indicates some lines are > 80 chars.
> This improvement should be added to NEWS.

Thanks, will fix these

> What compiler version are you using?

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
13.2.0-23ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-13
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-libstdcxx-backtrace
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-13-uJ7kn6/gcc-13-13.2.0/debian/tmp-gcn/usr
--enable-offload-defaulted --without-cuda-driver --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)

> Can you show the output of `grep flags /proc/cpuinfo | head -n1` on the VM.

I only spun it up for a few minutes to verify the app worked and then
shut it down, but this was the lscpu output when I ran it (avx512f should
mean that vpclmullqlqdq is supported):

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:           6
    Model:                85
    Thread(s) per core:   2
    Core(s) per socket:   1
    Socket(s):            1
    Stepping:             7
    BogoMIPS:             4999.99
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
                          mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
                          syscall nx pdpe1gb rdtscp lm constant_tsc rep_good
                          nopl xtopology nonstop_tsc cpuid tsc_known_freq pni
                          pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic
                          movbe popcnt tsc_deadline_timer aes xsave avx f16c
                          rdrand hypervisor lahf_lm abm 3dnowprefetch pti
                          fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
                          mpx avx512f avx512dq rdseed adx smap clflushopt clwb
                          avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
                          xsaves ida arat pku ospke
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    full
Caches (sum of all):
  L1d:                    32 KiB (1 instance)
  L1i:                    32 KiB (1 instance)
  L2:                     1 MiB (1 instance)
  L3:                     35.8 MiB (1 instance)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0,1
Vulnerabilities:
  Gather data sampling:   Unknown: Dependent on hypervisor status
  Itlb multihit:          KVM: Mitigation: VMX unsupported
  L1tf:                   Mitigation; PTE Inversion
  Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Reg file data sampling: Not affected
  Retbleed:               Vulnerable
  Spec rstack overflow:   Not affected
  Spec store bypass:      Vulnerable
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
  Srbds:                  Not affected
  Tsx async abort:        Not affected

It's an EC2 T3.micro instance, and AWS specifically markets its T3
servers as being Skylake and AVX512 capable.

GDB output from the crash:

Dump of assembler code from 0x555555555de9 to 0x555555555e07:
=> 0x0000555555555de9 <cksum_avx512+2944>:      vpclmullqlqdq zmm0,zmm0,zmm1
   0x0000555555555df0 <cksum_avx512+2951>:      vmovdqa64 ZMMWORD PTR [rsp+0x400],zmm0
   0x0000555555555df8 <cksum_avx512+2959>:      vmovdqa64 zmm0,ZMMWORD PTR [rsp+0x200]
   0x0000555555555e00 <cksum_avx512+2967>:      vmovdqa64 zmm1,ZMMWORD PTR [rsp+0x340]

This is the first vpclmulqdq opcode in the binary.

On further searching, it appears that VPCLMULQDQ was only introduced in Ice
Lake. We already check for vpclmulqdq, but I'm assuming this only detects
the AVX2 version; I'm not sure what flag we'd need to check to see if the
AVX512 (EVEX) version is enabled.
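
One possible stricter gate, as a minimal sketch rather than the actual
patch: Intel documents the 512-bit EVEX form of VPCLMULQDQ as requiring
both the AVX512F and the VPCLMULQDQ CPUID features, and Linux lists the
latter as a separate `vpclmulqdq` flag. The helper names below (has_flag,
flags_allow_evex512_clmul, cpu_allows_evex512_clmul) are hypothetical, and
the runtime check assumes a GCC or Clang recent enough to accept
"vpclmulqdq" in __builtin_cpu_supports().

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if NAME occurs as a whole whitespace-separated word in FLAGS,
   a string in the style of the "flags" line of /proc/cpuinfo.  Whole-word
   matching matters here: "pclmulqdq" is a substring of "vpclmulqdq".  */
bool
has_flag (char const *flags, char const *name)
{
  size_t len = strlen (name);
  if (len == 0)
    return false;
  for (char const *p = strstr (flags, name); p != NULL;
       p = strstr (p + 1, name))
    {
      bool at_start = p == flags || isspace ((unsigned char) p[-1]);
      bool at_end = p[len] == '\0' || isspace ((unsigned char) p[len]);
      if (at_start && at_end)
        return true;
    }
  return false;
}

/* Hypothetical flags-based gate: the 512-bit carry-less multiply needs
   both avx512f and vpclmulqdq to be reported.  */
bool
flags_allow_evex512_clmul (char const *flags)
{
  return has_flag (flags, "avx512f") && has_flag (flags, "vpclmulqdq");
}

/* Hypothetical runtime gate using the compiler builtin.  Returns false on
   non-x86 targets, where the builtin is not available.  */
bool
cpu_allows_evex512_clmul (void)
{
#if (defined __x86_64__ || defined __i386__) && defined __GNUC__
  __builtin_cpu_init ();
  return (__builtin_cpu_supports ("avx512f")
          && __builtin_cpu_supports ("vpclmulqdq"));
#else
  return false;
#endif
}
```

Applied to the lscpu flags above, the flags-based check returns false:
avx512f and pclmulqdq are listed but vpclmulqdq is not, which would be
consistent with the illegal instruction.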

On Mon, 25 Nov 2024 at 18:37, Pádraig Brady <P@draigbrady.com> wrote:

> On 25/11/2024 16:04, Sam Russell wrote:
> > I've added a sample benchmarking program to measure the difference
> > without hitting disk, looking like a 40% speedup
> >
> > $ time ./cksum_bench_pclmul 1048576 10000
> > Hash: EFA0B24F, length: 1048576
> >
> > real    0m3.018s
> > user    0m3.018s
> > sys     0m0.000s
> >
> > $ time ./cksum_bench_avx2 1048576 10000
> > Hash: EFA0B24F, length: 1048576
> >
> > real    0m1.824s
> > user    0m1.804s
> > sys     0m0.020s
>
> Impressive. What CPU was that exactly.
>
> > The code effectively replicates the existing pclmul code and has new
> > constants generated for the larger folds. The main gotcha was that the
> > previous CRC gets inserted at a weird offset due to endianness and byte
> > swapping.
>
> There is a copy/paste issue:
>
> diff --git a/src/cksum.c b/src/cksum.c
> index 65424fe88..3eab1fbd4 100644
> --- a/src/cksum.c
> +++ b/src/cksum.c
> @@ -186,8 +186,8 @@ avx512_supported (void)
>     if (cksum_debug)
>       error (0, 0, "%s",
>              (avx512_enabled
> -            ? _("using avx2 hardware support")
> -            : _("avx2 support not detected")));
> +            ? _("using avx512 hardware support")
> +            : _("avx512 support not detected")));
>
>     return avx512_enabled;
>   }
>
>
> Also `make syntax-check` indicates some lines are > 80 chars.
>
> This improvement should be added to NEWS.
>
> > I don't have a skylake processor so I spun up an AWS instance to test out
> > the AVX512 version, it turns out there's a bug where virtualisation
> > environments don't handle the AVX512 pclmul correctly despite the CPU
> > supporting it. It might be worth us disabling this for now as it does get
> > past the __builtin_cpu_supports() gate but then throws an illegal
> > instruction halfway through the function. It would be nice if we could at
> > least validate it for now though.
> >
> > AVX2 has been around over 10 years though so this seems to be a safer
> > addition.
>
> Yes, we'd have to leave the avx512 code disabled by default
> if we couldn't find a way around this issue.
> It's a surprising issue TBH.
> What compiler version are you using?
> Can you show the output of `grep flags /proc/cpuinfo | head -n1` on the VM.
> There was a gcc bug in this area, but that was a while ago.
> Unlikely, but maybe it resurfaced with avx512?
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85100
>
> thanks!
> Pádraig
>

