Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup


From: Jeffrey Walton
Subject: Re: [PATCH] cksum: Use AVX2 and AVX512 for speedup
Date: Mon, 25 Nov 2024 13:17:00 -0500

On Mon, Nov 25, 2024 at 11:09 AM Sam Russell <sam.h.russell@gmail.com> wrote:
>
> I've added a sample benchmarking program to measure the difference without
> hitting disk. It looks like a 40% speedup.
>
> $ time ./cksum_bench_pclmul 1048576 10000
> Hash: EFA0B24F, length: 1048576
>
> real    0m3.018s
> user    0m3.018s
> sys     0m0.000s
>
> $ time ./cksum_bench_avx2 1048576 10000
> Hash: EFA0B24F, length: 1048576
>
> real    0m1.824s
> user    0m1.804s
> sys     0m0.020s
>
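
To pin down what "40%" means here, the two "real" times quoted above can be turned into a throughput ratio and a wall-time reduction. This is only a rough sketch: it assumes the two command-line arguments are the buffer size in bytes and the iteration count (the output confirms only the length), and the helper names are made up for illustration.

```python
def speedup(baseline_s, candidate_s):
    """Return (throughput ratio, fractional reduction in wall time)."""
    return baseline_s / candidate_s, 1.0 - candidate_s / baseline_s


def throughput_gb_s(buffer_bytes, iterations, seconds):
    """Bytes hashed per second, in GB/s (1 GB = 1e9 bytes)."""
    return buffer_bytes * iterations / seconds / 1e9


# "real" times from the two runs quoted above.
PCLMUL_S = 3.018
AVX2_S = 1.824

# Assumption: the arguments mean <buffer bytes> <iterations>.
BUFFER_BYTES = 1048576
ITERATIONS = 10000

ratio, saved = speedup(PCLMUL_S, AVX2_S)
print(f"{ratio:.2f}x throughput, {saved:.1%} less wall time")
print(f"pclmul: {throughput_gb_s(BUFFER_BYTES, ITERATIONS, PCLMUL_S):.2f} GB/s")
print(f"avx2:   {throughput_gb_s(BUFFER_BYTES, ITERATIONS, AVX2_S):.2f} GB/s")
```

Under those assumptions the run time drops by roughly 40%, which corresponds to about 1.65 times the throughput.
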
> The code effectively replicates the existing pclmul code and has new
> constants generated for the larger folds. The main gotcha was that the
> previous CRC gets inserted at a weird offset due to endianness and byte
> swapping.
>
> I don't have a Skylake processor so I spun up an AWS instance to test out
> the AVX512 version. It turns out there's a bug where virtualisation
> environments don't handle the AVX512 pclmul correctly despite the CPU
> supporting it.

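Client Skylake has AVX and AVX2; not AVX512. (The Skylake-X and
Skylake-SP parts do have AVX512F, but not VPCLMULQDQ, which first
shipped with Icelake.)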

I can provide you remote access to an Icelake machine with AVX512.
Email me your authorized_keys file, and I'll send you the login
information.

For completeness, here are Icelake's feature flags:

   $ cat /proc/cpuinfo | fold -w 72 -s
   processor       : 0
   vendor_id       : GenuineIntel
   cpu family      : 6
   model           : 126
   model name      : Intel(R) Core(TM) i5-1035G1 CPU @ 1.00GHz
   stepping        : 5
   ...
   cpuid level     : 27
   flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
   mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
   syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
   rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni
   pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr
   pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
   xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd
   ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad
   fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid avx512f
   avx512dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni
   avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect
   dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
   hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
   vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid sgx_lc fsrm
   md_clear flush_l1d arch_capabilities
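
When comparing a guest against a listing like the one above, it may help to check the flags programmatically rather than by eye. The sketch below parses /proc/cpuinfo-style text and reports which flags from a required set are absent. The set used for the AVX512 path (avx512f, avx512bw, vpclmulqdq) is my assumption about what the patch needs, not something taken from the patch, and the function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU listed in cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line, e.g. on a non-x86 kernel


def missing_features(flags, required):
    """Return the required flags that are absent, sorted by name."""
    return sorted(set(required) - set(flags))


# Assumption: features the AVX512 code path depends on.
AVX512_PATH = {"avx512f", "avx512bw", "vpclmulqdq"}

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except OSError:
        flags = set()  # /proc/cpuinfo is Linux-specific
    absent = missing_features(flags, AVX512_PATH)
    print("missing:", " ".join(absent) if absent else "none")
```

On the Icelake listing above nothing from that set is missing. A guest that shows avx512f without vpclmulqdq would be reported as lacking vpclmulqdq.
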

> It might be worth us disabling this for now as it does get
> past the __builtin_cpu_supports() gate but then throws an illegal
> instruction halfway through the function. It would be nice if we could at
> least validate it for now though.
>
> AVX2 has been around over 10 years though so this seems to be a safer
> addition.

Jeff


