bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] crc: Add PCLMUL implementation


From: Sam Russell
Subject: Re: [PATCH] crc: Add PCLMUL implementation
Date: Tue, 26 Nov 2024 22:59:18 +0100

> Thue use of _mm_loadu_si128 provides for unaligned byte arrays (that's
> the 'u' in the 'loadu'), so you will be Ok there, too.

Thanks Jeff, I wasn't going to push this with a "works for me" without knowing why. I'll remove the alignment code.

> I believe the way to zero a __m128i is using _mm_setzero_si128(). It

Perfect, it's outside the main loop so I'm not worried but this does look more correct.

On Tue, 26 Nov 2024 at 22:47, Jeffrey Walton <noloader@gmail.com> wrote:
On Tue, Nov 26, 2024 at 4:27 PM Sam Russell <sam.h.russell@gmail.com> wrote:
>
> I've added an alignment check in lib/crc, it looks like the code works okay without it for me but an _m128 is supposed to be 128-bit aligned so I'm happy that I've added it.

The _m128i's are naturally aligned. They will be ok:

+        in1 = _mm_loadu_si128(data);
+        in2 = _mm_loadu_si128(data + 1);
+        in3 = _mm_loadu_si128(data + 2);
+        in4 = _mm_loadu_si128(data + 3);

Thue use of _mm_loadu_si128 provides for unaligned byte arrays (that's
the 'u' in the 'loadu'), so you will be Ok there, too.

> The attached patch renames the module to crc-x86_64 while keeping the source file crc-x86_64-pclmul.c, as well as the alignment check above.
>
> test-crc and bench-crc are fine, gzip builds with this gnulib and decompresses my test file with no hash error

A quick  comment from the patch...

+    __m128i in1 = {0};
+    __m128i in2 = {0};
+    __m128i in3 = {0};
+    __m128i in4 = {0};

I believe the way to zero a __m128i is using _mm_setzero_si128(). It
performs a pxor, which is fast. So I would expect to see:

    __m128i in1 = _mm_setzero_si128();
    __m128i in2 = _mm_setzero_si128();
    __m128i in3 = _mm_setzero_si128();
    __m128i in4 = _mm_setzero_si128();

I'm not sure what the assignment does, and if it always generates the
best code. If the assignment loads a zeroized buffer using movdqa or
movdqu, then I would expect it to be slower than the pxor.

Also see <https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_setzero_si128>.

Jeff

reply via email to

[Prev in Thread] Current Thread [Next in Thread]