coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] wc: speed-up by simplifying avx code


From: Evgeny Nizhibitsky
Subject: Re: [PATCH] wc: speed-up by simplifying avx code
Date: Sun, 31 Mar 2024 17:58:31 +0000

Yes, it's true that simplifying and speeding-up by the bufsize increase are
two different things although the former allowed the latter.

I just landed more tests with hyperfine for various configurations spanning
over the current master version and a new approach with a range of bufsizes
from 16 KiB up to 1 MiB, running on 1 billion yes'es like you did (1by), a
generated file for the recent 1 billion row challenge (1brc, with entries
like "<station name>;<temperature:0.2f>") and the first 100 million rows
for both of them (100my and 100mrc, respectively), all in /dev/shm, yet
again with 7800X3D:

The reported timings are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 21.3 ms ± 1.0 ms | 163.1 ms ± 1.5 ms | 197.1 ms ± 3.0 ms | 1.680
s ±  0.010 s |
| 16 KiB | 21.0 ms ± 1.1 ms | 163.7 ms ± 2.1 ms | 194.3 ms ± 2.5 ms | 1.658
s ± 0.015 s |
| 32 KiB | 20.2 ms ± 0.7 ms | 158.9 ms ± 3.0 ms | 194.6 ms ± 6.4 ms | 1.620
s ± 0.023 s |
| 64 KiB | 19.8 ms ± 0.6 ms | 154.0 ms ± 5.3 ms | 187.5 ms ± 7.2 ms | 1.553
s ± 0.013 s |
| 128 KiB | 18.8 ms ± 0.6 ms | 148.9 ms ± 5.4 ms | 178.4 ms ± 1.3 ms |
1.530 s ± 0.013 s |
| 256 KiB | 19.2 ms ± 0.8 ms | 145.8 ms ± 1.5 ms | 176.4 ms ± 1.6 ms |
1.522 s ± 0.017 s |
| 512 KiB | 19.6 ms ± 0.7 ms | 146.4 ms ± 1.0 ms | 183.0 ms ± 5.0 ms |
1.512 s ± 0.014 s |
| 1 MiB | 19.3 ms ± 0.7 ms | 145.7 ms ± 1.8 ms | 188.4 ms ± 6.2 ms | 1.499
s ± 0.012 s |

And the corresponding speed-up values are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 0% | 0% | 0% | 0% |
| 16 KiB | 1% | 0% | 1% | 1% |
| 32 KiB | 5% | 3% | 1% | 4% |
| 64 KiB | 8% | 6% | 5% | 8% |
| 128 KiB | 13% | 10% | 10% | 10% |
| 256 KiB | 11% | 12% | 12% | 10% |
| 512 KiB | 9% | 11% | 8% | 11% |
| 1 MiB | 10% | 12% | 5% | 12% |

So again in my case the new approach is on par with the old one while the
sweet spot bufsize of 256 KB seems to bring the best value.

Still more testing on different CPUs and sample files should probably be
conducted.

Regards,
Evgeny

On Sun, 31 Mar 2024 at 18:17, Pádraig Brady <P@draigbrady.com> wrote:

> On 31/03/2024 13:12, Pádraig Brady wrote:
> > On 31/03/2024 00:18, Evgeny Nizhibitsky wrote:
> >> Here is the proposed patch for both simplifying and consistently
> speeding up the avx version of wc -l by 10% in up to 1 billion rows
> scenarios on 7800X3D (probably should be tested on different data samples
> and CPUs).
> >
> > The patch was mangled, but I manually applied it.
> > Probably best to attach rather than pasting any further patches.
> > Attaching here in case others want to try.
> >
> > This is good as it simplifies the code,
> > and should have the same portability, to machines and compilers.
> > I'll adjust the configure.ac check to be more aligned.
> >
> > As for performance, I tested on my laptop with no change:
> >
> >     # on an i7-5600U with 1 billion short lines
> >     $ yes | head -n1000000000 > /dev/shm/yes
> >
> >     $ time src/wc-old -l /dev/shm/yes
> >     1000000000 /dev/shm/yes
> >     real    0m0.351s
> >     user    0m0.060s
> >     sys     0m0.288s
> >
> >     $ time src/wc-new -l /dev/shm/yes
> >     1000000000 /dev/shm/yes
> >     real    0m0.356s
> >     user    0m0.098s
> >     sys     0m0.255s
> >
> > Since you change the I/O size from 16 to 256 KiB,
> > it's more aligned with the recent I/O size adjustment in:
> > https://github.com/coreutils/coreutils/commit/fcfba90d0
> > In fact perhaps much of the speedup is just from that change.
> > Can you test on your system with the buffer reduced back to 16KiB
> > to see how much that impacts the performance?
>
> Oh I see you commented in the code that the 10-15% speed-up
> was due to the buffer size change.
>
> In testing more on my i7-5600U laptop, shows the 16 -> 256 KiB change
> improved performance by about 5%.  On the other hand, the new logic
> is about 5% slower on my laptop, cancelling out any win.
>
> The new code is simpler though, so still a win in that regard.
> I'll test on a few more platforms for comparison.
>
> cheers,
> Pádraig
>


reply via email to

[Prev in Thread] Current Thread [Next in Thread]