[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#40533: GNU sort could be faster on machines with more cores
From: |
Ole Tange |
Subject: |
bug#40533: GNU sort could be faster on machines with more cores |
Date: |
Fri, 10 Apr 2020 13:11:31 +0200 |
$ sort --version
sort (GNU coreutils) 8.28
I am the author of GNU Parallel and my fetish is to try to run more
stuff in parallel. I recently sorted a 2.4 GBytes/100 M lines file:
export LC_ALL=C
time sort --parallel=48 bigfile >/dev/null
This takes 87 seconds on my 48 core machine. Everything is done in a
RAM disk. By adjusting 48 I find the optimal is --parallel=46: 80
seconds.
But it is possible to do better than that.
A big part of the final part of the sorting is a single tread. It is
likely merging the results of the other threads, but it leaves the
other 47 cores idle.
So I thought: Can we parallelize that more?
The reasoning is: If you have N-1 cores sitting idle, you really do
not care if you waste some CPU cycles, if you can get the result
faster.
So I tested.
Let us start by looking at an easier case: We do not have 1 big file,
but the bigfile is split into 48 files of similar sizes.
In this case we can do:
sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file1) <(sort file2) ) <(sort -m <(sort file3) <(sort file4) ) )
<(sort -m <(sort -m <(sort file5) <(sort file6) ) <(sort -m <(sort
file7) <(sort file8) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file9)
<(sort file10) ) <(sort -m <(sort file11) <(sort file12) ) ) <(sort -m
<(sort -m <(sort file13) <(sort file14) ) <(sort -m <(sort file15)
<(sort file16) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file17) <(sort file18) ) <(sort -m <(sort file19) <(sort file20) ) )
<(sort -m <(sort -m <(sort file21) <(sort file22) ) <(sort -m <(sort
file23) <(sort file24) ) ) ) <(sort -m <(sort -m <(sort -m <(sort
file25) <(sort file26) ) <(sort -m <(sort file27) <(sort file28) ) )
<(sort -m <(sort -m <(sort file29) <(sort file30) ) <(sort -m <(sort
file31) <(sort file32) ) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort
-m <(sort -m <(sort file33) <(sort file34) ) <(sort -m <(sort file35)
<(sort file36) ) ) <(sort -m <(sort -m <(sort file37) <(sort file38) )
<(sort -m <(sort file39) <(sort file40) ) ) ) <(sort -m <(sort -m
<(sort -m <(sort file41) <(sort file42) ) <(sort -m <(sort file43)
<(sort file44) ) ) <(sort -m <(sort -m <(sort file45) <(sort file46) )
<(sort -m <(sort file47) <(sort file48) ) ) ) ) )
Or indented:
sort -m \
<(sort -m \
<(sort -m \
<(sort -m \
<(sort -m \
<(sort -m <(sort file1) <(sort file2) )\
<(sort -m <(sort file3) <(sort file4) ) )\
<(sort -m \
<(sort -m <(sort file5) <(sort file6) )\
<(sort -m <(sort file7) <(sort file8) ) ) )\
<(sort -m \
<(sort -m \
<(sort -m <(sort file9) <(sort file10) )\
<(sort -m <(sort file11) <(sort file12) ) )\
<(sort -m \
<(sort -m <(sort file13) <(sort file14) )\
<(sort -m <(sort file15) <(sort
file16) ) ) ) )\
<(sort -m \
<(sort -m \
<(sort -m \
<(sort -m <(sort file17) <(sort file18) )\
<(sort -m <(sort file19) <(sort file20) ) )\
<(sort -m <(sort -m <(sort file21) <(sort file22) )\
<(sort -m <(sort file23) <(sort file24) ) ) )\
<(sort -m \
<(sort -m \
<(sort -m <(sort file25) <(sort file26) )\
<(sort -m <(sort file27) <(sort file28) ) )\
<(sort -m \
<(sort -m <(sort file29) <(sort file30) )\
<(sort -m <(sort file31) <(sort
file32) ) ) ) ) ) \
<(sort -m \
<(sort -m \
<(sort -m \
<(sort -m \
<(sort -m <(sort file33) <(sort file34) )\
<(sort -m <(sort file35) <(sort file36) ) )\
<(sort -m <(sort -m <(sort file37) <(sort file38) )\
<(sort -m <(sort file39) <(sort file40) ) ) )\
<(sort -m \
<(sort -m \
<(sort -m <(sort file41) <(sort file42) )\
<(sort -m <(sort file43) <(sort file44) ) )\
<(sort -m \
<(sort -m <(sort file45) <(sort file46) )\
<(sort -m <(sort file47) <(sort
file48) ) ) ) ) )
So WTF is going on here?
Here we do an initial sort of each of the 48 files. Then we do a merge
sort of a pair of those. Then we do a merge sort of a pair of those.
Then we do a merge sort of a pair of those. Then we do a merge sort of
a pair of those. Then we do a merge sort of a pair of those. And so
on, until we only have a single input. All of this is done in parallel
when possible.
This takes 60 seconds.
But a lot of time is waiting for pipes. This can be lowered by added a
buffer after each merge (mbuffer -m 30M):
sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort
file1) <(sort file2) | mbuffer -m 30M ) <(sort -m <(sort file3) <(sort
file4) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m
<(sort file5) <(sort file6) | mbuffer -m 30M ) <(sort -m <(sort file7)
<(sort file8) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M )
<(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) | mbuffer
-m 30M ) <(sort -m <(sort file11) <(sort file12) | mbuffer -m 30M ) |
mbuffer -m 30M ) <(sort -m <(sort -m <(sort file13) <(sort file14) |
mbuffer -m 30M ) <(sort -m <(sort file15) <(sort file16) | mbuffer -m
30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort
-m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) |
mbuffer -m 30M ) <(sort -m <(sort file19) <(sort file20) | mbuffer -m
30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file21) <(sort
file22) | mbuffer -m 30M ) <(sort -m <(sort file23) <(sort file24) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m
<(sort -m <(sort -m <(sort file25) <(sort file26) | mbuffer -m 30M )
<(sort -m <(sort file27) <(sort file28) | mbuffer -m 30M ) | mbuffer
-m 30M ) <(sort -m <(sort -m <(sort file29) <(sort file30) | mbuffer
-m 30M ) <(sort -m <(sort file31) <(sort file32) | mbuffer -m 30M ) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m
30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33)
<(sort file34) | mbuffer -m 30M ) <(sort -m <(sort file35) <(sort
file36) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m
<(sort file37) <(sort file38) | mbuffer -m 30M ) <(sort -m <(sort
file39) <(sort file40) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer
-m 30M ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) |
mbuffer -m 30M ) <(sort -m <(sort file43) <(sort file44) | mbuffer -m
30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file45) <(sort
file46) | mbuffer -m 30M ) <(sort -m <(sort file47) <(sort file48) |
mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m
30M ) | mbuffer -m 30M )
This takes 24 seconds.
This is including overhead of starting multiple sorts and mbuffers,
and moving everything through pipes.
If this was in a single process with a single address space, I imagine
this would be even faster: Instead of moving data around, it should be
possible to move only pointers around.
It may also be possible to optimize the merge sort when there are only
two inputs:
i1 = read(input1)
i2 = read(input2)
while(input1 and input2) {
while(i1 <= i2) {
print i1
i1 = read(input1)
}
while(i2 <= i1) {
print i2
i2 = read(input2)
}
}
But I cheated a little: I started with 48 evenly sized files.
I think this could easily be emulated by having a single reader thread
spread blocks of full lines in a round robin style to the 48 threads
that will do the initial sorting. Something like:
while(read_block()) {
find_last_newline();
write_block_to_process(n%48);
n++;
}
If we are reading from a disk, then the reading may be slow, so if the
48 initial sorters can start sorting without having the full input,
then we will save some wall clock time.
/Ole
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- bug#40533: GNU sort could be faster on machines with more cores,
Ole Tange <=