From: glen lenker
Subject: Re: [PATCH] sort: Add --threads option, which parallelizes internal sort.
Date: Fri, 19 Jun 2009 04:03:10 -0700

On Fri, Jun 12, 2009 at 12:54 AM, Ralf Wildenhues <address@hidden> wrote:

> * Jim Meyering wrote on Thu, May 28, 2009 at 09:33:21PM CEST:
> > Glen Lenker wrote:
> > > On Thu, Mar 26, 2009 at 09:50:08PM +0000, Ralf Wildenhues wrote:
> > >>
> > >> Example run, on an 8-way, and with cat'ed instances of the dictionary,
> > >> on tmpfs, timings best of three:
> > >
> > > Hey Ralf, did you happen to specify the amount of RAM sort should
> > > use?  Not specifying enough RAM would break what would otherwise be
> > > a single internal sort into multiple internal sort passes and then
> > > an external sort.  As it is, the external sort is still sequential.
>
> No, I did not specify the amount of RAM.  The system I tested on has
> plenty of RAM, way more than is needed for the sort.  Specifying
> something like 2G of RAM does not make any visible difference.
>

When testing on gcc16 (16G of RAM), sort chooses to use only 1G of
RAM for a 256M file, and ends up hitting the external merge code.
Through experimentation with gprof I found that specifying anything
less than 2G causes sort to do an external sort on gcc16.


>
> > > I ran these tests on a 256MB instance of the dictionary in tmpfs, on a
> > > 8-core machine specifying 2G of RAM.
> > >
> > > runtime [s]       threads
> > > file size [MB]    1          2          4          8
> > > 256               161.219    87.357     52.53      36.429
>
> > > 160.22user 1.34system 2:41.61elapsed 99%CPU
> > > 159.83user 1.45system 1:27.12elapsed 185%CPU
> > > 159.84user 1.56system 0:52.26elapsed 308%CPU
> > > 160.67user 1.53system 0:36.26elapsed 447%CPU
> > >
> > > This seems to be what I would expect from a good implementation.
>
> Yes, 56% efficiency going from 1 to 8 threads sounds like a much better
> number, also your system overhead looks very sane compared to what I
> saw.  Seems like it's a system-specific issue after all.  Which Linux
> kernel version and which pthread (glibc) version are you using?
>

This was on gcc16.
Linux gcc16 2.6.18-6-vserver-amd64 #1 SMP Thu Dec 25 21:35:11 UTC 2008
x86_64 GNU/Linux
glibc 2.3.6.ds1-13etch9


>
> > >> It suggests to me that too much time is spent busy-waiting in
> pthread_join,
> > >> or that sort is computing too much (I haven't looked at the patch in
> detail).
> > >>
> > >> Also, I'd have expected the rate going from 1 to 2 threads to get at
> least
> > >> a bit better with bigger file size, but it remains remarkably
> constant,
> > >> around 1.45 for this setup.  What am I missing?
>
> Glen, could you look at this, too?  I mean just timings of 1 vs 2
> threads for different file sizes?  Thanks.
>

The data was again cat'ed instances of the dictionary, put onto tmpfs. I
made sure that each time it had enough memory.

1st run
file size    1 thread [s]    2 threads [s]    speedup
16M               7.27             4.04         ~1.80
32M              16.15             9.19         ~1.75
64M              35.06            19.61         ~1.78
128M             75.47            40.99         ~1.84
256M            162.30            87.27         ~1.86

2nd run
file size    1 thread [s]    2 threads [s]    speedup
16M               7.27             4.05         ~1.80
32M              16.15             8.86         ~1.82
64M              35.07            19.18         ~1.83
128M             75.22            41.28         ~1.82
256M            161.76            87.77         ~1.84
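
For what it's worth, here is how I read those numbers, as a small
standalone C snippet (not part of the patch; applying Amdahl's law
here is my own assumption, and the hard-coded times are just the 256M
row from the 2nd run above):

  #include <stdio.h>

  /* Toy calculation, independent of sort itself: speedup, parallel
     efficiency, and the serial fraction implied by Amdahl's law for
     the 256M / 2-thread timings above.  */
  int
  main (void)
  {
    double t1 = 161.76;       /* 1 thread, 256M, 2nd run [s] */
    double tp = 87.77;        /* 2 threads, 256M, 2nd run [s] */
    int p = 2;                /* number of threads */

    double speedup = t1 / tp;                  /* ~1.84 */
    double efficiency = speedup / p;           /* ~92% */
    /* Amdahl: S = 1 / (f + (1 - f) / p)  =>  f = (p / S - 1) / (p - 1) */
    double serial_fraction = ((double) p / speedup - 1) / (p - 1);

    printf ("speedup:         %.2f\n", speedup);
    printf ("efficiency:      %.0f%%\n", efficiency * 100);
    printf ("serial fraction: %.1f%%\n", serial_fraction * 100);
    return 0;
  }

Under that (admittedly crude) model, the measured ~1.84 speedup on two
threads corresponds to a serial fraction of roughly 8-9%, which would
be consistent with the read-in and final merging stages running on a
single thread.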


> > I ran some tests (see below), and got results that look
> > similar to Ralf's.  I used an otherwise-idle 16-core Nehalem-based
> > system with 8GB of RAM.
>
> Which kernel and glibc?
>

> > However, more importantly, while the 16- and 8-thread tests were
> > running, I sampled thread counts using ps and was surprised to see
> > the number of active threads was usually far less than the maximum.
>
> Well yes, I suppose you are seeing the serial overhead from reading in
> the data set by the first process alone, and from the later merging
> stages where only fewer threads are active.
>
> >   T=16             T is number of threads
> >   12.96       <--- This is elapsed wall-clock time in seconds.
> >   16.43
> >   13.76
> >   16.60
>
> That varies a lot.  Are there other jobs running on this system?
> Do you, Jim, also see high system time overhead?  That would support
> the hypothesis of a less efficient thread implementation (or kernel
> issue) being the cause.
>
> Off-list, Glen wrote:
> > I don't know if this is related, because up until now memory
> > consumption hasn't been mentioned wrt this patch, but this patch
> > bumps the amount of memory consumed from 1.5x to 2x. Should we
> > consider reducing this?
>
> I don't think fighting over that 0.5 x size is worth any hassles,
> specifically as that x is just the number of lines times the size
> of a struct line, not the size of the data in one line (right?).
>
>
I was just concerned that the extra 0.5 x #lines x sizeof(struct line)
would push the amount of memory needed over the default amount of
memory allocated by sort, ultimately forcing an external merge where
one would not otherwise be necessary. I suppose that this shouldn't
matter as much in the future as it does now.
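
To make the concern concrete, here is a back-of-the-envelope sketch
(standalone, not code from the patch; the ~10-byte average line length
for cat'ed dictionary data and the 32-byte sizeof (struct line) on
x86_64 are my assumptions, not measured values):

  #include <stdio.h>

  /* Back-of-the-envelope only; the per-line struct size and average
     line length are assumptions, not values taken from sort.c.  */
  int
  main (void)
  {
    double file_bytes = 256.0 * 1024 * 1024;  /* 256M of cat'ed dictionary */
    double avg_line_len = 10.0;               /* assumed average line length */
    double line_struct = 32.0;                /* assumed sizeof (struct line) */
    double nlines = file_bytes / avg_line_len;
    double gib = 1024.0 * 1024 * 1024;

    /* Text buffer plus per-line bookkeeping, with the old 1.5x and the
       patch's 2x reservation for line structs.  */
    double with_1_5x = file_bytes + 1.5 * nlines * line_struct;
    double with_2_0x = file_bytes + 2.0 * nlines * line_struct;

    printf ("lines:       ~%.0f million\n", nlines / 1e6);
    printf ("1.5x factor:  %.2f GiB\n", with_1_5x / gib);
    printf ("2.0x factor:  %.2f GiB\n", with_2_0x / gib);
    return 0;
  }

With those assumptions, bumping the reservation factor from 1.5x to 2x
moves the estimate for a 256M input from roughly 1.45 GiB to roughly
1.85 GiB, which lines up with the 2G I needed on gcc16 to keep the
sort internal.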

On this subject, what exactly is the motivation of sort's
default_sort_size()?
Shouldn't it try to do an internal sort if possible?

Glen

