Re: sort --random-sort

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: sort --random-sort

From:	Frederik Eaton
Subject:	Re: sort --random-sort
Date:	Sat, 3 Dec 2005 16:16:09 +0000
User-agent:	mutt-ng/devel-r472 (Debian)

Paul,

Thank you for submitting a patch of your own.

> > I still think what I sent is ready to apply.
> 
> I took a look at the patch quoted in that email and have some comments
> and suggestions.
> 
> It's more conservative to minimize the changes to the ISAAC code.  The
> simplest thing to do for now is to put it into a rand-isaac.c file,
> which is included both by shred.c and by sort.c.  This results in a
> trivial change to shred.c (remove a block of code and replace it by an
> #include of a copy of that code) and greater confidence that we
> haven't messed up the random-number generator somehow.  I didn't quite
> follow all the ISAAC_WORDS stuff, but since the patch you sent uses
> the same value for both sort and shred, we might as well leave it a
> constant, to minimize the number of changes to the code.

Jim agreed that the way I have chosen to share the ISAAC code would be
suitable. I have tested it and it produces the same random values as
before, when initialized with the same number of words and the same
seed.

> Eventually, once we factor in /dev/random / /dev/urandom stuff, this
> can be revisited, but in the meantime it's better to minimize changes
> to the code.

I think my changes were small. I tried the #include method but it
generated a bunch of preprocessor errors which I wasn't able to fix. 
It appears that you have managed to eliminate them. I still think that
my method is cleaner since it relies on linking rather than direct
inclusion, but if you are indeed going to remove the ISAAC code then I
suppose it doesn't matter. However, what do you plan to replace ISAAC
with in sort?

> As a documentation issue, I'd feel more comfortable calling this a
> "hashed" sort rather than a "random" sort.  Using the word "random"
> has the wrong connotations, since the sort doesn't generate a random
> permutation of the input.  So I would prefer option names like
> --hash-sort and -h.

I assume you are referring to the same-keys-sort-together issue, and
not the randomness-vs-pseudorandomness issue.

It is true that we are sorting via a hash, but that is an
implementation detail. The effect is to create a random total ordering
on the keys, and sort via this ordering. Because it is a total
ordering, keys which are the same must be equal. Yet calling the
option --random-total-order-sort would be redundant because all
sorting in the 'sort' command is done with respect to a total order. 
(even if the equivalence classes are not singletons, as with "3" ==
"3.0" for numeric sorting, it is still a total order) You can look up
"total order" on Wikipedia, for instance.

Also, as we've discussed, and as you've argued, an alternative option
which caused same-keys-sort-apart behavior probably doesn't belong in
the sort command - it wouldn't be meaningful to specify a "key" with
it, and it wouldn't interact with any of the other sorting methods. So
there is no potential for confusing the --random-sort option with a
possible future variant.

So I would like to leave the name as it is. I hope my reasoning is
clear to you.

> When two different keys hash to the same value, the comparison should
> fall back on byte-by-byte comparison of the keys; otherwise the user
> can't rely on identical keys sorting next to each other, which is one
> of the desired properties here.

Can you give an example of two such keys? In any case, I've changed my
local version to do this.

> I don't see why get_hash needs to extract a subset of data from the mm
> buffer.  Why not just compare the contents of the entire buffer?  Is
> there something nonrandom about that?  If so, perhaps we should go the
> whole hog and use the "correct" way to get at it.  (*)

What I've implemented is fast and it works. I'm sure there are other
ways of doing it, but looking at the ISAAC code, I think that what
I've done uses suitably random numbers.

> The usual style is to use NULL rather than 0 for the null pointer.

Fixed locally.

> Here's a proposed patch embodying the above suggestions.  I'm still a
> bit puzzled about the part marked (*) above, though.

Here are the results for shuffling /usr/share/dict/words:

your version:
./inst4/bin/sort -h < /usr/share/dict/words > /dev/null  14.04s user 0.01s 
system 99% cpu 14.168 total

my version:
./inst1/bin/sort -R < /usr/share/dict/words > /dev/null  1.23s user 0.01s 
system 99% cpu 1.243 total

My version is over 11 times faster on this example. (Also, doing a
normal sort takes about 0.7 seconds, so you can see where the lower
limit is)

However, if you fix the speed problem - which should be easy - and
change the option names back to that of my original proposal, and
change my email address in the ChangeLog back to address@hidden
(which is more permanent), then I would be happy with the patch that
you have submitted.

Frederik

[Prev in Thread]

Current Thread

[Next in Thread]

Re: sort --random-sort, Andreas Schwab, 2005/12/01
- Re: sort --random-sort, Frederik Eaton, 2005/12/02
  - Re: sort --random-sort, Paul Eggert, 2005/12/03
    - Re: sort --random-sort, Frederik Eaton <=
    - Re: sort --random-sort, Frederik Eaton, 2005/12/03
    - Re: sort --random-sort, Paul Eggert, 2005/12/10
    - Re: sort --random-sort, Jim Meyering, 2005/12/10
    - Re: sort --random-sort, Jim Meyering, 2005/12/03
- Re: sort --random-sort, Eric Blake, 2005/12/10
  - Re: sort --random-sort, Jim Meyering, 2005/12/10

Prev by Date: Re: sort --random-sort
Next by Date: Feature request: add option to show directories first
Previous by thread: Re: sort --random-sort
Next by thread: Re: sort --random-sort
Index(es):
- Date
- Thread