bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new coreutil? shuffle - randomize file contents


From: Philip Rowlands
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Thu, 2 Jun 2005 23:20:07 +0100 (BST)

On Thu, 2 Jun 2005, Frederik Eaton wrote:

>Phil> Is it that the app must guarantee all lines of a
>Phil> non-seekable stdin must have an equal chance of any sort order?
>
>See my comment to James above. I think one need not make this
>guarantee, since only a tiny fraction of possible sort orders will be
>able to be tried by the user. However, I think it would be true for my
>proposal (and the one that James suggested to Davis) if one took the
>random number generator or hash function to be variable or unspecified
>during analysis of the algorithm.

Thinking further on this, I don't think it matters to the guts of sort
whether the ultimate random key is based on position hash or PRNG.

>One possibility for an efficient random permutation of a large file is
>a program which scans the file once and records the location of each
>line in an index, then applies a uniformly distributed random
>permutation to this line index by stepping through it in order and
>swapping the current entry with an entry chosen at random from the set
>of later entries, and finally steps through the index again and
>dereferences each line entry and prints out the corresponding line in
>the original file.

That approach is fine on seekable files, but the user may wish to
shuffle from stdin. sort already knows how to do this.

Here's a concrete example of Paul's suggestion as I understand it:
---INPUT---
a 3
b 2
c 1
d 1
---INPUT---

sort -R (or -k R) creates an imaginary field and sorts only on that:
---IMAGINARY INPUT---
a 3 0x293a
b 2 0xc9f4
c 1 0xa932
d 1 0x5ef5
---IMAGINARY INPUT---

---SORTED IMAGINARY OUTPUT---
a 3 0x293a
d 1 0x5ef5
c 1 0xa932
b 2 0xc9f4
---SORTED IMAGINARY OUTPUT---

---ACTUAL OUTPUT---
a 3
d 1
c 1
b 2
---ACTUAL OUTPUT---


Of course the random key should be long enough to ensure a negligible
chance of repetition (128 bits?)

The user is free to make use of the random key and other keys to create
random subsets of an overall-sorted list.

I'm still interested to read what Paul considers to be the
difficulties of such an implementation?


Cheers,
Phil




reply via email to

[Prev in Thread] Current Thread [Next in Thread]