bug-coreutils
Re: new coreutil? shuffle - randomize file contents


From: Jim Meyering
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Thu, 02 Jun 2005 12:31:47 +0200

Frederik Eaton <address@hidden> wrote:
> So, what is the current state of things? Who is in charge of accepting
> patches? Are we decided that a 'shuffle' command but no 'sort -R'
> facility would be best, or that it would be good to have both, or is
> it still in question whether either would be accepted?

I am the official `maintainer', but Paul Eggert has been making
most of the changes recently.

It looks like there are some desirable features that can be
provided only by a shuffle-enabled program that is key-aware.
Key specification and the comparison code are already part of sort.
Obviously, duplicating all of that in a separate program is not
an option.  I don't relish the idea of factoring out sort's line-
and key-handling code either, but it might be feasible.

However, I do like the idea of a new program that simply outputs
a random permutation of its input records, and that does it well,
and repeatably.  The Unix tool philosophy certainly does encourage
the `perform one task and do it well' approach.  Since doing it
well includes handling input larger than available virtual memory,
this is not trivial -- and it is well suited to the coreutils,
i.e., it's not easily implementable as a script.
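
For input that does fit in memory, the core of such a program can be
sketched briefly. The following is a minimal illustration in Python, not
a proposed implementation: the function name shuffle_records and the
seed parameter are hypothetical, and the seed is just one assumed way to
get the repeatability mentioned above.

```python
import random


def shuffle_records(records, seed=None):
    """Return a new list holding a random permutation of `records`.

    `seed` is a hypothetical knob for repeatability: the same seed and
    the same input give the same output order.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    out = list(records)        # copy, so the caller's sequence is untouched
    # Fisher-Yates: walk from the last slot down, swapping each slot with
    # a uniformly chosen slot at or before it.
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive at both ends
        out[i], out[j] = out[j], out[i]
    return out
```

This covers only the easy case. It holds every record in memory at once,
which is exactly the limitation that makes the general problem
non-trivial.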

Initially, I was inclined to say that adding both the new program
(no key support) and related functionality to sort was desirable.
Thinking of the limits of robustness of such a new program, I
realized that if the input is sufficiently large and not seekable
(e.g., from a pipe), then the program will have to resort to writing
temporary files, much as sort already does.  That means more duplicated
effort: determining how much memory to use (like sort's --buffer-size=SIZE
option), managing the temporary files, ensuring that they're removed
upon interrupt, etc.  But maybe not prohibitive.  The new program
would also need options like sort's -z (--zero-terminated),
--temporary-directory=DIR, and --output=FILE.  In effect,
it would need all of sort's options that don't relate to sorting.
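
To make the temporary-file concern concrete, here is a rough Python
sketch of one possible approach for non-seekable input: scatter records
at random into a fixed number of temporary files, then shuffle each file
in memory and write the results out in turn. This is an assumption about
how it could be done, not how sort.c works. The names external_shuffle
and read_records and the parameters buckets, delimiter and tmpdir are
hypothetical; delimiter and tmpdir loosely stand in for -z and
--temporary-directory=DIR. It assumes a single-byte delimiter and that
each bucket, though not the whole input, fits in memory.

```python
import random
import tempfile


def read_records(stream, delimiter=b"\n", chunk_size=65536):
    """Yield delimiter-terminated records from a binary, possibly unseekable stream."""
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last delimiter is complete; keep the tail.
        *complete, pending = pending.split(delimiter)
        yield from complete
    if pending:
        yield pending  # final record that lacked a terminator


def external_shuffle(in_stream, out_stream, seed=None, buckets=16,
                     delimiter=b"\n", tmpdir=None):
    """Write a random permutation of the input records to out_stream."""
    rng = random.Random(seed)
    # TemporaryFile objects are deleted when closed; the `finally` clause
    # below closes them even if the run is interrupted by an exception.
    files = [tempfile.TemporaryFile(dir=tmpdir) for _ in range(buckets)]
    try:
        # Pass 1: send each record to a uniformly chosen bucket file.
        for record in read_records(in_stream, delimiter):
            rng.choice(files).write(record + delimiter)
        # Pass 2: load one bucket at a time, shuffle it, and emit it.
        for f in files:
            f.seek(0)
            chunk = f.read().split(delimiter)[:-1]  # drop empty tail after last delimiter
            rng.shuffle(chunk)
            for record in chunk:
                out_stream.write(record + delimiter)
    finally:
        for f in files:
            f.close()
```

Even this toy version needs a memory-related knob, a temporary
directory, a record delimiter, and cleanup logic, which is the overlap
with sort's non-sorting options described above.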

So implementing a robust shuffle program, even one without key
handling capabilities, would require much of the infrastructure
already present in sort.c.

It sure sounds like shuffle and sort should share a lot of code,
one way or another, so why not have them share the line- and key-
handling code, too?  I won't rule out adding a new program, like
shuffle, but I confess I'm less inclined now than when I started
typing this message.



