Re: new coreutil? shuffle - randomize file contents
From: Jim Meyering
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Thu, 02 Jun 2005 12:31:47 +0200
Frederik Eaton <address@hidden> wrote:
> So, what is the current state of things? Who is in charge of accepting
> patches? Are we decided that a 'shuffle' command but no 'sort -R'
> facility would be best, or that it would be good to have both, or is
> it still in question whether either would be accepted?
I am the official `maintainer', but Paul Eggert has been making
most of the changes recently.
It looks like there are some desirable features that can be
provided only by a shuffle-enabled program that is key-aware.
Key specification and the comparison code are already part of sort.
Obviously, duplicating all of that in a separate program is not
an option. I don't relish the idea of factoring out sort's line-
and key-handling code either, but it might be feasible.
However, I do like the idea of a new program that simply outputs
a random permutation of its input records, and that does it well,
and repeatably. The Unix tool philosophy certainly does encourage
the `perform one task and do it well' approach. Since doing it
well includes handling input larger than available virtual memory,
this is not trivial -- and it is well suited to the coreutils,
i.e., it's not easily implementable as a script.
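To make "a random permutation ... repeatably" concrete, here is a minimal in-memory sketch, assuming the whole input fits in RAM (so it sidesteps the hard part described above). The name shuffle_records and its seed parameter are hypothetical, chosen only for illustration; this is not code from coreutils. It uses a Fisher-Yates shuffle driven by a seeded generator, so the same seed always produces the same permutation.

```python
import random

def shuffle_records(records, seed):
    """Return a new list holding a random permutation of `records`.

    Hypothetical helper: the same `seed` always yields the same order.
    """
    rng = random.Random(seed)      # private generator, so the run is repeatable
    out = list(records)
    # Fisher-Yates: swap each position with a random position at or before it.
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)      # randint includes both endpoints
        out[i], out[j] = out[j], out[i]
    return out
```

Calling shuffle_records(lines, seed=42) twice on the same input gives identical output, which is the repeatability property; a different seed gives a different, equally valid permutation.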
Initially, I was inclined to say that adding both the new program
(no key support) and related functionality to sort was desirable.
Thinking of the limits of robustness of such a new program, I
realized that if the input is sufficiently large and not seekable
(e.g., from a pipe), then the program will have to resort to writing
temporary files, much as sort already does. That means more duplicated
effort: determining how much memory to use (like sort's
--buffer-size=SIZE option), managing the temporary files, ensuring
that they're removed upon interrupt, etc. But maybe not prohibitive.
The new program would also need counterparts to sort's -z
(--zero-terminated), --temporary-directory=DIR, and --output=FILE
options. In effect, it would need all of sort's options that don't
relate to sorting.
So implementing a robust shuffle program, even one without key
handling capabilities, would require much of the infrastructure
already present in sort.c.
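One possible shape for the temporary-file approach, sketched under stated assumptions: scatter the records at random into a fixed number of temporary files, then shuffle each file in memory and emit the files one after another. This is only an illustration, not a design proposed in this thread; the name external_shuffle and the buckets parameter are hypothetical, every record is assumed to end with a newline, and each bucket is assumed to fit in memory. A real tool would also have to choose the bucket count from the available memory and remove its files on interrupt, which is exactly the infrastructure sort.c already has.

```python
import random
import tempfile

def external_shuffle(lines, seed, buckets=8):
    """Yield the records of `lines` in random order, spilling to temp files.

    Hypothetical sketch: `lines` is any iterable of newline-terminated strings.
    """
    rng = random.Random(seed)
    # TemporaryFile objects are deleted automatically when they are closed.
    temps = [tempfile.TemporaryFile(mode="w+", newline="")
             for _ in range(buckets)]
    try:
        # Pass 1: scatter each record into a randomly chosen bucket file.
        for line in lines:
            temps[rng.randrange(buckets)].write(line)
        # Pass 2: load one bucket at a time, shuffle it, and emit it.
        for tmp in temps:
            tmp.seek(0)
            chunk = tmp.readlines()
            rng.shuffle(chunk)
            yield from chunk
    finally:
        # Runs on normal exit and on errors; a signal handler would still
        # be needed to cover interrupts the way sort does.
        for tmp in temps:
            tmp.close()
```

Only one bucket is held in memory at a time, so peak memory use is roughly the input size divided by the number of buckets, at the cost of writing and rereading the whole input once.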
It sure sounds like shuffle and sort should share a lot of code,
one way or another, so why not have them share the line- and key-
handling code, too? I won't rule out adding a new program, like
shuffle, but I confess I'm less inclined now than when I started
typing this message.