Re: merge sort temporary files
From: Jonathan Baker
Subject: Re: merge sort temporary files
Date: Fri, 14 May 2004 00:39:10 -0700
User-agent: Mutt/1.4.1i
Great! Look forward to seeing this in the distribution. Thanks,
--Jonathan
On Fri, May 14, 2004 at 12:01:15AM -0700, Paul Eggert wrote:
> Instead of adding a new option, I'd rather change 'sort' to cater to
> your (relatively common) case than to (relatively contrived) cases
> like `cat F | sort -m -o F - G', where people should know that
> they're getting into trouble anyway.
>
> Here's a proposed patch to solve your problem that way instead.
>
> 2004-05-13 Paul Eggert <address@hidden>
>
> Improve performance of `sort -m' on large files, at the cost of
> making some contrived examples unsafe. POSIX allows this
> optimization. Performance problem reported by Jonathan Baker in
> <http://mail.gnu.org/archive/html/bug-coreutils/2004-05/msg00071.html>.
>
> * src/sort.c (first_same_file): Do not treat input pipes
> differently from other files.
> * doc/coreutils.texi (sort invocation): Document that "sort -m -o F"
> might write F before reading all the input.
> * NEWS: Likewise.
>
> Index: NEWS
> ===================================================================
> RCS file: /home/meyering/coreutils/cu/NEWS,v
> retrieving revision 1.206
> diff -p -u -r1.206 NEWS
> --- NEWS 11 May 2004 16:48:42 -0000 1.206
> +++ NEWS 14 May 2004 06:35:30 -0000
> @@ -20,6 +20,12 @@ GNU coreutils NEWS
>
> ** New features
>
> + For efficiency, `sort -m' no longer copies input to a temporary file
> + merely because the input happens to come from a pipe. As a result,
> + some relatively-contrived examples like `cat F | sort -m -o F - G'
> + are no longer safe, as `sort' might start writing F before `cat' is
> + done reading it. This problem cannot occur unless `-m' is used.
> +
> pwd now works even when run from a working directory whose name
> is longer than PATH_MAX.
>
> Index: doc/coreutils.texi
> ===================================================================
> RCS file: /home/meyering/coreutils/cu/doc/coreutils.texi,v
> retrieving revision 1.180
> diff -p -u -r1.180 coreutils.texi
> --- doc/coreutils.texi 9 May 2004 19:42:19 -0000 1.180
> +++ doc/coreutils.texi 14 May 2004 06:32:53 -0000
> @@ -3265,9 +3265,13 @@ starting with 1. So to sort on the seco
> @opindex --output
> @cindex overwriting of input, allowed
> Write output to @var{output-file} instead of standard output.
> -If necessary, @command{sort} reads input before opening
> +Normally, @command{sort} reads all input before opening
> @var{output-file}, so you can safely sort a file in place by using
> commands like @code{sort -o F F} and @code{cat F | sort -o F}.
> +However, @command{sort} with @option{--merge} (@option{-m}) can open
> +the output file before reading all input, so a command like @code{cat
> +F | sort -m -o F - G} is not safe as @command{sort} might start
> +writing @file{F} before @command{cat} is done reading it.
>
> @vindex POSIXLY_CORRECT
> On newer systems, @option{-o} cannot appear after an input file if
> Index: src/sort.c
> ===================================================================
> RCS file: /home/meyering/coreutils/cu/src/sort.c,v
> retrieving revision 1.284
> diff -p -u -r1.284 sort.c
> --- src/sort.c 26 Apr 2004 15:37:33 -0000 1.284
> +++ src/sort.c 14 May 2004 05:45:52 -0000
> @@ -1878,9 +1878,7 @@ sortlines_temp (struct line *lines, size
> }
>
> /* Return the index of the first of NFILES FILES that is the same file
> - as OUTFILE. If none can be the same, return NFILES. Consider an
> - input pipe to be the same as OUTFILE, since the pipe might be the
> - output of a command like "cat OUTFILE". */
> + as OUTFILE. If none can be the same, return NFILES. */
>
> static int
> first_same_file (char * const *files, int nfiles, char const *outfile)
> @@ -1910,7 +1908,7 @@ first_same_file (char * const *files, in
> ? fstat (STDIN_FILENO, &instat)
> : stat (files[i], &instat))
> == 0)
> - && (S_ISFIFO (instat.st_mode) || SAME_INODE (instat, outstat)))
> + && SAME_INODE (instat, outstat))
> return i;
> }
>
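
For anyone reading the src/sort.c hunk without the rest of the file at hand, here is a minimal standalone sketch of what the check looks like once the S_ISFIFO test is gone. It is not the coreutils source: the names same_inode and first_same_file_sketch are hypothetical, the inode comparison is a plain reimplementation of what SAME_INODE is assumed to do (compare st_ino and st_dev), and returning NFILES when OUTFILE cannot be stat'ed is an assumption of this sketch.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for the SAME_INODE macro: two stat results
   name the same file when both device and inode numbers match.  */
static bool
same_inode (const struct stat *a, const struct stat *b)
{
  return a->st_ino == b->st_ino && a->st_dev == b->st_dev;
}

/* Sketch of the post-patch logic: return the index of the first of
   NFILES FILES that is the same file as OUTFILE, or NFILES if none
   is.  "-" stands for standard input.  An input pipe is no longer
   treated as a possible alias of OUTFILE; only an inode match counts.  */
static int
first_same_file_sketch (char *const *files, int nfiles, const char *outfile)
{
  struct stat outstat;

  /* Assumption of this sketch: if OUTFILE cannot be stat'ed (for
     example, it does not exist yet), no input can be the same file.  */
  if (stat (outfile, &outstat) != 0)
    return nfiles;

  for (int i = 0; i < nfiles; i++)
    {
      struct stat instat;
      int r = (strcmp (files[i], "-") == 0
               ? fstat (STDIN_FILENO, &instat)
               : stat (files[i], &instat));

      if (r == 0 && same_inode (&instat, &outstat))
        return i;
    }

  return nfiles;
}
```

With this version, `sort -m -o F F G` still sees F at index 0 and can protect it, while `cat F | sort -m -o F - G` gets NFILES back because the pipe on standard input has a different inode from F. That is exactly the case the NEWS entry warns is no longer safe.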