Re: Add sort option to find

findutils-patches
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Add sort option to find

From:	Bernhard Voelker
Subject:	Re: Add sort option to find
Date:	Wed, 26 Aug 2020 10:13:28 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0
On 2020-08-19 02:48, Diego Ongaro wrote:
> Hi,
> 
> Long time user, first time GNU contributor (I think). I'm submitting patches
> to add a sort option to find. It will cause find to sort files by name within
> each directory before processing them.

Thanks for the patch.
This is the kind of material which is perfect as a start for a discussion
on a concrete matter.

FWIW:
If we decide to incorporate this patch, then this is a non-trivial change
which requires to undergo the copyright assignment process to GNU.

> As I wrote in the info page, "the -s option is more efficient than piping
> large amounts of output of find into the sort command, [...]

Why exactly is that more efficient?  Which resources are saved by that?
CPU, RAM (*), time, ...?

(*) see tests below.

> [...] and it produces output
> incrementally rather than buffering it all. It's also more convenient when
> the output of a find command isn't line-oriented or the lines don't start
> with the filenames."

Indeed, it is a convenience option for the user.

> This is not a new idea. It's been in this project's TODO file in the context
> of updatedb since the year 2000.

If something stays in a TODO for such a long time, then it's worth 
re-considering
again whether the feature is worthwhile nowadays or not.
I don't see why using the specialized 'sort' in 'updatedb' would be wrong.

> FreeBSD has included the -s flag in its man
> page for find since FreeBSD 3.1 [1], which was released in 1999. Mac OS X
> picked it up from FreeBSD.

Ha, that is a good argument: existing implementations:

- FreeBSD:
  https://www.freebsd.org/cgi/man.cgi?find(1)
- NetBSD:
  https://netbsd.gw.com/cgi-bin/man-cgi?find++NetBSD-current
- OSX (not sure if this is the "official" source):
  https://ss64.com/osx/find.html

Some systems where 'find' does not have -s:

- OpenBSD does not have -s:
  https://man.openbsd.org/find.1
- Solaris 10:
  https://docs.oracle.com/cd/E26505_01/html/816-5165/find-1.html
- Solaris 11.4:
  https://docs.oracle.com/cd/E88353_01/html/E37839/find-1.html

> Sorting has also been proposed previously on this mailing list. Phil Miller
> submitted a patch in 2014 that used `-sort` as a predicate [2][3]. After some
> discussion about sorting directories by inode, Phil's proposal appears to
> have fallen through.

As far as I remember, there have been several kind of requirements for sorting.
Mainly, there have been discussions about sorting the output.  This is a 
completely
different matter than having a sorted order of processing in each directory.

> My proposal uses `-s` as a global option instead of Phil's `-sort` predicate.
> The advantages of `-s` are (a) compatibility with BSD and Mac OS X and
> (b) allowing users to easily create a shell alias for `find -s`.

According to the manuals of the existing implementations, the behavior in
the patch seems to be identical, and therefore seemingly giving compatibility.

> One thing I didn't do is update the `oldfind` command to respect `-s`. It
> looked like that would require a non-trivial change. Is `oldfind` still used?

'oldfind' only exists as ancient reference implementation.
It is not installed anymore, but only used in the tests.

> Another thing is I didn't do is update the `updatedb.sh` script to take
> advantage of `find -s`. I wanted to see if this change would be accepted
> first. I also wasn't sure if testing for `find -s` should happen at
> build-time (like `sort -z`) or at run-time.

Fine, one step after the other.

> Please see the attached patches:
> 
>     [PATCH 1/3] Add find -s (sort) global option

[Patch discussed inline here.]

> diff --git a/find/ftsfind.c b/find/ftsfind.c
> index 783148c5..aa27666c 100644
> --- a/find/ftsfind.c
> +++ b/find/ftsfind.c
> @@ -34,6 +34,7 @@
>  #include <errno.h>
>  #include <fcntl.h>
>  #include <inttypes.h>
> +#include <string.h>
>  #include <sys/stat.h>
>  #include <unistd.h>
>
> @@ -514,6 +515,10 @@ consider_visiting (FTS *p, FTSENT *ent)
>      }
>  }
>
> +static int compare(FTSENT const **a, FTSENT const **b) {
> +  assert ((*a)->fts_parent == (*b)->fts_parent);
> +  return strcoll((*a)->fts_name, (*b)->fts_name);
> +}

strcoll may fail in some locales, see `man 3p strcoll`:

  RETURN VALUE
    [...]
    On error, strcoll() may set errno, but no return value is reserved
    to indicate an error.

and

  ERRORS
       These functions may fail if:
       EINVAL The s1 or s2 arguments contain characters outside the domain of 
the collating sequence.

I would assume this could easily be triggered with some strange file names.
So that would have to be handled, e.g. by a fallback to 'strcmp'.
Anyway, the code has to ensure that 'errno' does not clobber later/other
processing.

> @@ -547,7 +552,7 @@ find (char *arg)
>    if (options.stay_on_filesystem)
>      ftsoptions |= FTS_XDEV;
>
> -  p = fts_open (arglist, ftsoptions, NULL);
> +  p = fts_open (arglist, ftsoptions, options.sort ? compare : NULL);

It is nice that it uses existing functionality in gnulib's FTS implementation.

OTOH this effectively defeats the purpose of FTS:

FTS without sorting can process arbitrarily large hierarchies and directories.
It does this by reading a maximum amount of directory entries, and the loop
using 'fts_read' will internally read the next chunk of entries on demand.


With sorting, FTS has to read all directory entries to be able to sort them
(before entering the fts_read loop).  See:

   https://git.sv.gnu.org/cgit/gnulib.git/tree/lib/fts.c?id=5667b92dd#n1336

The difference in consumed memory is dramatic.

Example:

  # Set up a test file system which can handle 10M files.
  $ sudo mount -t tmpfs -o nr_inodes=10m tmpfs /mnt
  $ sudo chmod 4777 /mnt

  # And then testing the regular run vs. the sorted run for
  # an increasing number of files.
  {
    echo "num-files | regular run | sorted run"; \
    echo "num-files | mem    time | mem time"; \
    for num in 10000 100000 200000 500000 1000000 2000000 4000000 8000000; do \
      mkdir testdir \
        && ( cd testdir \
               && seq -w $num | xargs touch \
               \
               && num=$( numfmt --to=si "$num" ) \
               \
               && regular=$(
                    /usr/bin/time -f "%M %E" \
                      ~/findutils/find/find . -false 2>&1 \
                      | numfmt --from-unit=1024 --field=1 --to=iec-i ) \
               \
               && sorted=$(
                    /usr/bin/time -f "%M %E" \
                      ~/findutils/find/find -s . -false 2>&1 \
                      | numfmt --from-unit=1024 --field=1 --to=iec-i ) \
               && echo "$num | $regular | $sorted" \
           ); \
      rm -rf testdir; \
    done; \
  } | column -t

(The purpose of -false is to suppress the default action -print;
 this should not influence the test results much.)

The result is a linear increase of RAM needed
(and an expectedly higher elapsed time):

  num-files  |  regular  run      |  sorted  run
  num-files  |  mem      time     |  mem     time
  10K        |  4.8Mi    0:00.00  |  5.0Mi   0:00.01
  100K       |  29Mi     0:00.03  |  30Mi    0:00.12
  200K       |  29Mi     0:00.06  |  58Mi    0:00.25
  500K       |  29Mi     0:00.12  |  140Mi   0:00.64
  1.0M       |  29Mi     0:00.24  |  277Mi   0:01.31
  2.0M       |  29Mi     0:00.47  |  552Mi   0:02.61
  4.0M       |  29Mi     0:00.95  |  1.1Gi   0:05.39
  8.0M       |  29Mi     0:01.85  |  2.2Gi   0:11.83

Now comparing the last run with (the 8 million files) with the
good old 'find | sort' combination:

  $ { /usr/bin/time -f "%M %E (find)" ~/findutils/find/find \
        | /usr/bin/time -f "%M %E (sort)" sort -o /dev/null; } 2>&1 \
      | numfmt --from-unit=1024 --field=1 --to=iec-i
  29Mi 0:03.05 (find)
  14Mi 0:03.61 (sort)

This means that -s is slower and eats a huge amount of memory for
big directories.


More on using the 'compar' function parameter of 'fts_open':
this gives control into the hands of gnulib.  While this is not bad
per se, it limits future, possible extensions of sorting.

With this implementation, one would have to stick to the limited
functionality of sorting the processing of the current directory
by file name.  I mean I can already hear users requesting to sort
by mtime, by size, by permission bits, etc. - "like 'ls' does".

And here's the crux: 'ls' does not use FTS, but implements the
readdir() loop and the sorting itself.
https://git.sv.gnu.org/cgit/coreutils.git/tree/src/ls.c?id=5b8161ee4dc9#n3742

Furthermore, as the FTS-based sorting is 'per-directory', it does not
fulfill other requests which wish to have sorting of the overall tree
or output.

>     [PATCH 2/3] find: Update docs for -s (sort)
>     [PATCH 3/3] find: Add test for sort

Thanks for those 2 patches as well.  They document and demonstrate the use
of the new feature.


My summary:

* Pro's:

  - It is a convenience option for the user.

  - It is a small and smart implementation using existing "fts_open(..., 
compar)".
    The estimated maintenance burden is therefore rather small.

  - It fulfills a long-standing feature request.

  - Other implementations have it.

* Neutral / not yet clarified:

  - Testing: possible side effects, or better said, influence to or from
    other options are not yet clear.
    Example: is there any relation to or from -depth?  Or to other use
    of the LC_* variables?

  - Compatibility to other implementations:
    FreeBSD is using glibc's FTS; does that do the sorting the same way
    as gnulib's FTS module does?

  - Is the original use-case for 'updatedb' (still) valid?
    From the point of view of "the UNIX toolbox", there's nothing wrong
    with 'updatedb' invoking 'sort'.  'sort' is the right tool and optimized
    for all sorts of sorting, e.g. regardless of the size of the input.

* Con's:

  - It fulfills a long-standing feature request, but in a certain way,
    i.e., other size,mtime,... driven sorting will not be possible
    or contradict the current 'per-directory' sorting.
    Users might expect a different kind of sorting.

  - strcoll does not produce stable results among different platforms;
    users may complain that the order on platform X is different
    than that on platform Y.  We cannot do anything about it.

  - The elapsed time for 'find -s' is by a small factor larger than
    that of the classic 'find | sort' method.

  - The surprisingly huge memory footprint.

I don't claim the above, and especially those 3 lists (Pro, to be clarified,
and Con) to be complete, though.

With all the above, I'm currently 70:30 against adding the feature.

Have a nice day,
Berny
[Prev in Thread]
Current Thread
[Next in Thread]
Add sort option to find, Diego Ongaro, 2020/08/18
- [PATCH 1/3] Add find -s (sort) global option, Diego Ongaro, 2020/08/18
- [PATCH 2/3] find: Update docs for -s (sort), Diego Ongaro, 2020/08/18
- [PATCH 3/3] find: Add test for sort, Diego Ongaro, 2020/08/18
- Re: Add sort option to find, Dale R. Worley, 2020/08/18
  - Re: Add sort option to find, Diego Ongaro, 2020/08/19
- Re: Add sort option to find, Bernhard Voelker <=
Prev by Date: Re: Add sort option to find
Previous by thread: Re: Add sort option to find
Index(es):
- Date
- Thread