Re: Add sort option to find
From: Diego Ongaro
Subject: Re: Add sort option to find
Date: Wed, 2 Sep 2020 21:44:34 -0700
On Wed, Sep 2, 2020 at 7:22 PM Dale R. Worley <worley@alum.mit.edu> wrote:
>
> Bernhard Voelker <mail@bernhard-voelker.de> writes:
> >> As I wrote in the info page, "the -s option is more efficient than piping
> >> large amounts of output of find into the sort command, [...]
> >
> > Why exactly is that more efficient? Which resources are saved by that?
> > CPU, RAM (*), time, ...?
>
> "find -s" requires find to store, at any one time, the sorted entries of
> all of the directories on the path above the current directory. But
> "find | sort" requires the entire output of find to be stored until find
> completes. In a sort-of-typical-asymptotic way the second is
> exponentially larger than the first. If you made a tree of directories
> containing two entries, 20 layers deep, there would be a million entries
> to be sorted, but find -s would only need to keep track of 40 at any one
> time.
Thanks for the thorough and thoughtful feedback, Berny, and for your useful
example, Dale. I meant algorithmic efficiency in my original claim, but I see
how my wording was ambiguous. Intuitively, sorting names in each directory has
two algorithmic wins: (1) each file must be compared with only its siblings
instead of the entire directory tree, and (2) only names are compared instead
of full paths.
Berny's shell script example is exactly the worst-case scenario. If you put all
your files in one directory, `find -s` will invoke `qsort()` only once, so
that's O(N log N), just like `find | sort`.
Dale's example was a much better scenario for `find -s`. In general, consider
the best case of a perfectly balanced K-ary tree consisting of N total files
and directories. `find -s` needs O(K log K) comparisons per directory, and there
are (N - 1) / K directories. It might be cheating, but if we assume K is
constant, that comes out to O(N) comparisons total. So in this best case
filesystem layout, sorting each directory instead of sorting the output is a
significant algorithmic win.
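To put rough numbers on that, here's a quick back-of-the-envelope script (a
sketch: the log2-based counts are idealized sort estimates, not measurements of
find itself):

```python
import math

def est_comparisons(n, k):
    """Idealized comparison counts for a balanced K-ary tree with
    n total entries (files + directories) and branching factor k."""
    # find -s: roughly (n - 1) / k directories, each sorting its k
    # children at about k * log2(k) comparisons per qsort() call.
    per_dir = ((n - 1) / k) * (k * math.log2(k))
    # find | sort: one global sort over all n paths.
    global_sort = n * math.log2(n)
    return per_dir, global_sort

for k in (2, 10, 100):
    per_dir, global_sort = est_comparisons(10**6, k)
    print(f"N=1M K={k:>3}: per-dir ~{per_dir:>12,.0f}  "
          f"global ~{global_sort:>12,.0f}  "
          f"ratio {global_sort / per_dir:.1f}x")
```

Note the per-directory total simplifies to (N - 1) * log2(K), i.e. O(N log K),
which is O(N) once K is held constant -- that's where the O(N) claim above
comes from.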
Most real users don't have the patience to perfectly balance their filesystems.
I ran some stats on my own machine to get a better sense of one real user's
branching factor.
Here are some basic counts:

            files     dirs       both
  root    252,668   22,973    275,641
  home    669,986  116,673    786,659
  both    922,654  139,646  1,062,300
And for each of the directories, how many entries they have:

              root      home
  mean       12.96      6.78
  std        99.47     76.36
  min         0.00      0.00
  25%         1.00      1.00
  50%         2.00      2.00
  75%         6.00      5.00
  90%        18.00     11.00
  99%       175.28     69.00
  99.9%     994.00    404.00
  99.99%   3060.29   1825.95
  max      9184.00  21561.00
For those of you following along, this seems to be a cheap and decent
approximation of the average branching factor of a directory tree:
echo $(find | wc -l) / $(find -type d | wc -l) | bc -l
I only have 18 directories with more than 2,000 entries. My largest is my
Firefox profile (21,561 entries). The next is /var/lib/dpkg/info (9,184
entries).
So based on just my own filesystem, maybe something in the ballpark of N=1M and
K=10 is a reasonable guess for real world users. I think that's encouraging.
I also counted the actual number of calls to the compare() function with
`find -s`. It varies a little from run to run, but here's an example:
        comparisons  total files  comparisons/files
  root    1,466,827      275,641               5.32
  home    3,342,447      786,659               4.25
  both    4,809,347    1,062,300               4.53
That, too, is encouraging: we only need about 5 string comparisons per file to
sort my entire filesystem, and I wouldn't expect that number to grow much for
larger filesystems.
Here's the equivalent table for `find | sort`:
        comparisons  total files  comparisons/files
  root    4,049,248      275,641              14.69
  home   11,254,537      786,659              14.30
  both   15,349,917    1,062,300              14.45
It needs many more comparisons even at this scale, and I'd expect it to scale
more poorly with larger filesystems.
I also looked at path lengths vs name lengths. All these numbers are averages:
        len(path)  len(name)  len(path)-len(name)  len(name)/len(path)
  root      54.39      16.08                38.31                 0.30
  home      99.32      17.86                81.46                 0.18
So we might see significant wins just by comparing names instead of paths,
especially when doing more expensive Unicode comparisons.
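To make that concrete, here's a toy illustration (the paths are hypothetical,
chosen to mimic deep sibling files) of how many characters a naive
strcmp()-style comparison examines before it can decide:

```python
def chars_examined(a, b):
    """Characters a naive strcmp() looks at before deciding."""
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            return n
    return n + 1  # hit the end of the shorter string

# Hypothetical sibling files deep in a home directory:
p1 = "/home/user/projects/findutils/find/ftsfind.c"
p2 = "/home/user/projects/findutils/find/util.c"
name1 = p1.rsplit("/", 1)[1]
name2 = p2.rsplit("/", 1)[1]

print(chars_examined(p1, p2))        # full paths: 36 chars (35 shared)
print(chars_examined(name1, name2))  # bare names: 1 char
```

Comparing full sibling paths re-scans the shared directory prefix on every
comparison; comparing bare names skips it entirely.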
I think that summarizes why we might expect `find -s` to be algorithmically
faster than `find | sort` for typical users, and that's probably enough for
tonight. I'll follow up soon with another little patch and some benchmark
results to demonstrate how this plays out in practice. Spoiler: FTS_DEFER_STAT
makes a big difference when using a compare function.
-Diego