Re: Add sort option to find
From: Diego Ongaro
Subject: Re: Add sort option to find
Date: Wed, 2 Sep 2020 21:44:34 -0700
On Wed, Sep 2, 2020 at 7:22 PM Dale R. Worley <worley@alum.mit.edu> wrote:
>
> Bernhard Voelker <mail@bernhard-voelker.de> writes:
> >> As I wrote in the info page, "the -s option is more efficient than piping
> >> large amounts of output of find into the sort command, [...]
> >
> > Why exactly is that more efficient? Which resources are saved by that?
> > CPU, RAM (*), time, ...?
>
> "find -s" requires find to store, at any one time, the sorted entries of
> all of the directories on the path above the current directory. But
> "find | sort" requires the entire output of find to be stored until find
> completes. In a sort-of-typical-asymptotic way the second is
> exponentially larger than the first. If you made a tree of directories
> containing two entries, 20 layers deep, there would be a million entries
> to be sorted, but find -s would only need to keep track of 40 at any one
> time.
Thanks for the thorough and thoughtful feedback, Berny, and for your useful
example, Dale. I meant algorithmic efficiency in my original claim, but I see
how my wording was ambiguous. Intuitively, sorting names in each directory has
two algorithmic wins: (1) each file must be compared with only its siblings
instead of the entire directory tree, and (2) only names are compared instead
of full paths.
Berny's shell script example is exactly the worst-case scenario. If you put all
your files in one directory, `find -s` will invoke `qsort()` only once, so
that's O(N log N), just like `find | sort`.
Dale's example was a much better scenario for `find -s`. In general, consider
the best case of a perfectly balanced K-ary tree consisting of N total files
and directories. `find -s` needs O(K log K) comparisons per directory, and there
are (N - 1) / K directories. It might be cheating, but if we assume K is
constant, that comes out to O(N) comparisons total. So in this best case
filesystem layout, sorting each directory instead of sorting the output is a
significant algorithmic win.
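To put rough numbers on that, here's a quick back-of-the-envelope script (a
sketch: the log2-based counts are idealized sort estimates, not measurements of
find itself):

```python
import math

def est_comparisons(n, k):
    """Idealized comparison counts for a balanced K-ary tree with
    n total entries (files + directories) and branching factor k."""
    # find -s: roughly (n - 1) / k directories, each sorting its k
    # children at about k * log2(k) comparisons per qsort() call.
    per_dir = ((n - 1) / k) * (k * math.log2(k))
    # find | sort: one global sort over all n paths.
    global_sort = n * math.log2(n)
    return per_dir, global_sort

for k in (2, 10, 100):
    per_dir, global_sort = est_comparisons(10**6, k)
    print(f"N=1M K={k:>3}: per-dir ~{per_dir:>12,.0f}  "
          f"global ~{global_sort:>12,.0f}  "
          f"ratio {global_sort / per_dir:.1f}x")
```

Note the per-directory total simplifies to (N - 1) * log2(K), i.e. O(N log K),
which is O(N) once K is held constant -- that's where the O(N) claim above
comes from.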
Most real users don't have the patience to perfectly balance their filesystems.
I ran some stats on my own machine to get a better sense of one real user's
branching factor.
Here are some basic counts:

            files     dirs       both
  root    252,668   22,973    275,641
  home    669,986  116,673    786,659
  both    922,654  139,646  1,062,300
And for each of the directories, how many entries they have:

              root      home
  mean       12.96      6.78
  std        99.47     76.36
  min         0.00      0.00
  25%         1.00      1.00
  50%         2.00      2.00
  75%         6.00      5.00
  90%        18.00     11.00
  99%       175.28     69.00
  99.9%     994.00    404.00
  99.99%   3060.29   1825.95
  max      9184.00  21561.00
For those of you following along, this seems to be a cheap and decent
approximation of the average branching factor of a directory tree:
echo $(find | wc -l) / $(find -type d | wc -l) | bc -l
I only have 18 directories with more than 2,000 entries. My largest is my
Firefox profile (21,561 entries). The next is /var/lib/dpkg/info (9,184
entries).
So based on just my own filesystem, maybe something in the ballpark of N=1M and
K=10 is a reasonable guess for real world users. I think that's encouraging.
I also counted the actual number of calls to the compare() function with
`find -s`. It varies a little from run to run, but here's an example:
        comparisons  total files  comparisons/files
  root    1,466,827      275,641               5.32
  home    3,342,447      786,659               4.25
  both    4,809,347    1,062,300               4.53
That, too, is encouraging: we only need about 5 string comparisons per file to
sort my entire filesystem, and I wouldn't expect that number to grow much for
larger filesystems.
Here's the equivalent table for `find | sort`:
        comparisons  total files  comparisons/files
  root    4,049,248      275,641              14.69
  home   11,254,537      786,659              14.30
  both   15,349,917    1,062,300              14.45
It needs many more comparisons even at this scale, and I'd expect it to scale
more poorly with larger filesystems.
I also looked at path lengths vs name lengths. All these numbers are averages:
        len(path)  len(name)  len(path)-len(name)  len(name)/len(path)
  root      54.39      16.08                38.31                 0.30
  home      99.32      17.86                81.46                 0.18
So we might see significant wins just by comparing names instead of paths,
especially when doing more expensive Unicode comparisons.
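To make that concrete, here's a toy illustration (the paths are hypothetical,
chosen to mimic deep sibling files) of how many characters a naive
strcmp()-style comparison examines before it can decide:

```python
def chars_examined(a, b):
    """Characters a naive strcmp() looks at before deciding."""
    n = 0
    for x, y in zip(a, b):
        n += 1
        if x != y:
            return n
    return n + 1  # hit the end of the shorter string

# Hypothetical sibling files deep in a home directory:
p1 = "/home/user/projects/findutils/find/ftsfind.c"
p2 = "/home/user/projects/findutils/find/util.c"
name1 = p1.rsplit("/", 1)[1]
name2 = p2.rsplit("/", 1)[1]

print(chars_examined(p1, p2))        # full paths: 36 chars (35 shared)
print(chars_examined(name1, name2))  # bare names: 1 char
```

Comparing full sibling paths re-scans the shared directory prefix on every
comparison; comparing bare names skips it entirely.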
I think that summarizes why we might expect `find -s` to be algorithmically
faster than `find | sort` for typical users, and that's probably enough for
tonight. I'll follow up soon with another little patch and some benchmark
results to demonstrate how this plays out in practice. Spoiler: FTS_DEFER_STAT
makes a big difference when using a compare function.
-Diego