bug#60506: feature: parallel grep --recursive
From: David G. Pickett
Subject: bug#60506: feature: parallel grep --recursive
Date: Tue, 3 Jan 2023 21:15:18 +0000 (UTC)
It seems like we have two suggestions: parallelism across different files and
parallelism within large files.
- Parallelism across different files is tricky in two ways: you need threads
and a mutex on the file-name stream, and, for parallel directory traversal,
some sort of threads-and-queue arrangement to pass the file names from
producers to the grep consumers (a minimal sketch of such a queue appears
after this list).
- You might need a following consumer layer to ensure the output lines are
in order, or at the very least not commingled. A big FILE* buffer and fflush()
can ensure each line is a single write() (see the small snippet after this
list), but you give up the original ordering unless you arrange to buffer or
sort the output.
- You probably want to set a thread count limit.
- You might want to start with one file-name producer, one grep
consumer-producer, and one arrange/sort consumer, and add more threads to
whichever upstream side is emptying or filling a fixed-size queue.
- But of course, a lot of this is available from "parallel" if you make a
study of it!
- I made a C pipe fitting I called xdemux to take a stream of lines, like
file names, from stdin and spread it in rotation to N downstream popen()
pipes to a given command, like xargs grep (a rough reconstruction of the idea
follows this list). N can be set to 2 x your local core count so it is less
likely to block on I/O, paging, or congestion.
- I also wrote a simpler, line-oriented, faster xargs, fxargs!
- I also wrote a C tool I called pipebuf to buffer stdin to stdout so one
slow consumer does not stop others from getting work, but more parallelism is a
simpler solution.
- Intel hyperthreaded CPUs expose two hardware threads per core, so you can
run roughly twice as many threads in parallel as you have physical cores.
- Parallelism within large files reminds me of Ab Initio ETL, which I assume
divides a file into N portions, with each thread responsible for any line that
starts in its portion, even if it ends in another (the last sketch below shows
this ownership rule). Merging the output to present hits in order requires
some sort of buffering or sorting of the output. For the very simplest use of
grep (just "is it in there?"), you need to design it so that any hit can call
off the other threads.
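To make the producer/consumer bullets concrete, here is a minimal sketch --
not GNU grep's code; grep_one_file(), QSIZE, and NWORKERS are made up for
illustration -- of one stdin producer feeding a fixed-size, mutex-protected
queue of file names to a pool of worker threads:

    /* Illustrative sketch only -- not GNU grep's code.  One producer
     * reads file names from stdin into a fixed-size, mutex-protected
     * queue; NWORKERS consumer threads pop names and run a stand-in
     * for the real per-file grep work. */
    #define _POSIX_C_SOURCE 200809L
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define QSIZE    64          /* the fixed-size queue */
    #define NWORKERS 8           /* the thread count limit */

    static char *names[QSIZE];
    static int head, tail, count, done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

    static void push(char *name)              /* producer side */
    {
        pthread_mutex_lock(&lock);
        while (count == QSIZE)
            pthread_cond_wait(&not_full, &lock);
        names[tail] = name;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }

    static char *pop(void)                    /* consumer side; NULL = done */
    {
        char *name = NULL;
        pthread_mutex_lock(&lock);
        while (count == 0 && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (count > 0) {
            name = names[head];
            head = (head + 1) % QSIZE;
            count--;
            pthread_cond_signal(&not_full);
        }
        pthread_mutex_unlock(&lock);
        return name;
    }

    static void grep_one_file(const char *name)   /* hypothetical stand-in */
    {
        printf("would grep: %s\n", name);
    }

    static void *worker(void *arg)
    {
        char *name;
        (void)arg;
        while ((name = pop()) != NULL) {
            grep_one_file(name);
            free(name);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        char line[4096];
        int i;

        for (i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);

        while (fgets(line, sizeof line, stdin)) { /* one file name per line */
            line[strcspn(line, "\n")] = '\0';
            push(strdup(line));
        }

        pthread_mutex_lock(&lock);
        done = 1;                             /* wake any waiting consumers */
        pthread_cond_broadcast(&not_empty);
        pthread_mutex_unlock(&lock);

        for (i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }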
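And the big-buffer trick from the output bullet, in miniature: with a fully
buffered stream, one fflush() per completed line makes each line leave as a
single write(), so concurrent writers commingle only at line granularity
(emit_match() and the sizes are invented for the example):

    /* Sketch of the big-buffer trick: with a fully buffered stdout,
     * one fflush() per completed line makes each line leave as a
     * single write(), assuming the line fits in the buffer. */
    #include <stdio.h>

    static char big[1 << 20];            /* 1 MiB stdout buffer */

    static void emit_match(const char *file, long lineno, const char *text)
    {
        printf("%s:%ld:%s\n", file, lineno, text);
        fflush(stdout);                  /* the whole line in one write() */
    }

    int main(void)
    {
        setvbuf(stdout, big, _IOFBF, sizeof big);  /* before any output */
        emit_match("example.c", 42, "TODO: fix this");
        return 0;
    }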
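This is not the real xdemux, just a rough reconstruction of the idea from the
description above; NPIPES and the default downstream command are placeholders:

    /* Not the real xdemux -- a minimal reconstruction of the idea.
     * Read lines (e.g. file names) from stdin and deal them in
     * rotation to NPIPES popen() pipes running the given command. */
    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>

    #define NPIPES 8     /* e.g. 2 x your core count, as suggested */

    int main(int argc, char **argv)
    {
        /* The default downstream command is only a placeholder. */
        const char *cmd = argc > 1 ? argv[1] : "xargs grep -e TODO";
        FILE *pipes[NPIPES];
        char line[4096];
        int i, next = 0;

        for (i = 0; i < NPIPES; i++) {
            pipes[i] = popen(cmd, "w");
            if (pipes[i] == NULL) { perror("popen"); return 1; }
        }

        /* Deal each input line to the next pipe in rotation; note a
         * slow consumer blocks the dealer here, which is the problem
         * pipebuf was meant to ease. */
        while (fgets(line, sizeof line, stdin)) {
            fputs(line, pipes[next]);
            next = (next + 1) % NPIPES;
        }

        for (i = 0; i < NPIPES; i++)
            pclose(pipes[i]);            /* waits for each consumer */
        return 0;
    }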
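Finally, a sketch of just the line-ownership rule for splitting one large
file across workers, not a real parallel grep; on_line() and the range
arithmetic are illustrative only:

    /* Sketch of the ownership rule only, not a real parallel grep:
     * given a whole file in buf[0..len), a worker for the byte range
     * [start, end) handles every line that STARTS in its range, even
     * if the line ends in the next range.  on_line() is hypothetical. */
    #include <stddef.h>
    #include <string.h>

    void scan_chunk(const char *buf, size_t len, size_t start, size_t end,
                    void (*on_line)(const char *line, size_t linelen))
    {
        size_t pos = start;

        /* Unless at offset 0, skip the partial line that started in
         * the previous range; its owner will handle it. */
        if (pos > 0) {
            const char *nl = memchr(buf + pos - 1, '\n', len - (pos - 1));
            if (nl == NULL)
                return;                  /* no line starts in this range */
            pos = (size_t)(nl - buf) + 1;
        }

        /* Handle every line starting before `end`, even if it runs past. */
        while (pos < end && pos < len) {
            const char *nl = memchr(buf + pos, '\n', len - pos);
            size_t linelen = nl ? (size_t)(nl - (buf + pos)) : len - pos;
            on_line(buf + pos, linelen);
            pos += linelen + 1;
        }
    }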
Doing both of the above simultaneously would be a lot! Either one is a lot of
effort to focus on what is just one of many simple tools, and other tools
might want similar enhancements! :D
File read speeds vary wildly: network drives over networks of varying speed
and congestion, spinning hard drives of various RPM and bit density,
solid-state drives, and then files cached in DRAM (most read I/O uses
mmap64()), not to mention data in motherboard and CPU caches at many levels. I
wrote an mmap64()-based fgrep, and it turned out to be so "good" on a big file
list that ALL the other processes on the group's server got swapped out big
time (without parallelism)! A toy version of the idea appears below.
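This is not the original tool, just a toy mmap()ed fixed-string search. It
assumes glibc: memmem() is a GNU extension, and plain mmap() with 64-bit file
offsets stands in for mmap64():

    /* Toy mmap()ed fixed-string search, not the original tool.  Exits
     * 0 on the first hit, like grep -q.  Assumes glibc: memmem() is a
     * GNU extension, and plain mmap() with 64-bit file offsets stands
     * in for mmap64(). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s STRING FILE\n", argv[0]);
            return 2;
        }

        int fd = open(argv[2], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[2]); return 2; }
        if (st.st_size == 0) return 1;   /* mmap() of 0 bytes would fail */

        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 2; }

        /* "Is it in there?" -- one scan of the whole mapping. */
        int found = memmem(map, st.st_size, argv[1], strlen(argv[1])) != NULL;

        munmap(map, st.st_size);
        close(fd);
        return found ? 0 : 1;
    }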
-----Original Message-----
From: Paul Jackson <pj@usa.net>
To: Paul Eggert <eggert@cs.ucla.edu>; 60506@debbugs.gnu.org
Sent: Mon, Jan 2, 2023 9:56 pm
Subject: bug#60506: feature: parallel grep --recursive
<< a parallel grep to search a single large file >>
I'm but one user, and a rather idiosyncratic user at that,
but for my usage patterns, the specialized logic that it
would take to run a parallelized grep on a large file
would likely not shrink the elapsed time enough to justify
the coding, documentation, and maintenance effort.
I would expect the time to read the large file in from disk to
dominate the total elapsed time in any case.
(or maybe I am just jealous that I didn't think of that parallel
grep use case myself <grin>.)
--
Paul Jackson
pj@usa.net