
Re: suggestion for new option: --block-break


From: Ole Tange
Subject: Re: suggestion for new option: --block-break
Date: Fri, 3 May 2019 22:10:48 +0200

On Fri, May 3, 2019 at 6:30 AM Cook, Malcolm <MEC@stowers.org> wrote:
> > From: Ole Tange <ole@tange.dk>
> > On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> > wrote:
:
> > Parsing each line into columns will be even slower, probably similar to
> > --shard.
>
> Unfamiliar with "shard" in this context.

--shard is a GNU Parallel option (version 20190222 and later) that
splits the input among jobs based on the value of a column.

> > With perl expression it could be something like:
> >
> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3
> > $_=substr($_,-2,2)'
>
> I'm not sure what that "3" is doing there - some character transliteration
> problem in our email?

3 is column 3. So $_ will contain the value in column 3. If no number
is given, then $_ is the full line.

This will make it slightly harder to distinguish between a named
column and perl code. But I think it is OK to assume:

* --block-breaks value contains only [a-z0-9_] and --header : is set
  => named column
* otherwise => perl code
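
The rule above could be sketched as a shell function. The name
classify_block_break and its two arguments are made up for
illustration. Treating an empty value as perl code is my assumption;
the rule does not cover that case.

```shell
# Hypothetical helper: decide how a --block-breaks value would be read.
#   $1 = the value given to --block-breaks
#   $2 = "yes" if --header : is set, anything else otherwise
# The body runs in a subshell so the LC_ALL=C setting (which keeps the
# a-z range strictly lowercase ASCII) does not leak to the caller.
classify_block_break() (
  LC_ALL=C
  case $1 in
    ''|*[!a-z0-9_]*)
      # Empty (assumption) or contains a character outside [a-z0-9_].
      echo "perl code" ;;
    *)
      if [ "$2" = yes ]; then
        echo "named column"
      else
        echo "perl code"
      fi ;;
  esac
)

classify_block_break chrom yes
classify_block_break '3 $_=substr($_,-2,2)' yes
classify_block_break chrom no
```

One consequence of the rule: with --header : set, a short piece of
perl code that happens to use only [a-z0-9_] would be read as a column
name.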

> > You are basically asking for an option so you do not have to write:
> >
> > cat foo.tsv |
> >   perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2); if($_ ne
> > $last) { print "rEcOrDsEp" } $last=$_' |
> >   parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
> >
> > (with -F = --colsep, [3] = the name/number of the
> > column, $_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
> > randomly generated string that hopefully will not occur in your input).
> >
> > Is that correctly understood?
>
> I think you've got it.  That is pretty much what I wound up doing.

Good.

> And I appreciate your observations about performance above, but, truth be
> told, the performance hit has to be taken somewhere, either in the upstream
> perl process or interwoven with `parallel`'s logic.

That is a valid argument.

Also GNU Parallel is known for having options that are simply
activating wrapper scripts, so it is not completely new territory.

> BTW: Another possible "metaphor" that might be useful in documenting such an 
> option, should you care to implement it, is that of "keeping selective 
> consecutive records together that have some property in common".

Yeah, I really do not like the name --block-breaks. I like --group-by
a little better, but not 100% happy with that either.

So dear mailing list: Please come up with better names. A description
for the man page would also be nice.


/Ole


