RE: suggestion for new option: --block-break


From: Cook, Malcolm
Subject: RE: suggestion for new option: --block-break
Date: Fri, 3 May 2019 04:30:33 +0000

> -----Original Message-----
> From: Ole Tange <ole@tange.dk>
> Sent: Thursday, May 2, 2019 5:56 PM
> To: Cook, Malcolm <MEC@stowers.org>
> Cc: parallel@gnu.org
> Subject: Re: suggestion for new option: --block-break
> 
> On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> wrote:
> 
> > I sometimes process files where I would like to allow block boundaries
> > to occur only at rows where "--block-breaks" occur.
> 
> So we are talking about --pipe.


Yes

> 
> > I would like to be able to define such breaks as a perl expression,
> > evaluated for each line, whose value must be different from the prior
> > line for that line to be the beginning of a new block.
> 
> In --pipe we try not to look at each line, but to read a block and find a
> split point near the end. This is due to performance: Looking at each line
> (as you do when you use --pipe -N) is expensive.
> 
> So this will be rather slow.

Right.   Got it.

> 
> Parsing each line into columns will be even slower. Probably similar to
> --shard.

Unfamiliar with "shard" in this context.

> 
> > The expression should be able to refer to columns either by number or by
> > --header name.
> >
> > For example, I have a program that emits a graph for every protein, where
> > every line is a residue of the protein, and there is a column, proteinID,
> > whose value changes when the protein changes, which I would like to call
> > as follows:
> >
> > parallel -j 40 --cat --block 10K --block-breaks proteinID
> 
> So that is the example with a column name, I assume the following is what it
> would look like with column number 3 in a file with ; separated values:
> 
> parallel --colsep ';' -j 40 --cat --block 10K --block-breaks 3

That's the idea.

> 
> With perl expression it could be something like:
> 
> parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 $_=substr($_,-2,2)'

I'm not sure what that "3" is doing there - some character transliteration 
problem in our email?

> 
> This way it looks a bit similar to a replacement string: {=3
> $_=substr($_,-2,2) =}.

Building on our existing replacement string syntax would make sense.

> 
> > In the meantime I suppose a workaround is to preprocess the input and
> > insert a fake --recstart wherever the column changes value.
> 
> Yep, that is the recommended solution. You can then use --rrs to remove
> that separator when passing the records to the command.

I think I've even read documentation of yours at one point advocating this 
as the recommended workaround.  Anyway, it worked for me.

> You are basically asking for an option so you do not have to write:
> 
> cat foo.tsv |
>   perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2); if($_ ne $last) { print "rEcOrDsEp" } $last=$_' |
>   parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
> 
> (with -F = --colsep, [3] = the name/number of the
> column, $_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
> randomly generated string that hopefully will not occur in your input).
> 
> Is that correctly understood?

I think you've got it.  That is pretty much what I wound up doing.
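
For readers of the archive, here is a minimal sketch of just the preprocessing step discussed above, assuming tab-separated input. It uses awk rather than the perl one-liner quoted above, and it compares the raw column value instead of applying a perl expression. The function name `mark_breaks` and its arguments are hypothetical, chosen only for illustration; they are not part of GNU parallel.

```shell
# Hypothetical helper (not part of GNU parallel): mark_breaks COLUMN [MARKER]
# Reads tab-separated lines on stdin and prints MARKER in front of every
# line whose COLUMN value (1-based) differs from the previous line's value.
# The first line always gets a marker, as in the quoted perl version.
mark_breaks() {
  awk -F'\t' -v col="$1" -v marker="${2:-rEcOrDsEp}" '
    # Appending "" forces a string comparison, so "1" and "1.0" differ.
    NR == 1 || ($col "") != (last "") { printf "%s", marker }
    { last = $col; print }
  '
}

# Example: three residues of protein P1 followed by two of P2.
# A marker is emitted before the "A" line and before the "D" line.
printf 'A\tP1\nB\tP1\nC\tP1\nD\tP2\nE\tP2\n' | mark_breaks 2
```

The marked stream can then be piped into `parallel --pipe --recstart rEcOrDsEp --rrs ...` as in the quoted example, so that blocks are split only where the column value changes. The marker must be a string that does not occur in the input.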

And I appreciate your observations about performance above, but, truth be told, 
the performance hit has to be taken somewhere, either in the upstream perl 
process or interwoven with `parallel`'s logic.

BTW: Another possible "metaphor" that might be useful in documenting such an 
option, should you care to implement it, is that of "keeping selective 
consecutive records together that have some property in common".

Thanks - this is really not urgent at all - a convenience - "syntactic sugar" 
if you will.

Thanks for parallel

Malcolm

> 
> 
> /Ole
