Re: suggestion for new option: --block-break


From: Ole Tange
Subject: Re: suggestion for new option: --block-break
Date: Fri, 3 May 2019 00:56:13 +0200

On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org> wrote:

> I sometimes process files where I would like to be able to allow block 
> boundaries to only occur at rows where "--block-breaks" occur.

So we are talking about --pipe.

> I would like to be able to define such breaks as a perl expression, evaluated 
> for each line, whose value must be different from the prior line for that 
> line to be the beginning of a new block.

In --pipe we try not to look at each line, but to read a block and
find a split point near the end. This is for performance: looking
at each line (as you do when you use --pipe -N) is expensive.

So this will be rather slow.

Parsing each line into columns will be even slower. Probably similar to --shard.

> The expression should be able to refer to columns either by number or by 
> --header name.
>
> For example, I have a program that emits a graph for every protein, where every 
> line is residue of the protein, and there is a column, proteinID, whose value 
> changes when the protein changes which I would like to call as follows:
>
> parallel -j 40 --cat --block 10K --block-breaks proteinID

So that is the example with a column name. I assume the following is
what it would look like with column number 3 in a file with
;-separated values:

parallel --colsep ';' -j 40 --cat --block 10K --block-breaks 3

With perl expression it could be something like:

parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 $_=substr($_,-2,2)'

This way it looks a bit similar to a replacement string:
{=3 $_=substr($_,-2,2) =}.

> In the meantime I suppose a workaround is to preprocess the input and insert 
> fake --recstart wherever the column changes value.

Yep, that is the recommended solution. You can then use --rrs to
remove that separator when passing the records to the command.

You are basically asking for an option so you do not have to write:

cat foo.tsv |
  perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2);
    if($_ ne $last) { print "rEcOrDsEp" } $last=$_' |
  parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc

(with -F = --colsep, [3] = the name/number of the column,
$_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
randomly generated string that hopefully will not occur in your
input).
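
For the simpler case without a perl expression (break whenever the column value itself changes), the preprocessing step could be sketched with awk as below. mark_breaks and MARKER are illustrative names, not GNU parallel features, and the column is given by 1-based number only; looking it up by header name is left out.

```shell
#!/bin/sh
# Sketch only: mark_breaks and MARKER are hypothetical names.
MARKER=rEcOrDsEp   # assumed not to occur anywhere in the input

# mark_breaks COLUMN [SEPARATOR]
# Reads stdin and writes it to stdout, prefixing MARKER to the first line and
# to every line whose COLUMN (1-based) differs from the previous line's value.
# SEPARATOR defaults to a tab.
mark_breaks() {
  sep='\t'
  [ "$#" -ge 2 ] && sep=$2
  awk -F "$sep" -v col="$1" -v marker="$MARKER" '
    # Appending "" forces a string comparison, so "01" and "1" count as different
    NR == 1 || ($col "") != (last "") { printf "%s", marker }
    { last = $col; print }
  '
}

# Demo: the third column changes on the third line
printf 'r1\tA\tP1\nr2\tB\tP1\nr3\tC\tP2\n' | mark_breaks 3

# The marked stream would then be fed to GNU parallel as in the pipeline above:
#   mark_breaks 3 < foo.tsv |
#     parallel --pipe --recstart "$MARKER" --rrs --cat --block 10K wc
```

In the demo the marker ends up in front of the r1 and r3 lines, so with --recstart a block can only begin at one of those two lines.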

Is that correctly understood?


/Ole


