Re: suggestion for new option: --block-break
From: Ole Tange
Subject: Re: suggestion for new option: --block-break
Date: Fri, 3 May 2019 00:56:13 +0200
On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org> wrote:
> I sometimes process files where I would like to be able to allow block
> boundaries to only occur at rows where “--block-breaks” occur.
So we are talking about --pipe.
> I would like to be able to define such breaks as a perl expression, evaluated
> for each line, whose value must be different from the prior line for that
> line to be the beginning of a new block.
In --pipe we try not to look at each line, but to read a block and
find a split point near the end. This is for performance: looking
at each line (as you do when you use --pipe -N) is slow.
So this will be rather slow.
Parsing each line into columns will be even slower. Probably similar to --shard.
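To illustrate the "read a block and find a split point near the end"
idea, here is a minimal Python sketch. It is not GNU parallel's actual
code; the function name split_block and its parameters are made up for
this example. The point is that one backwards search replaces a check
on every line.

```python
def split_block(buf: bytes, recend: bytes = b"\n"):
    """Split buf just after the last record end.

    Returns (block, remainder). The remainder is an incomplete record
    that would be prepended to the next chunk read from the input.
    Hypothetical helper, for illustration only.
    """
    cut = buf.rfind(recend)  # single search from the end of the buffer
    if cut == -1:
        # No complete record in this buffer yet: pass nothing on.
        return b"", buf
    cut += len(recend)
    return buf[:cut], buf[cut:]


if __name__ == "__main__":
    block, rest = split_block(b"row1\nrow2\nro")
    print(block, rest)
```

A per-line break test cannot use this shortcut, because every line
has to be inspected to see whether its key changed.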
> The expression should be able to refer to columns either by number or by
> --header name.
>
> For example, I have a program that emits a graph for every protein, where every
> line is a residue of the protein, and there is a column, proteinID, whose value
> changes when the protein changes which I would like to call as follows:
>
> parallel -j 40 --cat --block 10K --block-breaks proteinID
So that is the example with a column name, I assume the following is
what it would look like with column number 3 in a file with ;
separated values:
parallel --colsep ';' -j 40 --cat --block 10K --block-breaks 3
With perl expression it could be something like:
parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 $_=substr($_,-2,2)'
This way it looks a bit similar to a replacement string: {=3
$_=substr($_,-2,2) =}.
> In the meantime I suppose a workaround is to preprocess the input and insert
> fake --recstart wherever the column changes value.
Yep, that is the recommended solution. You can then use --rrs to
remove that separator when passing the records to the command.
You are basically asking for an option so you do not have to write:
cat foo.tsv |
perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2); if($_ ne
$last) { print "rEcOrDsEp" } $last=$_' |
parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
(with -F = --colsep, [3] = the name/number of the
column, $_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
randomly generated string that hopefully will not occur in your
input).
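The same preprocessing step can be sketched in Python, for readers who
find it easier to follow than the one-liner. This is only a sketch
under assumptions: mark_breaks and its parameters are hypothetical
names, the column index is zero-based (like $F[3] in Perl), and the
key function plays the role of the perlexpr.

```python
import sys

_UNSET = object()  # sentinel: compares unequal to every key value


def mark_breaks(lines, col, sep="\t", marker="rEcOrDsEp", key=lambda v: v):
    """Yield lines, prefixing the marker wherever the key of column
    `col` differs from the key of the previous line.

    The first line always gets the marker, so each group starts with it.
    """
    last = _UNSET
    for line in lines:
        value = key(line.rstrip("\n").split(sep)[col])
        if value != last:
            yield marker + line
        else:
            yield line
        last = value


if __name__ == "__main__":
    # Filter stdin to stdout, keyed on the last two characters of
    # column index 3, mirroring $_=substr($_,-2,2) in the example above.
    sys.stdout.writelines(mark_breaks(sys.stdin, 3, key=lambda v: v[-2:]))
```

The output would then be piped into
parallel --pipe --recstart rEcOrDsEp --rrs ... as shown above.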
Is that correctly understood?
/Ole