RE: suggestion for new option: --block-break


From: Cook, Malcolm
Subject: RE: suggestion for new option: --block-break
Date: Sat, 4 May 2019 01:51:36 +0000

> On Fri, May 3, 2019 at 6:30 AM Cook, Malcolm <MEC@stowers.org> wrote:
> > > From: Ole Tange <ole@tange.dk>
> > > On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> > > wrote:
> :
> > > Parsing each line into columns will be even slower. Probably similar
> > > to --shard.
> >
> > Unfamiliar with "shard" in this context.
> 
> --shard (version 20190222 and later)

Aha.  I'm still running 20180322.  I just now read the new man page online.

Yes, in fact, your `--shard` sounds quite similar in purpose, ensuring "all 
lines of a given value is given to the same job slot."

Do I understand correctly that as `--shard` is implemented now:
 1) such lines need not even be consecutive 
 2) all the rows given to a jobslot contain *only* the values from a *single* 
shardkey?

This is different from my current use case in that
  3) lines ARE in fact consecutive (potentially allowing jobs to be submitted 
as the file is being scanned)
  4) lines from multiple consecutive shardkeys could be allowed (potentially 
allowing fewer larger jobs to be submitted)

All this said, however, for my use case, a workaround for not having the 
requested --block-break would be to upgrade my installation and use `--shard`.

Perhaps we should consider my request as a way to modify the behavior of 
`--shard` by
 (a)  declaring that lines with the same shardkey are consecutive (allowing the 
faster processing of (3))
 (b) allowing a single job to potentially hold multiple lines from multiple 
shardkeys 
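
To make (a) and (b) concrete, here is a minimal sketch of the filtering step, 
written with awk as a separate pre-filter rather than as a change to `parallel` 
itself. The helper name `mark_blocks`, its two arguments, and the size 
threshold are my own assumptions for illustration only; the marker string 
`rEcOrDsEp` is the one used in the perl one-liner quoted further down.

```shell
# Hypothetical helper, not part of GNU Parallel: mark_blocks COL MINSIZE
# Reads tab-separated lines on stdin in a single streaming pass.
# Assumes lines with the same key in column COL are consecutive (point (a)).
# Writes the marker rEcOrDsEp before a line only when the key changes AND at
# least MINSIZE characters have been written since the last marker, so one
# block may hold several consecutive keys (point (b)) but never splits a key.
mark_blocks() {
  awk -F '\t' -v col="$1" -v min="$2" '
    { key = $col "" }                       # force string comparison of keys
    NR > 1 && key != last && size >= min { printf "rEcOrDsEp"; size = 0 }
    { print; size += length($0) + 1; last = key }
  '
}

# Keys a, a, b, c: with MINSIZE=10 the short "a" and "b" runs share a block
# and a marker is written only before the "c" line.
printf 'a\t1\na\t2\nb\t3\nc\t4\n' | mark_blocks 1 10
```

With MINSIZE set to 0 this degenerates to a marker at every key change, and 
the marked stream could then be fed to `parallel --pipe --recstart rEcOrDsEp 
--rrs` as in the pipeline quoted below.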

In any case, as I mentioned, I have a workaround, and now I have two.

> > > With perl expression it could be something like:
> > >
> > > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 $_=substr($_,-2,2)'
> >
> > I'm not sure what that "3" is doing there - some character transliteration
> > problem in our email?
> 
> 3 is column 3. So $_ will contain the value in column 3. If no number given,
> then $_ is the full line.
> 
> This will make it slightly harder distinguishing between a named column or
> some perl code. But I think it is OK to assume:
> 
> * --block-breaks value contains only [a-z0-9_] and --header : is set => Named
> column
> * perl code otherwise
> 
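
For what it is worth, the two-rule test above could be sketched roughly like 
this. The function name `classify_break_value` and the "yes"/"no" second 
argument are hypothetical, and `--block-breaks` itself is only a proposed 
option, so this is an illustration of the rule as I read it and not of any 
existing code in `parallel`.

```shell
# Hypothetical helper: classify_break_value VALUE HEADER_SET
# HEADER_SET is "yes" when "--header :" was given, anything else otherwise.
# Prints "column" when VALUE consists only of [a-z0-9_] and a header is set,
# and "perl" in every other case (including an empty VALUE).
classify_break_value() (
  LC_ALL=C   # keep the a-z range limited to ASCII lowercase
  case "$1" in
    ''|*[!a-z0-9_]*) echo perl ;;
    *) if [ "$2" = yes ]; then echo column; else echo perl; fi ;;
  esac
)

classify_break_value 'chrom' yes
classify_break_value '$_=substr($_,-2,2)' yes
```

A value such as `3 $_=substr($_,-2,2)` contains characters outside 
[a-z0-9_], so under this reading it is treated as perl code.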
> > > You are basically asking for an option so you do not have to write:
> > >
> > > cat foo.tsv |
> > >   perl -F"\t" -ape 'local $_=$F[3]; $_=substr($_,-2,2); if($_ ne
> > > $last) { print "rEcOrDsEp" } $last=$_' |
> > >   parallel --pipe --recstart rEcOrDsEp --rrs --cat --block 10K wc
> > >
> > > (with -F = --colsep, [3] = the name/number of the
> > > column, $_=substr($_,-2,2) being the perlexpr, and rEcOrDsEp being a
> > > randomly generated string that hopefully will not occur in your input).
> > >
> > > Is that correctly understood?
> >
> > I think you've got it.  That is pretty much what I wound up doing.
> 
> Good.
> 
> > And I appreciate your observations about performance above, but, truth
> > be told, the performance hit has to be taken somewhere, either in the
> > upstream perl process or interwoven with `parallel`'s logic.
> 
> That is a valid argument.
> 
> Also GNU Parallel is known for having options that are simply activating
> wrapper scripts, so it is not completely new territory.

Yes - this could probably be implemented as a case of parallel filtering its 
own input...

> > BTW: Another possible "metaphor" that might be useful in documenting
> > such an option, should you care to implement it, is that of "keeping selective
> > consecutive records together that have some property in common".
> 
> Yeah, I really do not like the name --block-breaks. I like --group-by a little
> better, but not 100% happy with that either.

In word-processing there is the idea of "keep together" or "controlling 
pagination" - perhaps there is a better metaphor there.

> 
> So dear mailing list: Please come up with better names; a description for
> the man page would also be nice.
> 
> 
> /Ole
