Re: suggestion for new option: --block-break

From:	Ole Tange
Subject:	Re: suggestion for new option: --block-break
Date:	Mon, 6 May 2019 01:22:39 +0200

On Sat, May 4, 2019 at 3:53 AM Cook, Malcolm <MEC@stowers.org> wrote:
>
> > On Fri, May 3, 2019 at 6:30 AM Cook, Malcolm <MEC@stowers.org> wrote:
> > > > From: Ole Tange <ole@tange.dk>
> > > > On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> > > > wrote:
> > :
> > > > Parsing each line into columns will be even slower. Probably similar
> > > > to -- shard.
> > >
> > > Unfamiliar with "shard" in this context.
> >
> > --shard (version 20190222 and later)
>
> Aha.  I'm still running 20180322.  I just now read the new manpage 
> on-line.....
>
> Yes, in fact, your `--shard` sounds quite similar in purpose, ensuring " all 
> lines of a given value is given to the same job slot."
>
> Do I understand correctly that as `--shard` is implemented now:
>  1) such lines need not even be consecutive

Yes.

>  2) all the rows given to a jobslot  *only* the values from a *single* 
> shardkey

No. The shardkey is hashed, and the hash modulo number_of_jobslots
selects the jobslot, so one jobslot can receive lines from several
shardkeys.

It is basically a parallelized version of --round-robin, but it makes
sure that all lines with a given shard key are given to the same jobslot.
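
To make the modulo step concrete, here is a minimal sketch. The helper
name shard_slot is made up for illustration, and cksum is only a
stand-in hash: it is an assumption, not necessarily the hash GNU
parallel uses internally. With 4 keys and 2 jobslots, at least one
jobslot has to receive more than one key.

```shell
# Hypothetical helper, not part of GNU parallel.
# Maps a shard key ($1) to a jobslot number in 0..($2 - 1).
# Assumption: cksum (POSIX CRC) stands in for the real hash function.
shard_slot() {
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % $2 ))
}

# The same key always maps to the same jobslot; different keys may collide.
for key in 123 12-3 221 2/21; do
  echo "$key -> jobslot $(shard_slot "$key" 2)"
done
```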

> > > > Is that correctly understood?
> > >
> > > I think you've got it.  That is pretty much what I wound up doing.

Git now has an initial version of --group-by:

--group-by val (alpha testing)
    Group input by value. Combined with --pipe, --group-by groups lines
    with the same value into a record.

    The value can be computed from the full line or from a single column.

    val can be:

     column number Use the value in the column numbered.

     column name   Treat the first line as a header and use the value
                   in the column named.

     perl expression
                   Run the perl expression and use $_ as the value.

     column number perl expression
                   Put the value of the column in $_, run the perl
                   expression, and use $_ as the value.

     column name perl expression
                   Put the value of the column in $_, run the perl
                   expression, and use $_ as the value.

    Example:

      UserID, Consumption
      123, 1
      123, 2
      12-3, 1
      221, 3
      221, 1
      2/21, 5

    If you want to group 123, 12-3, 221, and 2/21 into 4 records and
    pass one record at a time to wc:

      tail -n +2 table.csv | \
        parallel --pipe --colsep , --group-by 1 -kN1 wc

    Make GNU parallel treat the first line as a header:

      cat table.csv | \
        parallel --pipe --colsep , --header : --group-by 1 -kN1 wc

    Address column by column name:

      cat table.csv | \
        parallel --pipe --colsep , --header : --group-by UserID -kN1 wc

    If 12-3 and 123 are really the same UserID, remove non-digits in
    UserID when grouping:

      cat table.csv | parallel --pipe --colsep , --header : \
        --group-by 'UserID s/\D//g' -kN1 wc

    See also --shard.
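
For anyone who wants to check what the last example should produce
without the git version, the grouping can be roughly imitated with awk.
This is only a sketch: the function name group_by_digits and the "---"
separator are made up, and it assumes that a new record starts whenever
the value changes between consecutive lines.

```shell
# Hypothetical stand-in for: --header : --group-by 'UserID s/\D//g'
# Reads CSV on stdin and prints "---" between groups of consecutive
# rows whose UserID (column 1) is the same after removing non-digits.
group_by_digits() {
  awk -F, '
    NR == 1 { next }                  # skip the header line
    {
      key = $1
      gsub(/[^0-9]/, "", key)         # same effect as perl s/\D//g
      if (NR > 2 && key != prev) print "---"
      prev = key
      print
    }'
}
```

Running group_by_digits < table.csv on the table above gives two groups:
the three rows for 123 and 12-3, then the three rows for 221 and 2/21.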

Give it a spin and see if you can break it.

/Ole
