Re: suggestion for new option: --block-break
From: Ole Tange
Subject: Re: suggestion for new option: --block-break
Date: Mon, 6 May 2019 01:22:39 +0200
On Sat, May 4, 2019 at 3:53 AM Cook, Malcolm <MEC@stowers.org> wrote:
>
> > On Fri, May 3, 2019 at 6:30 AM Cook, Malcolm <MEC@stowers.org> wrote:
> > > > From: Ole Tange <ole@tange.dk>
> > > > On Wed, Apr 24, 2019 at 12:06 AM Cook, Malcolm <MEC@stowers.org>
> > > > wrote:
> > :
> > > > Parsing each line into columns will be even slower. Probably similar
> > > > to -- shard.
> > >
> > > Unfamiliar with "shard" in this context.
> >
> > --shard (version 20190222 and later)
>
> Aha. I'm still running 20180322. I just now read the new manpage
> on-line.....
>
> Yes, in fact, your `--shard` sounds quite similar in purpose, ensuring " all
> lines of a given value is given to the same job slot."
>
> Do I understand correctly that as `--shard` is implemented now:
> 1) such lines need not even be consecutive
Yes.
> 2) all the rows given to a jobslot *only* the values from a *single*
> shardkey
No. The shardkey is hashed, and the line is given to the jobslot numbered
hash(shardkey) modulo number_of_jobslots, so one jobslot can receive
several shardkeys.
It is basically a parallelized version of --round-robin, but it makes
sure that all lines with a given shard key are given to the same jobslot.
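
To make the hash-and-modulo idea concrete, here is a minimal shell sketch. It is not how GNU parallel is implemented: shard_slot is a hypothetical helper, and cksum is only a stand-in for whatever hash is used internally.

```shell
# Hypothetical helper, not part of GNU parallel: map a shard key to a jobslot.
# cksum is an assumed stand-in for the real hash function.
shard_slot() {
  key=$1
  nslots=$2
  # cksum prints "<crc> <byte count>"; keep only the CRC.
  crc=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo $(( crc % nslots ))
}

# The same key always maps to the same slot, even when its lines are not
# consecutive. Different keys may share a slot.
for key in 123 221 123 12-3 221; do
  echo "key=$key -> jobslot $(shard_slot "$key" 3)"
done
```

With more keys than jobslots, several keys necessarily end up in the same slot, which is why a jobslot does not see only a single shardkey.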
> > > > Is that correctly understood?
> > >
> > > I think you've got it. That is pretty much what I wound up doing.
Git now has an initial version of --group-by:
--group-by val (alpha testing)

Group input by value. Combined with --pipe, --group-by groups lines
with the same value into a record.

The value can be computed from the full line or from a single column.

val can be:

column number
    Use the value in the column numbered.
column name
    Treat the first line as a header and use the value
    in the column named.
perl expression
    Run the perl expression and use $_ as the value.
column number perl expression
    Put the value of the column in $_, run the perl
    expression, and use $_ as the value.
column name perl expression
    Put the value of the column in $_, run the perl
    expression, and use $_ as the value.

Example:

  UserID, Consumption
  123, 1
  123, 2
  12-3, 1
  221, 3
  221, 1
  2/21, 5

If you want to group 123, 12-3, 221, and 2/21 into 4 records and
pass one record at a time to wc:

  tail -n +2 table.csv | \
    parallel --pipe --colsep , --group-by 1 -kN1 wc

Make GNU parallel treat the first line as a header:

  cat table.csv | \
    parallel --pipe --colsep , --header : --group-by 1 -kN1 wc

Address column by column name:

  cat table.csv | \
    parallel --pipe --colsep , --header : --group-by UserID -kN1 wc

If 12-3 and 123 are really the same UserID, remove non-digits in
UserID when grouping:

  cat table.csv | parallel --pipe --colsep , --header : \
    --group-by 'UserID s/\D//g' -kN1 wc

See also --shard.
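
To see which records the examples above should produce without running GNU parallel, the grouping can be imitated with a small awk sketch. This is only an illustration: group_records and rows are hypothetical helpers, and the sketch assumes that lines with the same value are adjacent, as they are in the sample table. How --group-by treats non-adjacent lines is not covered here.

```shell
# Hypothetical stand-in for the grouping step; this is not GNU parallel.
# Reads comma-separated lines on stdin and prints "--" between records.
# A record is a run of adjacent lines with an equal key. The key is column 1;
# passing 1 as the argument strips non-digits from it first, imitating the
# perl expression s/\D//g.
group_records() {
  awk -F, -v strip="${1:-0}" '
    {
      key = $1
      if (strip) gsub(/[^0-9]/, "", key)
      if (NR > 1 && key != prev) print "--"
      print
      prev = key
    }'
}

# The sample table without its header line.
rows() {
  printf '%s\n' '123, 1' '123, 2' '12-3, 1' '221, 3' '221, 1' '2/21, 5'
}

rows | group_records      # group on the raw UserID
rows | group_records 1    # group on the UserID with non-digits removed
```

On the sample rows, the first call prints three separators (the four records 123, 12-3, 221, and 2/21), and the second prints one separator (two records), because 12-3 collapses into 123 and 2/21 into 221.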
Give it a spin and see if you can break it.
/Ole