Re: suggestion for new option: --block-break
From: Achim Gratz
Subject: Re: suggestion for new option: --block-break
Date: Sat, 04 May 2019 08:24:40 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)
Ole Tange writes:
>> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3
>> > $_=substr($_,-2,2)'
>>
>> I'm not sure what that "3" is doing there - some character transliteration
>> problem in our email?
>
> 3 is column 3. So $_ will contain the value in column 3. If no number
> given, then $_ is the full line.
>
> This will make it slightly harder distinguishing between a named
> column or some perl code. But I think it is OK to assume:
>
> * --block-breaks value contains only [a-z0-9_] and --header : is set
> => Named column
> * perl code otherwise
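The rule quoted above can be sketched in a few lines. This is only an illustration of the proposed heuristic, not how parallel implements anything: --block-breaks is a suggested option, and the function name classify_block_break_value and its header_set parameter are made up for this example.

```python
import re

def classify_block_break_value(value, header_set):
    """Hypothetical helper: decide how a --block-breaks value would be read.

    value      -- the string given to the proposed --block-breaks option
    header_set -- True if --header : was given (assumption: that is what
                  makes column names available at all)
    """
    # Only [a-z0-9_] characters and a header line present => named column.
    if header_set and re.fullmatch(r"[a-z0-9_]+", value):
        return "named column"
    # Everything else is treated as perl code.
    return "perl code"
```

With this rule a value such as '3 $_=substr($_,-2,2)' is always perl code, because it contains characters outside [a-z0-9_].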
I think it would be an interesting extension of parallel indeed. If I
understand the OP's requirements correctly, the rows that share a value in
the block-break column form one contiguous section of the file. I'm not
familiar with the data formats of genomics, but I believe that some of them
might even have fixed line lengths. That would allow for a binary search to
find the break points before going into the blocking algorithm, which would
be a net win if the preprocessing only has to read a small fraction of the
total blocks.
If so, it would really be a preprocessing step to run before entering
parallel, and the extension to parallel would be the ability to hand it a
list of blocks (which parallel may split further).
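A rough sketch of such a preprocessing step, to make the idea concrete. It assumes every line has the same length in bytes and that the key column is non-decreasing through the file. The second assumption is stronger than mere contiguity, but a plain binary search needs it. The names read_record, first_record_after and break_points are invented for this example.

```python
import os

def read_record(f, index, reclen):
    """Read the fixed-length record number `index` (0-based)."""
    f.seek(index * reclen)
    return f.read(reclen)

def first_record_after(f, key_of, key, lo, hi, reclen):
    """Smallest index in [lo, hi) whose key is greater than `key`.

    Returns hi if there is none. Assumes keys are non-decreasing.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if key_of(read_record(f, mid, reclen)) <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo

def break_points(path, reclen, key_of):
    """Return the byte offsets at which the key column changes value."""
    nrecords = os.path.getsize(path) // reclen
    offsets = []
    with open(path, "rb") as f:
        i = 0
        while i < nrecords:
            key = key_of(read_record(f, i, reclen))
            # Jump straight to the first record of the next group.
            i = first_record_after(f, key_of, key, i + 1, nrecords, reclen)
            if i < nrecords:
                offsets.append(i * reclen)
    return offsets
```

For a file with ';' separated columns, key_of could be lambda rec: rec.rstrip(b"\n").split(b";")[2] to break on column 3. Each group costs roughly log2(number of records) reads, so this only pays off when the groups are large compared to that. The resulting offsets are the list of blocks that would then have to be handed to parallel.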
> Yeah, I really do not like the name --block-breaks. I like --group-by
> a little better, but not 100% happy with that either.
Or --scatter / --split(-*)?
Regards,
Achim.
--
+<[Q+ Matrix-12 WAVE#46+305 Neuron microQkb Andromeda XTk Blofeld]>+
Factory and User Sound Singles for Waldorf rackAttack:
http://Synth.Stromeko.net/Downloads.html#WaldorfSounds