GNU Parallel Bug Reports Truncated large records
From: Johannes Dröge
Subject: GNU Parallel Bug Reports Truncated large records
Date: Mon, 23 Feb 2015 14:28:07 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0
Hi Ole and GNU parallel devs,
I'm processing large files (~50 GiB) with variable record sizes and have the
following issues:
1) The processing run-time of individual blocks grows faster than linearly with
the input size. Therefore, it would be best if GNU parallel could pass single
records or a fixed number of records to each job, or at least did not
automatically increase the block size. Instead, the block size auto-detection
increases the block size on large individual blocks until only very few
processes run in parallel, and these then dominate the overall run-time.
This behavior strongly reduces the granularity of the parallel execution.
2) I'm seeing that large records (>2 GiB) are being truncated at 2 GiB and thus
passed incompletely via stdin. You can find my compressed input at
https://elefant.bifo.helmholtz-hzi.de/public.php?service=files&t=48fb2c2e7ba7ace340acf37ffe9803f3
(~1.2 GiB, valid until March 2015)
and I'm processing the data as follows:
zcat debug.maf.gz | parallel --halt-on-error --no-notice --gnu --pipe \
    --recstart '# batch ' --recend '\n\n' 'cat > "$PARALLEL_SEQ".maf'
You will see that only one job and one output file are created because the
first record is the largest one. The output is then truncated after exactly
2 GiB. I think this is a serious issue, as it is silent data corruption and
will affect the analysis if, for instance, biological sequence data is
shortened before analysis.
Info: I'm using the latest version of GNU parallel (20150122) on 64-bit Linux,
Debian 7.
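For illustration only, here is a minimal Python sketch of the one-record-per-output behavior requested in 1), done outside GNU parallel. It is not a fix for the truncation and not part of GNU parallel; the function name split_records and the output directory are hypothetical. It assumes that every record begins on a line starting with '# batch ' (the --recstart marker above) and that individual lines fit in memory even when a whole record does not.

```python
from pathlib import Path

RECSTART = b"# batch "  # same marker as --recstart in the command above


def split_records(stream, out_dir, recstart=RECSTART):
    """Write each record of a binary stream to <out_dir>/<n>.maf.

    Records are streamed line by line, so a record larger than 2 GiB is
    never held in memory as a whole. Returns the number of records written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seq = 0
    out = None
    try:
        for line in stream:
            # The very first line, or any line beginning with the marker,
            # opens a new numbered output file (1.maf, 2.maf, ...).
            if out is None or line.startswith(recstart):
                if out is not None:
                    out.close()
                seq += 1
                out = open(out_dir / f"{seq}.maf", "wb")
            out.write(line)
    finally:
        if out is not None:
            out.close()
    return seq
```

Called as split_records(sys.stdin.buffer, "records") behind zcat, this would produce one file per record. Comparing the summed size of those files with the size of the uncompressed input is one way to check that nothing was cut off.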
Thanks for your help.
Regards, Johannes
--
Johannes Dröge, M.Sc.
Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf
25.12.01.50, Universitätsstraße 1, 40225 Düsseldorf, Germany
PGP: http://keys.fungs.de/6ea5e4.asc (55F2720303A7F236A94666F20E2360727A6EA5E4)
Web: algbio.cs.uni-duesseldorf.de | Tel/Fax: +49 211 81-12644/13464