Processing a big file using more CPUs
From: Nio Wiklund
Subject: Processing a big file using more CPUs
Date: Mon, 4 Feb 2019 21:52:55 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.4.0
Hi parallel users,
Background
EXAMPLE: Processing a big file using more CPUs
To process a big file or some output you can use --pipe to split up the
data into blocks and pipe the blocks into the processing program.
If the program is gzip -9 you can do:
cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz
This will split bigfile into blocks of 1 MB and pass them to gzip -9 in
parallel. One gzip will be run per CPU. The output of gzip -9 will be
kept in order and saved to bigfile.gz.
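If I read the man page correctly, the block size can also be set
explicitly with --block; something along these lines (my addition, not
part of the quoted example) should give 10 MB blocks instead of the
1 MB default:

cat bigfile | parallel --pipe --recend '' --block 10M -k gzip -9 > bigfile.gz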
Question
I would like to create blocks of a suitable size for each CPU/thread
for binary files, as is possible with --pipepart --block -1 for text
files (files with lines). I have tried, but I can only get the default
block size of 1 MiB.
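For a text file that looks roughly like this (assuming a file named
bigfile; I may be misremembering the exact syntax):

parallel --pipepart -a bigfile --block -1 -k gzip -9 > bigfile.gz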
The reason I want this is that I often create compressed images of the
content of a drive, /dev/sdx, and when using parallel I lose
approximately half of the compression improvement from gzip to xz. The
improvement in speed is good, about 2.5 times, but I think larger
blocks would give xz a chance to achieve a compression ratio much
closer to what it can reach without parallel.
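As a rough workaround I can compute the block size myself, aiming for
roughly one block per CPU, something along these lines (a sketch, not
tested; /dev/sdx and xz -9 are just examples):

SIZE=$(sudo blockdev --getsize64 /dev/sdx)  # total size of the drive in bytes
CPUS=$(nproc)                               # number of CPUs/threads
BLOCK=$(( SIZE / CPUS + 1 ))                # roughly one block per CPU
sudo cat /dev/sdx | parallel --pipe --recend '' -k --block "$BLOCK" xz -9 > sdx.img.xz

But it would be much nicer if parallel could do this directly, the way
--pipepart --block -1 does for text files.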
Is this possible with the current code? If so, how?
Otherwise, I think it would be a good idea to modify the code to make
it possible.
Best regards
Nio