bug-gettext
Re: Compressing release packages better


From: Bruno Haible
Subject: Re: Compressing release packages better
Date: Wed, 08 Jan 2025 10:45:44 +0100

Hi Lasse,

> I noticed that the gettext release .tar.xz files could be smaller. For
> example, gettext-0.23.1.tar.xz is 10.5 MiB but it could be under 7 MiB.
> This obviously isn't an important thing but in case there is interest
> in improving the file size, I wrote a few thoughts below.

Thank you for the suggestions. I had never paid much attention to it
(because today's networks and disks cope well with large files: when people
download a 1.5-hour movie, that's already 1 GB or 2 GB).

> Gettext tarball contains identical copies of several files (mostly
> from Gnulib). When similar files are spread far apart in the tarball,
> LZ77-based compressors cannot deduplicate them unless dictionary size
> (=history buffer size) is big enough.
> 
> Sorting similar files together helps all compressors. A small package
> can be produced with a smaller dictionary size and thus less RAM usage.
> Sorting by file basenames is simple and produces good results with
> Gettext (it may be bad with other packages; this isn't generic advice):
> 
>     ( find gettext-0.23.1 -type d | LC_ALL=C sort
>       find gettext-0.23.1 ! -type d -printf '%f\t%p\n' \
>           | LC_ALL=C sort | cut -f2
>     ) | tar chf - --format=ustar --owner=root --group=root \
>             --no-recursion --files-from=- \
>       | xz -T1 -9e > new.tar.xz
> 
> Results:
> 
>     Options                             Old size   New size
>     -T1 -6e                             10.5 MiB    7.6 MiB
>     -T1 -7e                             10.1 MiB    7.4 MiB
>     -T1 -8e                              8.7 MiB    7.1 MiB
>     -T1 -9e                              8.3 MiB    6.9 MiB
>     -T1 --lzma2=preset=9e,dict=128MiB    7.1 MiB    6.9 MiB
>     -T1 --lzma2=preset=9e,dict=192MiB    6.8 MiB    6.9 MiB
> 
>     -T8 -6e                             10.9 MiB    8.0 MiB
>     -T4 -7e                             10.4 MiB    7.6 MiB
> 
> The downside of sorting is that "tar tf" output isn't pretty.

That's a very nice suggestion. I confirm that it helps all compressors:

$ tar chf by-dirname.tar --format=ustar --owner=root --group=root gettext-0.23.1
$ { find gettext-0.23.1 -type d | LC_ALL=C sort; \
    find gettext-0.23.1 ! -type d -printf '%f\t%p\n' | LC_ALL=C sort | cut -f2; \
  } | tar chf by-basename.tar --format=ustar --owner=root --group=root \
          --no-recursion --files-from=-

                   by-dirname   by-basename   (sizes in bytes)

.tar               157839360    157839360
gzip -9             28592463     26078767
bzip2               20692075     17613661
zstd -3             20748253     14594848
zstd -9             15433711     10619250
zstd -19            12275923      8630896
xz -e               11490324      7958904
lzip -9              8976775      7420203
xz -9 -e             7711432      7286512
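
To put these numbers in proportion, here is a small sketch that computes the
relative saving of by-basename over by-dirname for a few rows of the table.
The function name saving_pct is made up for this illustration; the byte
counts are copied from the table above.

```shell
# Relative size reduction, in percent, when going from OLD to NEW bytes.
# saving_pct is a hypothetical helper name, not part of any tool.
saving_pct () {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) * 100 / old }'
}

# Rows are "by-dirname by-basename compressor", taken from the table above.
while read -r old new name; do
  printf '%-10s %5s %%\n' "$name" "$(saving_pct "$old" "$new")"
done <<'EOF'
28592463 26078767 gzip -9
20748253 14594848 zstd -3
11490324 7958904 xz -e
7711432 7286512 xz -9 -e
EOF
```

The gain is smallest for "xz -9 -e", which fits with the explanation that a
large dictionary can already reach across much of the tarball.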

> I see that configure.ac overrides am__tar. Sorting can be put there too:
> 
>     am__tar='( find "$$tardir" -type d | LC_ALL=C sort ; find "$$tardir" ! -type d -printf '\''%f\t%p\n'\'' | LC_ALL=C sort | cut -f2 ) | ${AMTAR} chf - --format=ustar --owner=root --group=root --no-recursion --files-from=-'

Yes. Done in a slightly different way (no subshell, run 'cut' in the C locale
as well) in
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=commitdiff;h=c8977e01e48e47cbd9e9c6d4538d0cc5e1fb0110
Thanks for the suggestion!

Note that this is package-specific. For instance, I think the *.po files
compress more efficiently in the by-dirname order, but in the gettext tarball
the copied files are more dominant.
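
To see what the by-basename ordering does, it can be tried on a toy tree.
This is only a sketch: the function name list_by_basename and the demo
directory layout are invented for the illustration, and it needs GNU find
for -printf, like the commands above.

```shell
# Print DIR's directories first, then its non-directories ordered by basename.
# list_by_basename is a hypothetical name; requires GNU find (-printf).
list_by_basename () {
  find "$1" -type d | LC_ALL=C sort
  find "$1" ! -type d -printf '%f\t%p\n' | LC_ALL=C sort | LC_ALL=C cut -f2
}

# Hypothetical demo tree: the same file name in two distant directories.
demo=$(mktemp -d)
mkdir -p "$demo/pkg/a/lib" "$demo/pkg/z/lib"
echo 'int x;' > "$demo/pkg/a/lib/dup.c"
echo 'int x;' > "$demo/pkg/z/lib/dup.c"
echo 'other'  > "$demo/pkg/a/lib/zzz.c"
echo 'other'  > "$demo/pkg/z/lib/aaa.c"

# The two dup.c copies end up on adjacent lines of the file list.
(cd "$demo" && list_by_basename pkg)
```

Feeding such a list to "tar --no-recursion --files-from=-" stores the two
copies next to each other, so even a compressor with a small window can see
the second one as a repetition of the first. The temporary tree can be
removed with rm -rf "$demo" afterwards.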

> Admin/release-steps mentions "xz -c -e" and gettext-0.23.1.tar.xz
> matches this. "xz -e" is equivalent to "xz -6e". The -e makes it slower
> while keeping compressor and decompressor memory usage the same as
> "xz -6". With this tarball, going from -6 to -6e reduces file size by
> 1.7 % while making compression take 70 % longer.
> 
> Higher settings use more RAM for both compression and decompression.
> xz -6 and -6e use 8 MiB dictionary, -9 and -9e use 64 MiB, but one can
> go higher which happens to help unusually much with this package:
> 
>     Options                                Size  Comp.Mem.  Decomp.Mem.
>     -T1 -6e                            10.5 MiB     94 MiB        9 MiB
>     -T1 -9e                             8.3 MiB    674 MiB       65 MiB
>     -T1 --lzma2=preset=9e,dict=128MiB   7.1 MiB   1346 MiB      129 MiB
>     -T1 --lzma2=preset=9e,dict=192MiB   6.8 MiB   2082 MiB      193 MiB
> ...
> Automake defaults to "xz -e". I think it was sensible in 2010-2011. (See
> the Automake commit c8e01d581a7e.) What is considered an old or low-end
> machine is quite different nowadays. I don't know if requiring 193 MiB
> to decompress a gettext package is acceptable

Decompression memory requirements still matter, though. For example:
  - Embedded Linux systems often have only 256 MB of RAM.
  - I occasionally use a laptop with 1 GiB of RAM, or a smartphone with
    2.75 GiB of RAM.
  - In cloud environments, the price is proportional to the RAM size.
    Therefore it is not unusual to work with VMs with 0.5 GiB of RAM.

> For comparison, coreutils-9.5.tar.xz has been
> compressed with "xz -T1 -8e" and needs 33 MiB of memory to decompress.

A decompression memory requirement of 33 MiB is reasonable, I would say.
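
The decompression figures quoted in this thread all come out as the
dictionary size plus about 1 MiB (8 -> 9, 64 -> 65, 128 -> 129, 192 -> 193,
and 33 for "-8e" with its 32 MiB dictionary). A rough sketch of that rule of
thumb, with decomp_mem_mib being a made-up helper name and the "+ 1" an
approximation read off those numbers, not a guarantee:

```shell
# Approximate decompression memory in MiB for a given LZMA2 dictionary size
# in MiB. The "+ 1" is an approximation read off the figures in this thread.
decomp_mem_mib () {
  echo $(( $1 + 1 ))
}

for dict in 8 32 64 128 192; do
  printf 'dict=%3d MiB -> about %3d MiB to decompress\n' \
         "$dict" "$(decomp_mem_mib "$dict")"
done
```

For an existing file, "xz --list -vv FILE.xz" should report the actual
memory needed for decompression.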

> Bonus: If one uses the long --lzma2 option, appending ",pb=0" helps a
> *tiny* amount (like 0.2 % to 0.6 %) with ASCII/UTF-8 text (including
> source code tarballs) without downsides (apart from making the command
> line uglier). Example:
> 
>     xz -T1 --lzma2=preset=9e,pb=0

I try to avoid options here that few people use, so as to minimize the
risk of running into trouble. Reducing the .xz size from 11 MB to 8 MB
is good enough; I don't need further tuning if it comes with some risks.

Bruno





