Compressing release packages better

bug-gettext

From:	Lasse Collin
Subject:	Compressing release packages better
Date:	Tue, 7 Jan 2025 20:49:32 +0200

Hello!

I noticed that the gettext release .tar.xz files could be smaller. For
example, gettext-0.23.1.tar.xz is 10.5 MiB but it could be under 7 MiB.
This obviously isn't an important thing but in case there is interest
in improving the file size, I wrote a few thoughts below.

Admin/release-steps mentions "xz -c -e" and gettext-0.23.1.tar.xz
matches this. "xz -e" is equivalent to "xz -6e". The -e makes it slower
while keeping compressor and decompressor memory usage the same as
"xz -6". With this tarball, going from -6 to -6e reduces file size by
1.7 % while making compression take 70 % longer.
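
To reproduce this kind of size comparison on another tarball, a minimal
sketch follows. The helper name xz_size is made up for illustration, and
the sketch assumes xz >= 5.2.0 so that -T1 is accepted.

```shell
# Print the compressed size in bytes that xz produces for FILE with the
# given options, without writing any output file.
# Usage: xz_size FILE [xz options...]
xz_size() {
    file=$1
    shift
    # -T1 forces single-threaded mode so results are comparable across
    # xz versions; -c writes the compressed stream to stdout.
    xz -T1 -c "$@" -- "$file" | wc -c
}

# Example (the uncompressed tarball name is a placeholder):
#   xz_size gettext-0.23.1.tar -6
#   xz_size gettext-0.23.1.tar -6e
#   time xz -T1 -6e -c gettext-0.23.1.tar > /dev/null
```

The exact percentages depend on the input and the xz version, so the
numbers above should be expected to vary a little.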

Higher settings use more RAM for both compression and decompression.
xz -6 and -6e use an 8 MiB dictionary and -9 and -9e use 64 MiB, but one
can go higher, which happens to help unusually much with this package:

    Options                                Size  Comp.Mem.  Decomp.Mem.
    -T1 -6e                            10.5 MiB     94 MiB        9 MiB
    -T1 -9e                             8.3 MiB    674 MiB       65 MiB
    -T1 --lzma2=preset=9e,dict=128MiB   7.1 MiB   1346 MiB      129 MiB
    -T1 --lzma2=preset=9e,dict=192MiB   6.8 MiB   2082 MiB      193 MiB
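
One way to check whether a given .xz file can be decompressed within
some memory budget is to run the integrity test with a hard limit. This
is only a sketch: the function name fits_in_memory is hypothetical, and
it relies on the --memlimit-decompress option of xz.

```shell
# Succeed if FILE (a .xz file) passes "xz -t" while the decoder is
# limited to LIMIT of memory, for example 64MiB.
# Usage: fits_in_memory FILE LIMIT
fits_in_memory() {
    # -t decompresses to nowhere; -T1 keeps the decoder single-threaded;
    # xz fails with an error if the limit is too low for the file.
    xz -t -T1 --memlimit-decompress="$2" -- "$1"
}

# Example (file name is a placeholder):
#   fits_in_memory gettext-0.23.1.tar.xz 64MiB && echo "fits in 64 MiB"
```

Running "xz -lvv" on the file should also report the memory needed for
decompression among its per-file details.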

I included -T1 (alias --threads=1) above because multithreaded mode is
the default in xz 5.6.x. In xz >= 5.2.0, threading can be enabled with
the -T option. Threading makes compression slightly worse: with the
gettext tarball, "xz -6e" in threaded mode creates a 3.6 % bigger file.

Automake defaults to "xz -e". I think it was sensible in 2010-2011. (See
the Automake commit c8e01d581a7e.) What is considered an old or low-end
machine is quite different nowadays. I don't know if requiring 193 MiB
to decompress a gettext package is acceptable; if not, I certainly
understand that. For comparison, coreutils-9.5.tar.xz has been
compressed with "xz -T1 -8e" and needs 33 MiB of memory to decompress.

The gettext tarball contains identical copies of several files (mostly
from Gnulib). When similar files are spread far apart in the tarball,
LZ77-based compressors cannot deduplicate them unless the dictionary
size (= history buffer size) is big enough.
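
The effect can be seen in isolation with synthetic data. The sketch
below is an illustration only (the function name dict_demo and the sizes
are made up): it places two identical 1 MiB blocks of random data 5 MiB
apart and compresses the result with a dictionary that is smaller, then
larger, than that distance.

```shell
# Compress "A + filler + A" with two dictionary sizes and print the
# resulting sizes. Random data is incompressible by itself, so any
# saving comes from finding the second copy of A in the dictionary.
dict_demo() {
    dir=$(mktemp -d) || return 1
    head -c 1048576 /dev/urandom > "$dir/a"        # 1 MiB block A
    head -c 4194304 /dev/urandom > "$dir/filler"   # 4 MiB of other data
    cat "$dir/a" "$dir/filler" "$dir/a" > "$dir/input"
    small=$(xz -T1 -c --lzma2=preset=6,dict=1MiB "$dir/input" | wc -c)
    large=$(xz -T1 -c --lzma2=preset=6,dict=8MiB "$dir/input" | wc -c)
    echo "dict=1MiB: $small bytes"
    echo "dict=8MiB: $large bytes"
    rm -rf "$dir"
}
```

With the 8 MiB dictionary the second copy of A should cost almost
nothing, so the output should be roughly 1 MiB smaller than with the
1 MiB dictionary.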

Sorting similar files together helps all compressors. A small package
can be produced with a smaller dictionary size and thus less RAM usage.
Sorting by file basenames is simple and produces good results with
Gettext (it may be bad with other packages; this isn't generic advice):

    ( find gettext-0.23.1 -type d | LC_ALL=C sort
      find gettext-0.23.1 ! -type d -printf '%f\t%p\n' \
          | LC_ALL=C sort | cut -f2
    ) | tar chf - --format=ustar --owner=root --group=root \
            --no-recursion --files-from=- \
      | xz -T1 -9e > new.tar.xz
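
The -printf action used above is a GNU find extension. Where it isn't
available, the same "basename, tab, path" sort key can be built with
awk. This is a sketch under the assumption that no path contains a tab
or a newline; the function name sort_by_basename is made up.

```shell
# Read paths on stdin, one per line, and print them ordered by basename,
# with the full path as tie-breaker.
sort_by_basename() {
    # Prefix each path with its last "/"-separated component and a tab,
    # sort bytewise on the whole line, then strip the prefix again.
    awk '{ n = split($0, parts, "/"); printf "%s\t%s\n", parts[n], $0 }' \
        | LC_ALL=C sort | cut -f2-
}

# Example with made-up paths:
#   printf '%s\n' pkg/src/main.c pkg/tests/config.h pkg/lib/config.h \
#       | sort_by_basename
# prints pkg/lib/config.h, pkg/tests/config.h, pkg/src/main.c in that
# order, so the two config.h copies end up next to each other.
```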

Results:

    Options                             Old size   New size
    -T1 -6e                             10.5 MiB    7.6 MiB
    -T1 -7e                             10.1 MiB    7.4 MiB
    -T1 -8e                              8.7 MiB    7.1 MiB
    -T1 -9e                              8.3 MiB    6.9 MiB
    -T1 --lzma2=preset=9e,dict=128MiB    7.1 MiB    6.9 MiB
    -T1 --lzma2=preset=9e,dict=192MiB    6.8 MiB    6.9 MiB

    -T8 -6e                             10.9 MiB    8.0 MiB
    -T4 -7e                             10.4 MiB    7.6 MiB

The downside of sorting is that "tar tf" output isn't pretty.
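
If a readable listing is needed, the output can be sorted after the
fact, and the same trick gives a quick sanity check that a re-sorted
tarball still has the same members as the original. A sketch with
made-up helper names; it compares member names only, not file contents.

```shell
# Print the member names of a tarball in path order.
tar_list_sorted() {
    tar tf "$1" | LC_ALL=C sort
}

# Succeed if two tarballs contain the same set of member names,
# regardless of the order in which the members were stored.
same_members() {
    a=$(tar_list_sorted "$1") && b=$(tar_list_sorted "$2") && [ "$a" = "$b" ]
}

# Example (file names are placeholders):
#   tar_list_sorted new.tar.xz | less
#   same_members gettext-0.23.1.tar.xz new.tar.xz && echo "same members"
```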

I see that configure.ac overrides am__tar. Sorting can be put there too:

    am__tar='( find "$$tardir" -type d | LC_ALL=C sort ; find "$$tardir" ! -type d -printf '\''%f\t%p\n'\'' | LC_ALL=C sort | cut -f2 ) | ${AMTAR} chf - --format=ustar --owner=root --group=root --no-recursion --files-from=-'

Bonus: If one uses the long --lzma2 option, appending ",pb=0" helps a
*tiny* amount (like 0.2 % to 0.6 %) with ASCII/UTF-8 text (including
source code tarballs) without downsides (apart from making the command
line uglier). Example:

    xz -T1 --lzma2=preset=9e,pb=0

-- 
Lasse Collin
