Compressing release packages better
From: Lasse Collin
Subject: Compressing release packages better
Date: Tue, 7 Jan 2025 20:49:32 +0200

Hello!
I noticed that the gettext release .tar.xz files could be smaller. For
example, gettext-0.23.1.tar.xz is 10.5 MiB but it could be under 7 MiB.
This obviously isn't an important thing but in case there is interest
in improving the file size, I wrote a few thoughts below.
Admin/release-steps mentions "xz -c -e" and gettext-0.23.1.tar.xz
matches this. "xz -e" is equivalent to "xz -6e". The -e makes it slower
while keeping compressor and decompressor memory usage the same as
"xz -6". With this tarball, going from -6 to -6e reduces file size by
1.7 % while making compression take 70 % longer.
Higher settings use more RAM for both compression and decompression.
xz -6 and -6e use an 8 MiB dictionary, -9 and -9e use 64 MiB, but one
can go higher, which happens to help unusually much with this package:

    Options                             Size      Comp.Mem.  Decomp.Mem.
    -T1 -6e                             10.5 MiB    94 MiB      9 MiB
    -T1 -9e                              8.3 MiB   674 MiB     65 MiB
    -T1 --lzma2=preset=9e,dict=128MiB    7.1 MiB  1346 MiB    129 MiB
    -T1 --lzma2=preset=9e,dict=192MiB    6.8 MiB  2082 MiB    193 MiB

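A comparison like the one above could be reproduced with a small loop.
This is only a sketch: compare_settings is a hypothetical helper name,
and it reports compressed size only, not memory usage.

```shell
# Hypothetical helper: compress one file with several xz settings in
# single-threaded mode and print the resulting size in bytes.
# Usage: compare_settings FILE OPTION...
compare_settings() {
    input=$1
    shift
    for opts in "$@"; do
        # Each argument is one complete xz option such as -9e or
        # --lzma2=preset=9e,dict=128MiB.
        size=$(xz -T1 "$opts" -c -- "$input" | wc -c)
        printf '%-36s %s bytes\n' "-T1 $opts" "$((size))"
    done
}

# Example call (the tarball name is an assumption):
#   compare_settings gettext-0.23.1.tar -6e -9e --lzma2=preset=9e,dict=128MiB
```

Note that the large dictionary settings need the amount of compressor
memory listed in the table. To my knowledge "xz -lvv file.tar.xz" shows
the memory needed to decompress an existing file.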
I included -T1 (alias --threads=1) above because multithreaded mode is
the default in xz 5.6.x. In xz >= 5.2.0, threading can be enabled with
the command line option. Threading makes compression slightly worse:
with the gettext tarball, "xz -6e" in threaded mode creates a 3.6 %
bigger file.
Automake defaults to "xz -e". I think it was sensible in 2010-2011. (See
the Automake commit c8e01d581a7e.) What is considered an old or low-end
machine is quite different nowadays. I don't know if requiring 193 MiB
to decompress a gettext package is acceptable; if not, I certainly
understand that. For comparison, coreutils-9.5.tar.xz has been
compressed with "xz -T1 -8e" and needs 33 MiB of memory to decompress.
The Gettext tarball contains identical copies of several files (mostly
from Gnulib). When similar files are spread far apart in the tarball,
LZ77-based compressors cannot deduplicate them unless the dictionary
size (= history buffer size) is big enough.
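
The effect of the dictionary size on distant duplicates can be shown
with synthetic data. The sketch below is an illustration, not a
measurement of the gettext tarball: dict_demo is a hypothetical name,
and the 2 MiB chunk size and the two dictionary sizes are arbitrary
choices.

```shell
# Hypothetical demo: a file made of two identical 2 MiB chunks of
# incompressible data. The second chunk can only be encoded as a match
# if the dictionary reaches back at least 2 MiB.
dict_demo() {
    workdir=$(mktemp -d)
    head -c 2097152 /dev/urandom > "$workdir/chunk"
    cat "$workdir/chunk" "$workdir/chunk" > "$workdir/both"
    for dict in 1MiB 4MiB; do
        size=$(xz -T1 --lzma2=preset=6,dict="$dict" -c "$workdir/both" | wc -c)
        printf 'dict=%s %s\n' "$dict" "$((size))"
    done
    rm -rf "$workdir"
}
```

With the 1 MiB dictionary the output should stay close to the full
4 MiB input, while with the 4 MiB dictionary it should be roughly half
of that, because the second copy is within reach of the history buffer.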
Sorting similar files together helps all compressors. A small package
can be produced with a smaller dictionary size and thus less RAM usage.
Sorting by file basenames is simple and produces good results with
Gettext (it may be bad with other packages; this isn't generic advice):

    # Directories first, then all other files ordered by basename.
    ( find gettext-0.23.1 -type d | LC_ALL=C sort
      find gettext-0.23.1 ! -type d -printf '%f\t%p\n' \
        | LC_ALL=C sort | cut -f2
    ) | tar chf - --format=ustar --owner=root --group=root \
          --no-recursion --files-from=- \
      | xz -T1 -9e > new.tar.xz

Results:

    Options                             Old size   New size
    -T1 -6e                             10.5 MiB   7.6 MiB
    -T1 -7e                             10.1 MiB   7.4 MiB
    -T1 -8e                              8.7 MiB   7.1 MiB
    -T1 -9e                              8.3 MiB   6.9 MiB
    -T1 --lzma2=preset=9e,dict=128MiB    7.1 MiB   6.9 MiB
    -T1 --lzma2=preset=9e,dict=192MiB    6.8 MiB   6.9 MiB
    -T8 -6e                             10.9 MiB   8.0 MiB
    -T4 -7e                             10.4 MiB   7.6 MiB

The downside of sorting is that "tar tf" output isn't pretty.
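
To see what order the sorting produces before feeding it to tar, the
two find commands can be wrapped in a function. This is a sketch:
list_by_basename is a hypothetical name, -printf is a GNU find
extension, and file names containing tabs or newlines are assumed not
to occur.

```shell
# Hypothetical helper: print directories first (sorted by path), then
# all other files sorted by basename, one path per line.
list_by_basename() {
    dir=$1
    find "$dir" -type d | LC_ALL=C sort
    # Emit "basename<TAB>path", sort on that, then keep only the path.
    find "$dir" ! -type d -printf '%f\t%p\n' | LC_ALL=C sort | cut -f2
}

# Example call (the directory name is an assumption):
#   list_by_basename gettext-0.23.1 | less
```

Files with the same basename from different directories end up next to
each other, which is what makes the "tar tf" listing look unusual.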
I see that configure.ac overrides am__tar. Sorting can be put there too:
am__tar='( find "$$tardir" -type d | LC_ALL=C sort ; find "$$tardir" ! -type d -printf '\''%f\t%p\n'\'' | LC_ALL=C sort | cut -f2 ) | ${AMTAR} chf - --format=ustar --owner=root --group=root --no-recursion --files-from=-'
Bonus: If one uses the long --lzma2 option, appending ",pb=0" helps a
*tiny* amount (like 0.2 % to 0.6 %) with ASCII/UTF-8 text (including
source code tarballs) without downsides (apart from making the command
line uglier). Example:
xz -T1 --lzma2=preset=9e,pb=0
--
Lasse Collin