Re: Compressing release packages better
From: Bruno Haible
Subject: Re: Compressing release packages better
Date: Wed, 08 Jan 2025 10:45:44 +0100

Hi Lasse,
> I noticed that the gettext release .tar.xz files could be smaller. For
> example, gettext-0.23.1.tar.xz is 10.5 MiB but it could be under 7 MiB.
> This obviously isn't an important thing but in case there is interest
> in improving the file size, I wrote a few thoughts below.
Thank you for the suggestions. I had never paid much attention to this
(because today's networks and disks cope well with large files: a
1.5-hour movie that people download is already 1 GB or 2 GB).
> Gettext tarball contains identical copies of several files (mostly
> from Gnulib). When similar files are spread far apart in the tarball,
> LZ77-based compressors cannot deduplicate them unless dictionary size
> (=history buffer size) is big enough.
>
> Sorting similar files together helps all compressors. A small package
> can be produced with a smaller dictionary size and thus less RAM usage.
> Sorting by file basenames is simple and produces good results with
> Gettext (it may be bad with other packages; this isn't generic advice):
>
> ( find gettext-0.23.1 -type d | LC_ALL=C sort
> find gettext-0.23.1 ! -type d -printf '%f\t%p\n' \
> | LC_ALL=C sort | cut -f2
> ) | tar chf - --format=ustar --owner=root --group=root \
> --no-recursion --files-from=- \
> | xz -T1 -9e > new.tar.xz
>
> Results:
>
> Options                              Old size    New size
> -T1 -6e                              10.5 MiB    7.6 MiB
> -T1 -7e                              10.1 MiB    7.4 MiB
> -T1 -8e                               8.7 MiB    7.1 MiB
> -T1 -9e                               8.3 MiB    6.9 MiB
> -T1 --lzma2=preset=9e,dict=128MiB     7.1 MiB    6.9 MiB
> -T1 --lzma2=preset=9e,dict=192MiB     6.8 MiB    6.9 MiB
>
> -T8 -6e                              10.9 MiB    8.0 MiB
> -T4 -7e                              10.4 MiB    7.6 MiB
>
> The downside of sorting is that "tar tf" output isn't pretty.
That's a very nice suggestion. I confirm that it helps all compressors:
  $ tar chf by-dirname.tar --format=ustar --owner=root --group=root gettext-0.23.1
  $ { find gettext-0.23.1 -type d | LC_ALL=C sort; \
      find gettext-0.23.1 ! -type d -printf '%f\t%p\n' | LC_ALL=C sort | cut -f2; } \
    | tar chf by-basename.tar --format=ustar --owner=root --group=root \
          --no-recursion --files-from=-

              by-dirname   by-basename
.tar           157839360     157839360
gzip -9         28592463      26078767
bzip2           20692075      17613661
zstd -3         20748253      14594848
zstd -9         15433711      10619250
zstd -19        12275923       8630896
xz -e           11490324       7958904
lzip -9          8976775       7420203
xz -9 -e         7711432       7286512
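
To see why this ordering helps, it can be useful to look at the file list
itself. The following is a minimal sketch of the listing step only, wrapped
in a function whose name (list_by_basename) is made up for illustration. It
assumes GNU find (for -printf) and file names without tabs or newlines.

```shell
# list_by_basename DIR
# Print all directories below DIR first (sorted by path), then all
# non-directories sorted by their basename, so that same-named copies
# from different directories end up next to each other.
# Assumes GNU find; file names must not contain tabs or newlines.
list_by_basename() {
  find "$1" -type d | LC_ALL=C sort
  find "$1" ! -type d -printf '%f\t%p\n' \
    | LC_ALL=C sort | LC_ALL=C cut -f2
}

# Example use: feed the list to tar, as in the commands above.
#   list_by_basename gettext-0.23.1 \
#     | tar chf by-basename.tar --format=ustar --owner=root --group=root \
#           --no-recursion --files-from=-
```

For a tree that contains a/iconv.m4 and b/iconv.m4, the two copies are
listed on adjacent lines, which is what brings them within reach of a small
dictionary.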
> I see that configure.ac overrides am__tar. Sorting can be put there too:
>
> am__tar='( find "$$tardir" -type d | LC_ALL=C sort ; find "$$tardir" !
> -type d -printf '\''%f\t%p\n'\'' | LC_ALL=C sort | cut -f2 ) | ${AMTAR} chf -
> --format=ustar --owner=root --group=root --no-recursion --files-from=-'
Yes. Done in a slightly different way (no subshell, run 'cut' in the C locale
as well) in
https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=commitdiff;h=c8977e01e48e47cbd9e9c6d4538d0cc5e1fb0110
Thanks for the suggestion!
Note that this is package-specific. For instance, I think the *.po files
compress better in the by-dirname order, but in the gettext tarball the
copied files are more dominant.
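
Since the benefit depends on the package, it seems worth measuring both
orders before adopting this elsewhere. Here is a rough sketch; the function
name compare_orders and its interface are hypothetical. It assumes GNU find
and GNU tar, and a compressor that reads stdin and writes stdout.

```shell
# compare_orders DIR [COMPRESSOR [ARGS...]]
# Print the compressed size in bytes of DIR, archived once in tar's
# default (by-dirname) order and once in by-basename order.
# The default compressor is "xz -e"; any filter such as "gzip -9" works.
compare_orders() {
  dir=$1; shift
  [ "$#" -gt 0 ] || set -- xz -e
  by_dirname=$(tar chf - --format=ustar --owner=root --group=root "$dir" \
                 | "$@" | wc -c)
  by_basename=$(
    { find "$dir" -type d | LC_ALL=C sort
      find "$dir" ! -type d -printf '%f\t%p\n' \
        | LC_ALL=C sort | LC_ALL=C cut -f2
    } | tar chf - --format=ustar --owner=root --group=root \
            --no-recursion --files-from=- \
      | "$@" | wc -c)
  echo "by-dirname  $by_dirname"
  echo "by-basename $by_basename"
}

# Example use:
#   compare_orders gettext-0.23.1 xz -9 -e
```

If the by-basename figure is not clearly smaller for a given package, the
default order is probably the better choice there, since it keeps the
"tar tf" output readable.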
> Admin/release-steps mentions "xz -c -e" and gettext-0.23.1.tar.xz
> matches this. "xz -e" is equivalent to "xz -6e". The -e makes it slower
> while keeping compressor and decompressor memory usage the same as
> "xz -6". With this tarball, going from -6 to -6e reduces file size by
> 1.7 % while making compression take 70 % longer.
>
> Higher settings use more RAM for both compression and decompression.
> xz -6 and -6e use 8 MiB dictionary, -9 and -9e use 64 MiB, but one can
> go higher which happens to help unusually much with this package:
>
> Options                              Size        Comp.Mem.   Decomp.Mem.
> -T1 -6e                              10.5 MiB      94 MiB       9 MiB
> -T1 -9e                               8.3 MiB     674 MiB      65 MiB
> -T1 --lzma2=preset=9e,dict=128MiB     7.1 MiB    1346 MiB     129 MiB
> -T1 --lzma2=preset=9e,dict=192MiB     6.8 MiB    2082 MiB     193 MiB
> ...
> Automake defaults to "xz -e". I think it was sensible in 2010-2011. (See
> the Automake commit c8e01d581a7e.) What is considered an old or low-end
> machine is quite different nowadays. I don't know if requiring 193 MiB
> to decompress a gettext package is acceptable
Decompression memory requirements still matter, though. For example:
- Embedded Linux systems often have only 256 MB of RAM.
- I occasionally use a laptop with 1 GiB of RAM, or a smartphone with
2.75 GiB of RAM.
- In cloud environments, the price is proportional to the RAM size.
Therefore it is not unusual to work with VMs with 0.5 GiB of RAM.
> For comparison, coreutils-9.5.tar.xz has been
> compressed with "xz -T1 -8e" and needs 33 MiB of memory to decompress.
A decompression memory requirement of 33 MiB is reasonable, I would say.
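
As a rule of thumb, the figures quoted in this thread suggest that the
decompressor needs roughly the dictionary size plus 1 MiB (8 -> 9, 64 -> 65,
128 -> 129, 192 -> 193; -8e has a 32 MiB dictionary, which matches the
33 MiB for coreutils). A small sketch of that arithmetic follows; the helper
names are made up, and the "+ 1" is an estimate read off those numbers, not
a documented guarantee.

```shell
# xz_decomp_mem_mib DICT_MIB
# Rough estimate of xz decompression memory in MiB.
# Assumption: dictionary size plus about 1 MiB, as observed in the
# figures quoted above.
xz_decomp_mem_mib() {
  echo $(( $1 + 1 ))
}

# fits_budget DICT_MIB BUDGET_MIB
# Succeed if the estimated decompression memory is within the budget.
fits_budget() {
  [ "$(xz_decomp_mem_mib "$1")" -le "$2" ]
}

# Example use: is a 192 MiB dictionary acceptable if at most 128 MiB
# may be spent on decompression?
#   fits_budget 192 128 || echo "dictionary too large for this budget"
```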
> Bonus: If one uses the long --lzma2 option, appending ",pb=0" helps a
> *tiny* amount (like 0.2 % to 0.6 %) with ASCII/UTF-8 text (including
> source code tarballs) without downsides (apart from making the command
> line uglier). Example:
>
> xz -T1 --lzma2=preset=9e,pb=0
I try to avoid options here that few people use, so as to minimize the
risk of running into trouble. Reducing the .xz size from 11 MB to 8 MB
is good enough; I don't need further tuning if it comes with some risks.
Bruno