.tar sorting vs .xz compression ratio

It is fairly common knowledge that the ordering of members within an archive can affect the compression ratio. I’ve done some quick testing and the results somewhat surprised me. Firstly, the simplest lexical sorting by name (path) gave the best result. Secondly, the difference between that and sorting by size was as large as 8%.

Note that this is a pretty specific source archive, so results may vary. Test details and commands in the remainder of the post.

Compression results per sort order
Sort order              Size in bytes   Compared to best
name                    108 011 756     100.00%
suffix                  108 573 612     100.52%
size (smallest first)   116 797 440     108.13%
size (largest first)    116 645 940     108.00%
suffix + size           111 709 128     103.42%
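For readers who want to try a comparison like this on their own data, here is a minimal Python sketch. It is not the procedure used for the numbers above: it builds the tar in memory from a dict of path to contents, uses the standard tarfile and lzma modules instead of the tar and xz command-line tools, and the exact definition of each sort key (in particular the tie-breaking by path and the meaning of "suffix") is my assumption.

```python
import io
import lzma
import os
import tarfile


def sort_key(order):
    """Return a key function over (path, size) pairs for the named order.

    The key definitions are assumptions for illustration; ties are
    broken by path so that the ordering is deterministic.
    """
    keys = {
        "name": lambda e: e[0],
        "suffix": lambda e: (os.path.splitext(e[0])[1], e[0]),
        "size (smallest first)": lambda e: (e[1], e[0]),
        "size (largest first)": lambda e: (-e[1], e[0]),
        "suffix + size": lambda e: (os.path.splitext(e[0])[1], e[1], e[0]),
    }
    return keys[order]


def compressed_size(files, order):
    """Pack files ({path: bytes}) into an uncompressed in-memory tar in the
    given order, then return the size in bytes of its xz-compressed form."""
    entries = sorted(((p, len(d)) for p, d in files.items()),
                     key=sort_key(order))
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, size in entries:
            info = tarfile.TarInfo(name=path)
            info.size = size
            tar.addfile(info, io.BytesIO(files[path]))
    # preset=9 is an arbitrary choice here, not taken from the post.
    return len(lzma.compress(buf.getvalue(), preset=9))
```

Calling compressed_size(files, "name") and compressed_size(files, "size (smallest first)") on the same dict gives two byte counts that can be compared directly. For a large tree, the in-memory approach would need to be replaced with streaming to a file.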

The conclusion? Sorting can affect the compression ratio more than I had anticipated. However, all the “obvious” optimizations made the result worse than plain lexical sorting. Perhaps it’s just a matter of well-organized source code keeping similar files in the same directories. Perhaps there is a way to optimize it even more (and beat sorting by name). One interesting option would be to group files into size buckets, and then sort by name within each bucket.
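The bucket idea has not been tested here, but one possible sort key could look like the sketch below. The choice of power-of-two buckets is purely an assumption on my part, and the file list is made up for illustration.

```python
def bucket_then_name_key(path, size):
    """Sort key: power-of-two size bucket first, then path.

    int.bit_length() is 0 for empty files and k for sizes in
    [2**(k-1), 2**k), so files of broadly similar size share a bucket.
    """
    return (size.bit_length(), path)


# Hypothetical (path, size in bytes) pairs, for illustration only.
files = [
    ("src/b.c", 900),
    ("src/a.c", 1000),
    ("doc/readme", 10),
    ("src/big.bin", 70000),
]

ordered = sorted(files, key=lambda e: bucket_then_name_key(*e))
for path, size in ordered:
    print(path, size)
```

Here src/a.c and src/b.c land in the same bucket (512 to 1023 bytes) and are therefore ordered by name, while the tiny and the large file sort before and after them respectively. Wider or narrower buckets would shift the balance between "sort by size" and "sort by name".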

Special thanks to Adrien Nader and Lasse Collin from #tukaani for inspiring me to do this.


Clang in Gentoo now sets default runtimes via config file

The upcoming clang 16 release features substantial improvements to configuration file support. Notably, it adds support for specifying multiple files and better default locations. This enabled Gentoo to finally replace the default-* flags used on sys-devel/clang, effectively giving our users the ability to change defaults without rebuilding the whole of clang.

This change has also been partially backported to clang 15.0.2 in Gentoo, and (unless major problems are reported) will be part of the stable clang 15.x release (currently planned for upcoming 15.0.3).

In this post, I’d like to briefly describe the new configuration file features, how many of them have been backported to 15.x in Gentoo and how defaults are going to be selected from now on.

The future of Python build systems and Gentoo

Anyone following my Twitter could have seen me frequently complaining about things happening around Python build systems. The recent changes feel like people around the Python packaging ecosystem have been strongly focused on building a new infrastructure centered on Python-specific package managers such as pip and flit. Unfortunately, there seems to be very little concern for distribution packagers or backwards compatibility in this process.

In this post, I’d like to discuss how the Python packaging changes are going to affect Gentoo, and what my suggested plan for dealing with them is. In particular, I’d like to focus on three important changes:

  1. Python upstream deprecating the distutils module (and build system), and planning to remove it in Python 3.12.
  2. The overall rise of PEP 517-based build systems and the potential for setuptools dropping UI entirely.
  3. Setuptools upstream deprecating the setup.py install command, and potentially removing it in the future.


The stablereq workflow for Python packages

I have been taking care of periodic mass stabilization of Python packages in Gentoo for some time now. Per Guilherme Amadio’s suggestion, I’d like to share the workflow I use for this. I think it could be helpful to others dealing with large sets of heterogeneous packages.

The workflow requires:

app-portage/mgorny-dev-scripts, v10
