X-alpha hexadecimal notation

The most common way to represent hexadecimal (or any other base > 10) numbers is to use the first letters of the alphabet for the extra digits. However, this doesn’t work well for my brain, which insists that since A is the first letter and B is the second, A = 10 + 1, B = 10 + 2, and so on. So I keep having to remember to shift this by one, and judging by the responses to my toot about it, it seems that I’m not alone.
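To make the off-by-one concrete, here is a minimal Python sketch of the conventional notation only (it does not implement the notation proposed in this post). The helper names `conventional_value` and `alphabet_position` are made up for illustration.

```python
import string

# Conventional notation: digits 0-9, then A-F for the values 10-15.
DIGITS = string.digits + "ABCDEF"


def conventional_value(digit: str) -> int:
    """Value of a single conventional hexadecimal digit."""
    return DIGITS.index(digit.upper())


def alphabet_position(letter: str) -> int:
    """1-based position of a letter in the alphabet (A = 1, B = 2, ...)."""
    return ord(letter.upper()) - ord("A") + 1


for letter in "ABCDEF":
    intuitive = 10 + alphabet_position(letter)  # the "A = 10 + 1" reading
    actual = conventional_value(letter)         # what the digit really means
    print(letter, intuitive, actual, intuitive - actual)
```

Every row ends with a difference of 1, which is exactly the shift that has to be remembered.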

I don’t think that I’ve made any useless invention that people would randomly find and say “oh, hey, what a nice unrealistic idea”. It’s time to make one! I present to you: the X-alpha hexadecimal notation!

Naming standards compliance of PEP517 backends

PyPA maintains two standards regarding packaging artifact filenames:

I have decided to give a few popular PEP 517 backends a go and see whether they follow the standards.


The inconsistencies around Python package naming and the new policy

For a long time, the dev-python category in Gentoo did not follow any specific naming policy. Usually we went for whatever made the ebuild easier: the GitHub project name if we happened to be using GitHub archives as distfiles, or the PyPI project name when using source distributions from PyPI. However, this was inconvenient for users, who had a hard time finding specific packages. Historically, we even had cases of developers independently adding a second copy of the same package under a different name.

This is why I eventually started researching the standards for Python package naming and drafting a new policy. The package name policy can now be found in the Gentoo Python Guide. In this post, I’d like to summarize the research that led to it, and the problems that we have yet to face.
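As an illustration of how duplicate packages under different names could be spotted, here is a small Python sketch that groups names by their PEP 503 normalized form. This is an assumption chosen for the example, not a statement of what the Gentoo policy prescribes. The helper `find_duplicates` and the sample list are hypothetical.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """PEP 503 normalization: lowercase, runs of '-', '_' and '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicates(names):
    """Group names that normalize to the same string (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}


# Illustrative spellings only, not taken from the actual dev-python category.
candidates = ["Flask-Babel", "flask_babel", "zope.interface", "requests"]
print(find_duplicates(candidates))
# prints {'flask-babel': ['Flask-Babel', 'flask_babel']}
```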


.tar sorting vs .xz compression ratio

It is pretty common knowledge that the ordering of members within an archive can affect the compression ratio. I’ve done some quick testing and the results somewhat surprised me. Firstly, it turned out that the simplest lexical sorting by name (path) gave the best result. Secondly, the difference between that and sorting by size was as large as 8%.

Note that this is a pretty specific source archive, so results may vary. Test details and commands in the remainder of the post.
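For readers who want to try a comparison like this on their own tree, here is a rough Python sketch using only the standard library. It is not the set of commands used for the numbers in this post. The sort keys are my guess at what each order means, the function names are made up, and the whole archive is held in memory, so it only suits moderately sized trees.

```python
import io
import lzma
import os
import tarfile

# Assumed interpretations of some of the sort orders (path breaks ties).
SORT_KEYS = {
    "name": lambda p: p,
    "suffix": lambda p: (os.path.splitext(p)[1], p),
    "size (smallest first)": lambda p: (os.lstat(p).st_size, p),
}


def list_files(root):
    """Collect paths of all non-directory entries below root."""
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]


def build_tar(paths, key):
    """Return an uncompressed tar archive (bytes) with members sorted by key."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in sorted(paths, key=key):
            tar.add(path, recursive=False)
    return buf.getvalue()


def xz_size(data):
    """Size in bytes of data after .xz compression at preset 9."""
    return len(lzma.compress(data, preset=9))


def compare(root):
    """Map each sort order name to the compressed archive size for root."""
    paths = list_files(root)
    return {name: xz_size(build_tar(paths, key)) for name, key in SORT_KEYS.items()}
```

Calling `compare("path/to/source-tree")` (a placeholder path) returns one compressed size per sort order, which can then be compared against the smallest one.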

Compression results per sort order
Sort order             Size in bytes  Compared to best
name                     108 011 756           100.00%
suffix                   108 573 612           100.52%
size (smallest first)    116 797 440           108.13%
size (largest first)     116 645 940           108.00%
suffix + size            111 709 128           103.42%

The conclusion? Sorting can affect the compression ratio more than I had anticipated. However, all the “obvious” optimizations made the result worse than plain lexical sorting. Perhaps it’s just a matter of well-organized source code keeping similar files in the same directories. Perhaps there is a way to optimize it even more (and beat sorting by name). One interesting option would be to group files into size buckets, and then sort by name within each bucket.
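The bucket idea has not been tested here, but a sort key for it could look like the following Python sketch. The power-of-two buckets are purely my assumption (any bucket boundaries would do), and the paths and sizes are made-up sample data.

```python
def bucket_then_name(path: str, size: int):
    """Sort key: power-of-two size bucket first, then path within the bucket."""
    # int.bit_length() puts sizes 512..1023 into bucket 10, 1024..2047 into 11, etc.
    return (size.bit_length(), path)


# Hypothetical files: path -> size in bytes.
files = {
    "src/a.c": 900,
    "src/b.c": 70_000,
    "doc/x.txt": 1_000,
    "doc/big.pdf": 65_536,
}

order = sorted(files, key=lambda p: bucket_then_name(p, files[p]))
print(order)
# prints ['doc/x.txt', 'src/a.c', 'doc/big.pdf', 'src/b.c']
```

Files of similar size end up next to each other, while the name ordering that worked best in the test is preserved inside each bucket.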

Special thanks to Adrien Nader and Lasse Collin from #tukaani for inspiring me to do this.
