The inconsistencies around Python package naming and the new policy

For a long time, the dev-python category in Gentoo did not follow any specific naming policy. Usually we went for what made the ebuild easier: the GitHub project name if we happened to be using GitHub archives as distfiles, or the PyPI project name when using source distributions from PyPI. However, this was inconvenient for users, who had a hard time finding specific packages. Historically, we even had cases of developers independently adding a second copy of the same package under a different name.

This is why I eventually started researching the standards for Python package naming and drafting a new policy. The package name policy can now be found in the Gentoo Python Guide. In this post, I’d like to summarize the research that led to forming it, and the problems that we have yet to face.

Python module, package and distribution names

The rules for Python module names are defined in the documentation of the syntax of the import statement. It basically indicates that the module name must be a valid identifier. It’s scary Unicode stuff. However, the rough idea is that it should start with a letter or an underscore, followed by more letters, digits or underscores, or something like that.
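To make the "valid identifier" rule a bit more tangible, here is a minimal sketch in Python. The helper name is_importable_name is my own invention for illustration, and rejecting keywords is an extra assumption on my part (a module called class.py cannot be named in a plain import statement).

```python
import keyword


def is_importable_name(name: str) -> bool:
    """Check that every dotted component is a valid, non-keyword identifier.

    Hypothetical helper for illustration only; it relies on
    str.isidentifier(), which applies Python's Unicode identifier rules.
    """
    return all(
        part.isidentifier() and not keyword.iskeyword(part)
        for part in name.split(".")
    )


for candidate in ["jupyter_server", "_private", "jupyter-server", "2to3", "class"]:
    print(f"{candidate!r}: {is_importable_name(candidate)}")
```

Note how jupyter-server fails: the hyphen that is perfectly fine in a project name can never appear in a module name.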

Then, PEP 8 recommends:

Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

PEP 8 – Style Guide for Python Code: Package and Module Names

As Python packages are published on PyPI, they become accessible via PyPI project names. The current rules for package names are described in the Package name normalization specification:

Valid non-normalized names

A valid name consists only of ASCII letters and numbers, period, underscore and hyphen. It must start and end with a letter or number. […]

Normalization

The name should be lowercased with all runs of the characters ., -, or _ replaced with a single - character. […]

Package name normalization
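The two rules quoted above translate into Python fairly directly. The sketch below follows the regular expressions given in the specification; the function names are mine, and I spell out both letter cases in the character class instead of using a case-insensitive flag, to keep the match strictly ASCII.

```python
import re

# A valid (non-normalized) name: ASCII letters, digits, '.', '_' and '-',
# starting and ending with a letter or digit.
_VALID_NAME = re.compile(r"[A-Za-z0-9]|[A-Za-z0-9][A-Za-z0-9._-]*[A-Za-z0-9]")


def is_valid_name(name: str) -> bool:
    """Return True if the whole string is a valid non-normalized name."""
    return _VALID_NAME.fullmatch(name) is not None


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of '.', '-', '_' into one '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


print(normalize("Flask-Babel"))     # flask-babel
print(normalize("jupyter_server"))  # jupyter-server
print(normalize("zope.interface"))  # zope-interface
```

Two names refer to the same project exactly when their normalized forms are equal.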

Now, let’s look at source distribution names. They were normalized in PEP 625. It indicates that the distribution name (i.e. the package name) should be normalized according to the wheel specification. This spec says:

In distribution names, any run of -_. characters (HYPHEN-MINUS, LOW LINE and FULL STOP) should be replaced with _ (LOW LINE), and uppercase characters should be replaced with corresponding lowercase ones. This is equivalent to regular name normalization followed by replacing - with _. Tools consuming wheels must be prepared to accept . (FULL STOP) and uppercase letters, however, as these were allowed by an earlier version of this specification.

Binary distribution format, Escaping and Unicode
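For comparison with the project name normalization, here is a rough sketch of the filename variant. The function names are made up for this post. PEP 625 also asks for the version to be normalized, which I am skipping here: the version is passed through untouched, and .tar.gz is assumed as the sdist extension.

```python
import re


def normalize_for_filename(name: str) -> str:
    """Lowercase the name and collapse runs of '-', '_', '.' into one '_'."""
    return re.sub(r"[-_.]+", "_", name).lower()


def sdist_filename(name: str, version: str) -> str:
    """Build the sdist filename for a project (version used as given)."""
    return f"{normalize_for_filename(name)}-{version}.tar.gz"


print(normalize_for_filename("Flask-Babel"))     # flask_babel
print(sdist_filename("jupyter-server", "2.2.1"))  # jupyter_server-2.2.1.tar.gz
```

Since hyphens never survive in the name part, the only hyphen left in the filename is the one separating name from version, which is what makes the filename unambiguous to split.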

So, to summarize. Baroque module names aside, package names consist of lowercase letters, uppercase letters, digits and the three symbols -, _ and . (hyphen, underscore and period). Package matching (e.g. when resolving dependencies) is done using normalized names, that is, names converted to lowercase with runs of special symbols replaced by hyphens. However, e.g. PEP 503 says that:

Repositories MAY redirect unnormalized URLs to the canonical normalized URL […], however clients MUST NOT rely on this redirection and MUST request the normalized URL.

PEP 503: Simple Repository API, Specification

Source distribution names originally followed project names. However, the recent standards require that they are normalized instead, following almost the same normalization rules — except that underscores are used in place of hyphens.

Finally, PEP 423 deserves an honorable mention, as it attempted to provide good guidelines for naming packages. However, it was deferred.

Now that we’re past the Python standards, let’s see how all that affects Gentoo.

Gentoo package names

Valid Gentoo package names are specified as:

A package name may contain any of the characters [A-Za-z0-9+_-]. It must not begin with a hyphen or a plus sign, and must not end in a hyphen followed by anything matching the version syntax […].

Package Manager Specification (as of EAPI 8), 3.1.2 Package names

There are two incompatibilities with Python package names here: dots are not allowed, and anything matching the version syntax is not allowed at the end of the name. However, the latter is pretty rare.
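The PMS rule can be approximated with a couple of regular expressions. This is only a sketch: the version pattern below is my own reading of the PMS version syntax rather than text copied from the specification, and the function name is hypothetical.

```python
import re

_NAME_CHARS = re.compile(r"[A-Za-z0-9+_-]+")

# Assumed approximation of the PMS version syntax: number part, optional
# letter, optional _alpha/_beta/_pre/_rc/_p suffixes, optional revision.
_VERSION = r"[0-9]+(\.[0-9]+)*[a-z]?(_(alpha|beta|pre|rc|p)[0-9]*)*(-r[0-9]+)?"
_ENDS_LIKE_VERSION = re.compile(rf"-{_VERSION}\Z")


def is_valid_gentoo_pn(name: str) -> bool:
    """Check a candidate Gentoo package name against the quoted rule."""
    if not _NAME_CHARS.fullmatch(name):
        return False  # empty, or contains a disallowed character such as '.'
    if name[0] in "-+":
        return False  # must not begin with a hyphen or a plus sign
    # must not end in a hyphen followed by something that looks like a version
    return _ENDS_LIKE_VERSION.search(name) is None


for candidate in ["jupyter-server", "zope.interface", "zope-interface", "foo-1"]:
    print(f"{candidate!r}: {is_valid_gentoo_pn(candidate)}")
```

Both incompatibilities show up here: zope.interface is rejected because of the dot, and a name such as foo-1 is rejected because its tail could be mistaken for a version.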

The primary goal behind the new policy was to make the Gentoo package names predictable while leaving reasonable flexibility for the developers. A secondary goal was to allow for better naming consistency in the face of upstream inconsistency. For example, while many Flask packages use titlecase names, Flask-Babel has recently switched to lowercase naming. On top of that, Gentoo developers tend to prefer lowercase names themselves.

Hence, the policy requires that normalized upstream names match Gentoo package names after performing an equivalent normalization. It also recommends using consistent naming rules within package groups, and requires replacing dots with hyphens.
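As a rough illustration of what that requirement means in practice, the check could look something like the sketch below. This is not the actual QA tooling; the function names are made up, and the check covers only the name-matching and no-dots parts of the policy.

```python
import re


def normalize(name: str) -> str:
    """PyPI-style normalization: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def follows_naming_policy(gentoo_pn: str, pypi_name: str) -> bool:
    """Return True if the Gentoo name matches the PyPI name after normalization.

    Dots are rejected up front, since they are not valid in Gentoo names
    and the policy asks for them to be replaced with hyphens.
    """
    return "." not in gentoo_pn and normalize(gentoo_pn) == normalize(pypi_name)


print(follows_naming_policy("flask-babel", "Flask-Babel"))        # True
print(follows_naming_policy("zope-interface", "zope.interface"))  # True
print(follows_naming_policy("babel", "Flask-Babel"))              # False
```

The flexibility is visible here: Flask-Babel, flask-babel and flask_babel would all be acceptable Gentoo names for the same PyPI project.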

It’s not perfect but should be good enough for the usual case-insensitive package search. However, this is not where the problems end.

Modern and legacy PyPI download URLs

Quite some time ago PyPI switched to using “hashed” download URLs. That is, every source distribution is normally accessed by a URL such as:

https://files.pythonhosted.org/packages/20/2e/36e46173a288c1c40853ffdb712c67e0e022df0e1ce50b7b1b50066b74d4/gpep517-13.tar.gz

This would obviously be a major pain for Gentoo, since the developers would have to explicitly store the hash in SRC_URI and update it for every version bump. Therefore, we have stayed with “legacy” URLs that continue working to this day:

https://files.pythonhosted.org/packages/source/g/gpep517/gpep517-13.tar.gz

In this URL, the distfile path consists of three parts: the first letter of the project name, the project name and the filename. The project name matching is subject to normalization but the filename must match exactly. So for example, the “canonical” URL for jupyter-server would be:

https://files.pythonhosted.org/packages/source/j/jupyter-server/jupyter_server-2.2.1.tar.gz

However, the following also works:

https://files.pythonhosted.org/packages/source/j/jupyter_server/jupyter_server-2.2.1.tar.gz

But using jupyter-server-2.2.1.tar.gz won’t work!

At this point, I’m wondering: should we actually be using the original project name (i.e. jupyter-server) or the project name normalized for sdist names (i.e. jupyter_server)?

pypi.eclass comes into the picture

For a long time, PyPI distfiles were referenced using constructs similar to the following:

SRC_URI="mirror://pypi/${PN::1}/${PN}/${P}.tar.gz"

However, this had two disadvantages. Firstly, it was inconvenient when the package name needed to be transformed. This led to ugly-ish constructs like:

MY_P=${P^}
SRC_URI="mirror://pypi/${MY_P::1}/${PN^}/${MY_P}.tar.gz"
S=${WORKDIR}/${MY_P}

(or even worse, with an inline uppercase F there!)

Secondly, it meant trouble if the legacy URLs ever stopped working and had to be replaced by something using different syntax.

To address both of these concerns and make addressing PyPI source distributions easier, pypi.eclass was introduced. It sets a default SRC_URI that’s suitable for exact package name and version match, and provides helper functions for generating other URLs. These functions (optionally) take project name and version as arguments, so the latter could be rewritten as:

inherit pypi

SRC_URI="$(pypi_sdist_url "${PN^}")"
S=${WORKDIR}/${P^}

Not a major gain, but at least it abstracts away that annoying “first letter” part.

When I designed the eclass, I didn’t take into consideration the different rules applying to project and source distribution names. Technically, the URL for the jupyter-server package should be:

SRC_URI="mirror://pypi/${PN::1}/${PN}/${PN//-/_}-${PV}.tar.gz"
S=${WORKDIR}/${PN//-/_}-${PV}

However, pypi.eclass can’t express that (right now)! The good news is that it’s not that big a deal: if we just give ${PN//-/_}, the URL will work anyway thanks to normalized project name matching. Now, this isn’t required to work according to the Simple Repository API, but then, legacy URLs aren’t guaranteed to work at all, so we can probably put both into the same bag.

It feels a bit bad though.

Summary

Python packaging is improving, or at least trying to improve. What used to be a number of conventions that weren’t always followed through is being replaced by a maze of standards, specifications and more standards.

We have standards for package names, and standards for source distribution names. They don’t necessarily wholly agree and they don’t guarantee consistency, but it’s historical baggage we have to live with. We have to figure out how to make the best out of it.

We have a new policy that tries to reasonably follow PyPI naming to make it easier to find Gentoo counterparts of PyPI packages. At the same time, we are trying to fit upstream names into Gentoo naming, and introduce order and consistency where it’s missing. We have a new eclass that was supposed to make fetching artifacts from PyPI easier.

Unfortunately, it’s still unclear how we should proceed. Should we lowercase all package names, to avoid inconsistency like when Flask-Babel was suddenly renamed to flask-babel? Should we make the pypi.eclass default to normalized filenames and require overrides for most of the Python build systems that generate non-normalized artifacts (sigh), or should we require explicit overrides when that actually happens?

How can we sleep soundly with so much disorder in the world!?
