Why you can’t rely on repository format (PMS)

You should know already that you are not supposed to rely on Portage internals in ebuilds — all variables, functions and helpers that are not defined by the PMS. You probably know that you are not supposed to touch various configuration files, vdb and other Portage files as well. What most people don’t seem to understand, you are not supposed to make any assumptions about the ebuild repository either. In this post, I will expand on this and try to explain why.

What PMS specifies, what you can rely on

I think the first confusing point is that PMS actually defines the repository format pretty thoroughly. However, it does not specify that you can rely on that format being visible from within ebuild environment. It just defines a few interfaces that you can reliably use, some of them in fact quite consistent with the repository layout.

You should really look as the PMS-defined repository format as an input specification. This is the format that the developers are supposed to use when writing ebuilds, and that all basic tools are supposed to support. However, it does not prevent the package managers from defining and using other package formats, as long as they provide the environment compliant with the PMS.

In fact, this is how binary packages are implemented in Gentoo. The PMS does not define any specific format for them. It only defines a few basic rules and facilities, and both Portage and Paludis implement their own binary package formats. The package managers expose APIs required by the PMS, and can use them to run the necessary pkg_* phases.

However, the problem is not limited to two currently used binary package formats. This is a generic goal of being able to define any new package format in the future, and make it work out of the box with existing ebuilds. Imagine just a few possibilities: more compact repository formats (i.e. not requiring hundreds of unpacked files), fetching only needed ebuild files…

Sadly, none of this can even start being implemented if developers continuosly insist to rely on specific repository layout.

The *DIR variables

Let’s get into the details and iterate over the few relevant variables here.

First of all, FILESDIR. This is the directory where ebuild support files are provided throughout src_* phases. However, there is no guarantee that this will be exactly the directory you created in the ebuild repository. The package manager just needs to provide the files in some directory, and this directory may not actually exist before the first src_* phase. This implies that the support files may not even exist at all when installing from a binary package, and may be created (copied, unpacked) later when doing a source build.

The next variable listed by the PMS is DISTDIR. While this variable is somewhat similar to the previous one, some developers are actually eager to make the opposite assumption. Once again, the package manager may provide the path to any directory that contains the downloaded files. This may be a ‘shadow’ directory containing only files for this package, or it can be any system downloads directory containing lots of other files. Once again, you can’t assume that DISTDIR will exist before src_*, and that it will exist at all (and contain necessary files) when the build is performed using a binary package.

The two remaining variables I would like to discuss are PORTDIR and ECLASSDIR. Those two are a cause of real mayhem: they are completely unsuited for a multi-repository layout modern package managers use and they enforce a particular source repository layout (they are not available outside src_* phases). They pretty much block any effort on improvement, and sadly their removal is continuously blocked by a few short-sighted developers. Nevertheless, work on removing them is in progress.

Environment saving

While we’re discussing those matters, a short note on environment saving is worth being written. By environment saving we usually mean the magic that causes the variables set in one phase function to be carried to a phase function following it, possibly over a disjoint sequence of actions (i.e. install followed by uninstall).

A common misunderstanding is to assume the Portage model of environment saving — i.e. basically dumping a whole ebuild environment including functions into a file. However, this is not sanctioned by the PMS. The rules require the package manager to save only variables, and only those that are not defined in global scope. If phase functions define functions, there is no guarantee that those functions will be preserved or restored. If phases redefine global variables, there is no guarantee that the redefinition will be preserved.

In fact, the specific wording used in the PMS allows a completely different implementation to be used. The package manager may just snapshot defined functions after processing the global scope, or even not snapshot them at all and instead re-read the ebuild (and re-inherit eclasses) every time the execution continues. In this case, any functions defined during phase function are lost.

Is there a future in this?

I hope this clears up all the misunderstandings on how to write ebuilds so that they will work reliably, both for source and binary builds. If those rules are followed, our users can finally start expecting some fun features to come. However, before that happens we need to fix the few existing violations — and for that to happen, we need a few developers to stop thinking only of their own convenience.

What PMS specifies, what you can rely on

The *DIR variables

Environment saving

Is there a future in this?

Leave a Reply Cancel reply