One jobserver to rule them all

A common problem with running Gentoo builds is concurrency. Many packages include extensive build steps that are either fully serial, or cannot fully utilize the available CPU threads throughout. This problem becomes less pronounced when building multiple packages in parallel, but then we risk overscheduling for packages that do take advantage of parallel builds.

Fortunately, there are a few tools at our disposal that can improve the situation. Most recently, they were joined by two experimental system-wide jobservers: guildmaster and steve. In this post, I’d like to provide the background on them, and discuss the problems they are facing.

The job multiplication problem

You can use the MAKEOPTS variable to specify a number of parallel jobs to run:

MAKEOPTS="-j12"

This is used not only by GNU make, but it is also recognized by a plethora of eclasses and ebuilds, and converted into appropriate options for various builders, test runners and other tools that can benefit from concurrency. So far, that’s good news; whenever we can, we’re going to run 12 jobs and utilize all the CPU threads.

The problems start when we’re running multiple builds in parallel. This could be either due to running emerge --jobs, or simply needing to start another emerge process. The latter happens to me quite often, as I am testing multiple packages simultaneously.

For example, if we end up building four packages simultaneously, and all of them support -j, we may end up spawning 48 jobs. The issue isn’t just saturating the CPU; imagine you’re running 48 memory-hungry C++ compilers simultaneously!
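The arithmetic behind that figure is just a multiplication, shown here as a tiny sketch. The function name is made up for illustration.

```python
def worst_case_jobs(jobs_per_build: int, parallel_builds: int) -> int:
    """Upper bound on simultaneous jobs when every build honours -jN on its own."""
    return jobs_per_build * parallel_builds


# MAKEOPTS="-j12" with four packages building at once
print(worst_case_jobs(12, 4))  # 48
```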

Load-average scheduling to the rescue

One possible workaround is to use the --load-average option, e.g.:

MAKEOPTS="-j12 -l13"

This causes tools supporting the option not to start new jobs if the current load exceeds 13, which roughly approximates 13 processes running simultaneously. However, the option isn’t universally supported, and the exact behavior differs from tool to tool. For example, CTest doesn’t start any jobs when the load is exceeded, effectively stopping test execution, whereas GNU make and Ninja throttle themselves down to one job.

Of course, this is a rough approximation. While GNU make attempts to establish the current load from /proc/loadavg, most tools just use the one-minute average from getloadavg(), suffering from some lag. It is entirely possible to end up with interspersed periods of overscheduling while the load is still ramping up, followed by periods of underscheduling before it decreases again. Still, it is better than nothing, and can become especially useful for providing background load for other tasks: a build process that can utilize the idle CPU threads, and back down when other builds need them.
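To make the throttling logic concrete, here is a minimal sketch of a load-average gate. It is not how any particular tool implements the option: the function name and the always_keep_one parameter are assumptions for illustration, and the only real API used is os.getloadavg() (available on Unix).

```python
import os


def may_start_job(max_load, running_jobs, always_keep_one=True, loadavg=None):
    """Decide whether one more job may be started under a -l style limit."""
    # Throttling down to a single job (as described for GNU make and Ninja)
    # means the first job is always allowed. Pass always_keep_one=False to
    # model a tool that starts nothing at all while the load is exceeded.
    if always_keep_one and running_jobs == 0:
        return True
    if loadavg is None:
        loadavg = os.getloadavg  # Unix-only; returns (1min, 5min, 15min)
    one_minute, _five, _fifteen = loadavg()
    # The one-minute average lags behind reality, hence the over- and
    # underscheduling periods described above.
    return one_minute <= max_load
```

The loadavg parameter exists only so that the check can be exercised with fixed values, e.g. may_start_job(13, 4, loadavg=lambda: (13.5, 12.0, 9.0)).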

The nested Makefile problem and GNU Make jobserver

Nested Makefiles are processed by calling make recursively, and therefore face a similar problem: if you run multiple make processes in parallel, and they run multiple jobs simultaneously, you end up overscheduling. To avoid this, GNU make introduces a jobserver. It ensures that the specified job number is respected across multiple make invocations.

At the time of writing, GNU make supports three kinds of the jobserver protocol:

  1. The legacy Unix pipe-based protocol that relied on passing file descriptors to child processes.
  2. The modern Unix protocol using a named pipe.
  3. The Windows protocol using a shared semaphore.

All these variants follow roughly the same design principles, and are peer-to-peer protocols for using shared state rather than true servers in the network sense. The jobserver’s role is mostly limited to initializing the state and seeding it with an appropriate number of job tokens. Afterwards, clients are responsible for acquiring a token whenever they are about to start a job, and returning it once the job finishes. The availability of job tokens therefore limits the total number of processes started.
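The token dance can be sketched with an anonymous pipe, roughly in the spirit of the Unix variants. This is an illustration rather than GNU make’s code: the function names are made up, and the token byte value is an assumption (clients should simply write back whatever byte they read).

```python
import os


def create_jobserver(tokens: int) -> tuple[int, int]:
    """Create a pipe and seed it with one byte per job token."""
    read_fd, write_fd = os.pipe()
    os.write(write_fd, b"+" * tokens)
    return read_fd, write_fd


def acquire(read_fd: int) -> bytes:
    """Block until a token is available, then take it out of the pipe."""
    return os.read(read_fd, 1)


def release(write_fd: int, token: bytes) -> None:
    """Return a previously acquired token byte to the pipe."""
    os.write(write_fd, token)


if __name__ == "__main__":
    rfd, wfd = create_jobserver(2)
    token = acquire(rfd)      # about to start a job
    try:
        pass                  # ... run the job here ...
    finally:
        release(wfd, token)   # must happen even if the job fails
```

Note that nothing in the pipe records who holds a token; the seeding side has no way to tell a busy client from a dead one.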

The flexibility of modern protocols permitted more tools to support them. Notably, the Ninja build system recently started supporting the protocol, therefore permitting proper parallelism in complex build systems combining Makefiles and Ninja. The jobserver protocol is also supported by Cargo and various Rust tools, GCC and LLVM, where it can be used to limit the number of parallel LTO jobs.

A system-wide jobserver

With a growing number of tools becoming capable of parallel processing, and at the same time gaining support for the GNU make jobserver protocol, it is becoming an interesting solution to the overscheduling problem. If we could run one jobserver shared across all build processes, we could control the total number of jobs running simultaneously, and therefore have all the simultaneously running builds dynamically adjust to one another!

In fact, this is not a new idea. A bug requesting jobserver integration was filed for Portage back in 2019. The NixOS jobserver effort dates back at least to 2021, though it has not been merged yet. Guildmaster and steve joined the effort very recently.

There are two primary problems with using a system-wide jobserver: token release reliability, and the “implicit slot” problem.

The token release problem

The first problem is more important. As noted before, the jobserver protocol relies entirely on clients releasing the job tokens they acquired, and the documentation explicitly emphasizes that they must be returned even in error conditions. Unfortunately, this is not always possible: if the client gets killed, it cannot run any cleanup code and therefore cannot return the tokens! For scoped jobservers like GNU make’s this usually isn’t that much of a problem, since make normally terminates upon a child being killed. However, a system jobserver could easily be left with no job tokens in the queue this way!

This problem cannot really be solved within the strict bounds of the jobserver protocol. After all, it is just a named pipe, and there are limits to how much you can monitor what’s happening to the pipe buffer. Fortunately, there is a way around that: you can implement a proper server for the jobserver protocol using FUSE, and provide it in place of the named pipe. The good news is that most of the tools don’t actually check the file type, and those that do can easily be patched.

The current draft of the NixOS jobserver provides a regular file with special behavior via FUSE, whereas guildmaster and steve both provide a character device via FUSE’s CUSE API. The NixOS jobserver and guildmaster both return unreleased tokens once the process closes the jobserver file, whereas steve returns them once the process that acquired them exits. This way, they can guarantee that a process that either can’t release its tokens (e.g. because it’s been killed), or one that doesn’t because of an implementation issue (e.g. Cargo), doesn’t end up effectively locking other builds. It also means we can provide live information on which processes are holding the tokens, or even implement additional features such as limiting token provision based on the system load, or setting per-process limits.
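The bookkeeping that makes this possible can be modelled in a few lines. The following is a purely in-memory sketch, not code from guildmaster, steve or the NixOS jobserver; the class and method names are hypothetical, and client_gone() stands in for whatever event the real server reacts to (the file being closed, or the process exiting).

```python
class TrackingJobserver:
    """Toy model of a jobserver that remembers which client holds which tokens."""

    def __init__(self, tokens: int) -> None:
        self.available = tokens
        self._held: dict[str, int] = {}  # client id -> tokens currently held

    def acquire(self, client: str) -> bool:
        """Hand out one token if any is left; a real server would block instead."""
        if self.available == 0:
            return False
        self.available -= 1
        self._held[client] = self._held.get(client, 0) + 1
        return True

    def release(self, client: str) -> None:
        """Regular release by a well-behaved client."""
        if self._held.get(client, 0) == 0:
            raise ValueError(f"{client} holds no tokens")
        self._held[client] -= 1
        if self._held[client] == 0:
            del self._held[client]
        self.available += 1

    def client_gone(self, client: str) -> int:
        """Reclaim everything a vanished client still held; returns the count."""
        reclaimed = self._held.pop(client, 0)
        self.available += reclaimed
        return reclaimed

    def holders(self) -> dict[str, int]:
        """Live view of who is holding tokens right now."""
        return dict(self._held)
```

With a plain pipe there is no equivalent of client_gone(): the tokens a killed client held are simply lost.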

The implicit slot problem

The second problem is related to the implicit assumption that a jobserver is inherited from a parent GNU make process that already acquired a token to spawn the subprocess. Since the make subprocess doesn’t really do any work itself, it can “use” the token to spawn another job instead. Therefore, every GNU make process running under a jobserver has one implicit slot that runs jobs without consuming any tokens. If the jobserver is running externally and no job tokens were acquired while running the top make process, it ends up running an extra process without a job token: so steve -j12 permits 12 jobs, plus one extra job for every package being built.

Fortunately, the solution is rather simple: one needs to implement token acquisition at Portage level. Portage acquires a new token prior to starting a build job, and releases it once the job finishes. In fact, this solves two problems: it accounts for the implicit slot in builders implementing the jobserver protocol, and it limits the total number of jobs run for parallel builds.
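The accounting can be spelled out as a small sketch. The function name and the portage_holds_token flag are made up for illustration, and the model assumes every build is a jobserver client with exactly one implicit slot.

```python
def max_running_jobs(tokens: int, builds: int, portage_holds_token: bool = False) -> int:
    """Upper bound on simultaneously running jobs under a shared jobserver."""
    if portage_holds_token:
        # Every build start is paid for with a token, so the implicit slot
        # of each build is already covered and the pool size is the limit.
        return tokens
    # Otherwise each top-level build gets one job "for free".
    return tokens + builds


print(max_running_jobs(12, 4))                            # 16
print(max_running_jobs(12, 4, portage_holds_token=True))  # 12
```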

However, this is a double-edged sword. On one hand, it limits the risk of overscheduling when running parallel build jobs. On the other, it means that a new emerge job may not be able to start immediately, but instead wait for other jobs to free up job tokens first, negatively affecting interactivity.

A semi-related issue is that acquiring a single token doesn’t properly account for processes that are parallel themselves but do not implement the jobserver protocol, such as pytest-xdist runs. It may be possible to handle these better by acquiring multiple tokens prior to running them (or possibly while running them), but in the former case one needs to be careful to acquire them atomically, and not end up in the equivalent of a deadlock: two processes each acquiring part of the tokens they require, and waiting forever for more.

The implicit slot problem also causes issues in other clients. For example, nasm-rs writes an extra token to the jobserver pipe to avoid special-casing the implicit slot. However, this violates the protocol and breaks clients with per-process tokens. Steve carries a special workaround for that package.

Summary

A growing number of tools is capable of some degree of concurrency: from builders traditionally being able to start multiple parallel jobs, to multithreaded compilers. While they provide some degree of control over how many jobs to start, avoiding overscheduling while running multiple builds in parallel is non-trivial. Some builders can use load average to partially mitigate the issue, but that’s far from a perfect solution.

Jobservers are our best bet right now. Originally designed to handle job scheduling for recursive GNU make invocations, they are being extended to control other parallel processes throughout the build, and can be further extended to control the job numbers across different builds, and even across different build containers.

While NixOS seems to have dropped the ball, Gentoo is now finally actively pursuing global jobserver support. Guildmaster and steve both prove that the server-side implementation is possible, and integration is just around the corner. At this point, it’s not clear whether jobserver-enabled systems are going to become the default in the future, but it’s certainly an interesting experiment to carry out.
