Reducing SquashFS delta size through partial decompression

In a previous article titled ‘using deltas to speed up SquashFS ebuild repository updates’, the author has considered benefits of using binary deltas to update SquashFS images. The proposed method has proven very efficient in terms of disk I/O, memory and CPU time use. However, the relatively large size of deltas made network bandwidth a bottleneck.

The rough estimations done at the time proved that this is not a major issue for a common client with a moderate-bandwidth link such as ADSL. Nevertheless, the size is an inconvenience both to clients and to mirror providers. Assuming that there is an upper bound on disk space consumed by snapshots, the extra size reduces the number of snapshots stored on mirrors, and therefore shortens the supported update period.

The most likely cause for the excessive delta size is the complexity of correlation between input and compressed output. Changes in input files are likely to cause much larger changes in the SquashFS output that the tested delta algorithms fail to express efficiently.

For example, in the LZ family of compression algorithms, a change in input stream may affect the contents of the dictionary and therefore the output stream following it. In block-based compressors such as bzip2, a change in input may shift all the following data moving it across block boundaries. As a result, the contents of all the blocks following it change, and therefore the compressed output for each of them.

Since SquashFS splits the input into multiple blocks that are compressed separately, the scope of this issue is much smaller than in plain tarballs. Nevertheless, small changes occurring in multiple blocks are able to grow delta two to four times as large as it would be if the data was not compressed. In this paper, the author explores the possibility of introducing a transparent decompression in the delta generation process to reduce the delta size.

Read on… [PDF]

Using deltas to speed up SquashFS updates

The ebuild repository format that is used by Gentoo generally fits well in the developer and power user work flow. It has a simple design that makes reading, modifying and adding ebuilds easy. However, the large number of separate small files with many similarities do not make it very space efficient and often impacts performance. The update (rsync) mechanism is relatively slow compared to distributions like Arch Linux, and is only moderately bandwidth efficient.

There were various attempts at solving at least some of those issues. Various filesystems were used in order to reduce the space consumption and improve performance. Delta updates were introduced through the emerge-delta-webrsync tool to save bandwidth. Sadly, those solutions usually introduce other inconveniences.

Using a separate filesystem for the repositories involves additional maintenance. Using a read-only filesystem makes updates time-consuming. Similarly, the delta update mechanism — while saving bandwidth — usually takes more time than plain rsync update.

In this article, the author proposes a new solution that aims both to save disk space and reduce update time significantly, bringing Gentoo closer to the features of binary distributions. The ultimate goal of this project would be to make it possible to use the package manager efficiently without having to perform additional administrative tasks such as designating an extra partition.

Read on… [PDF]

Why there’s no IUSE=systemd, logrotate, bash-completion…

Next tides of users slowly notice that a number of unneeded files is installed on their systems. They enumerate systemd unit files, logrotate files, take their pitchforks and start their cruciates against Gentoo developers wasting their precious disk space.

Let me tell you a story. The story starts when Uncle Scarabeus wants to add bash-completion support into libreoffice ebuild. He considers this a minor addon, not worth the half a day necessary to rebuild libreoffice, so he doesn’t revbump it. He simply assumes the change will be propagated nicely when users upgrade to the next version.

Of course, some users will already come shouting here: that’s against the policy! Yes, indeed it is. But is it worth the hassle? Should all libreoffice users be forced to rebuild that huge package just to get a single tiny file installed? He could wait and add that along with the next version. Well, if he wouldn’t forget about it.

But that’s not really important part here. Because, to his surprise, many users have actually noticed the change. That’s because the use of bash-completion.eclass has caused the ebuild to have IUSE=bash-completion; and many of the --newuse Portage option users have rebuilt the package. A few others, like me, just stopped using that option.

That’s when the discussion started. We — the few devs actually caring about discussing — decided that it is quite pointless to control installing tiny files through USEflags. Of course, the libreoffice is an elephant-case here but so-called regular packages aren’t much better here. Is there really a reason to rebuild even 10 C files when the only thing going to change is a single, tiny text file being installed or not?

Another solution is to split those files into separate ebuilds. But that’s usually inconvenient both for users and devs. Users have to notice that they need to emerge an additional package to get the particular file installed, and devs need to maintain that additional package. That starts to become really ridiculous with files like systemd units which are often generated during build-time and store installation paths.

So what to use? INSTALL_MASK, obviously. It’s an ancient Portage magic which allows you to control which files will be punted from installed files. You can use app-portage/install-mask to quickly set it for the files you don’t want. It’s as simple as:

# install-mask -a systemd logrotate