PyPy is back, and for real this time!

As you may recall, I had been looking for a dedicated PyPy maintainer for quite some time. Sadly, all the people who helped (and whom I’d like to thank a lot) eventually ran out of time. So I have finally decided to look into the hacks that reduce build-time memory use, and to take care of the necessary ebuild and packaging work myself.

Continue reading “PyPy is back, and for real this time!”

Bash pitfalls: globbing everywhere!

Bash has many subtle pitfalls, some of which can live unnoticed for a very long time. A common example of that kind of pitfall is ubiquitous filename expansion, or globbing. What many script writers fail to notice is that practically anything that looks like a pattern and is not quoted is subject to globbing, including unquoted variables.

There are two extra snags that add to this. Firstly, many people forget that asterisks (*) and question marks (?) are not the only characters that form patterns; square brackets ([) do so as well. Secondly, by default bash (and POSIX sh) takes failed expansions literally. That is, if your glob does not match any file, you may not even know that you are globbing.
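As a quick illustration (the filenames here are hypothetical):

# [12] is a pattern too: if a file named foo1 or foo2 exists,
# it is matched; otherwise bash passes the literal string 'foo[12]'
echo foo[12]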

Then it is just a matter of the script running in the ‘wrong’ directory for the result to change. Of course, that coincidence is often unlikely, maybe even close to impossible. You can reduce the risk further by running in a safe directory. But in the end, writing predictable software is a fine quality.

How to notice mistakes?

Bash provides two major facilities that can help you catch such mistakes: the nullglob and failglob shell options (set via shopt).

The nullglob option is a good default for your script. After enabling it, a failing filename expansion results in no arguments rather than the verbatim pattern itself. This has two important implications.

Firstly, it makes iterating over optional files easy:

# with nullglob, unmatched patterns vanish instead of staying literal
for f in a/* b/* c/*; do
    some_magic "${f}"
done

Without nullglob, the above may actually yield the literal a/* if no file matches the pattern. For this reason, you would need an additional check for file existence inside the loop, as sketched below. With nullglob, the unmatched patterns are simply omitted; in fact, if none of the patterns match, the loop will not run even once.
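For comparison, a minimal sketch of the existence check you would need without nullglob:

for f in a/* b/* c/*; do
    # skip the verbatim pattern left over from a failed expansion
    [[ -e ${f} ]] || continue
    some_magic "${f}"
done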

Secondly, it turns every accidental glob that matches nothing into no arguments at all. While this is not the friendliest of warnings, and it may itself have quite undesired results, it makes you more likely to notice that something is going wrong.

The failglob option is the better choice if you can assume that you do not need to match files within its scope. In that case, bash treats every failing filename expansion as a fatal error and terminates execution with an appropriate message.
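For instance, with a hypothetical cleanup command:

shopt -s failglob
rm -- *.tmp
# if no .tmp file exists, bash reports 'no match: *.tmp'
# instead of passing the literal pattern to rm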

The main advantage of failglob is that it makes you aware of any mistake before someone hits it the hard way. That is, of course, assuming that the accidental pattern does not happen to match some file already.

There is also the choice of noglob. However, I would not recommend it, since it works around mistakes rather than fixing them, and makes the code rely on a non-standard environment.

Word splitting without globbing

One of the pitfalls I noticed myself falling into lately is using unquoted variable substitution to perform word splitting. For example:

for i in ${v}; do
    echo "${i}"
done

At first glance, everything looks fine. ${v} contains a whitespace-separated list of words, and we iterate over each word. The pitfall here is that the words in ${v} are subject to filename expansion. For example, if a lone asterisk happens to be there (as in v='10 * 4'), you actually get all the files in the current directory. Unexpected, isn’t it?
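To see it for yourself:

v='10 * 4'
for i in ${v}; do
    echo "${i}"
done
# prints 10, then the name of every file in the current
# directory (or a literal * if the directory is empty), then 4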

I am aware of three solutions that can be used to accomplish word splitting without implicit globbing:

  1. setting noglob (set -f) locally,
  2. setting GLOBIGNORE='*' locally (both sketched below),
  3. using the Swiss army knife of read to perform word splitting.
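A minimal sketch of the first two approaches, with the set-and-restore boilerplate mentioned below (this assumes globbing was enabled beforehand, hence the plain restore):

# option 1: disable globbing around the loop, then restore it
set -f
for i in ${v}; do
    echo "${i}"
done
set +f

# option 2: GLOBIGNORE='*' discards every glob match; with nullglob
# unset, a pattern whose matches were all discarded stays literal
GLOBIGNORE='*'
for i in ${v}; do
    echo "${i}"
done
unset GLOBIGNORE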

Personally, I dislike the first two since they require set-and-restore magic, and the second one additionally pays the penalty of performing the globbing only to discard its results. Therefore, I will expand on using read:

read -r -d '' -a words <<<"${v}"
for i in "${words[@]}"; do
    echo "${i}"
done

While read is normally used to read from files, we can use the here string syntax of bash to feed the variable into it. The -r option disables backslash escape processing, which is undesired here. -d '' causes read to process the whole input rather than stopping at a delimiter (such as a newline). -a words puts the split words into the array ${words[@]}; and since we know how to safely iterate over an array, the underlying issue is solved.
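Applying this to the earlier example:

v='10 * 4'
read -r -d '' -a words <<<"${v}"
printf '%s\n' "${words[@]}"
# prints 10, *, 4; the asterisk remains a literal word

Note that read returns a non-zero exit status here, since it reaches the end of input before finding the NUL delimiter; keep that in mind if your script runs under set -e.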

The Council and the Community

A new Council election is in progress and we have a few candidates. Most of them have written a manifesto. For some of them, this is one of the few mails they have sent to the public mailing lists recently. For one of them, it is the only one. Do we want to elect people who do not participate actively in the Community? Does such an election even make sense?

Continue reading “The Council and the Community”

Inlining -march=native for distcc

-march=native is a gcc flag that enables auto-detection of the CPU architecture and properties. Not only does it allow you to avoid finding the correct value of -march=, but it also enables instruction sets that do not fit any standard CPU profile and detects the cache sizes.

Sadly, -march=native itself can’t really work well with distcc. Since the detection is performed at compile time, remote gcc invocations use the architecture of the distcc host rather than that of the client. As a result, the compiled executables would be built for a mix of the architectures of the various distcc hosts.

You may also find -march=native a bit opaque. For example, we had multiple bug reports about LLVM failing to build with -march=atom. However, some of the reporters were using -march=native, so we weren’t able to immediately identify the duplicates.

In this article, I will shortly guide you through replacing -march=native with the expanded compiler flags, for the benefit of distcc compatibility and more explicit build logs.
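As a preview, one common way to inspect the flags that -march=native resolves to (not necessarily the exact method described in the full article) is to look at the verbose gcc output:

# the cc1 line in the verbose output lists the resolved
# -march=, -mtune= and instruction set flags
gcc -march=native -E -v - </dev/null 2>&1 | grep cc1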

Continue reading “Inlining -march=native for distcc”

Reducing SquashFS delta size through partial decompression

In a previous article, titled ‘Using deltas to speed up SquashFS ebuild repository updates’, the author considered the benefits of using binary deltas to update SquashFS images. The proposed method proved very efficient in terms of disk I/O, memory and CPU time use. However, the relatively large size of the deltas made network bandwidth a bottleneck.

The rough estimates done at the time suggested that this is not a major issue for a common client with a moderate-bandwidth link such as ADSL. Nevertheless, the size is an inconvenience both to clients and to mirror providers. Assuming that there is an upper bound on the disk space consumed by snapshots, the extra size reduces the number of snapshots stored on mirrors, and therefore shortens the supported update period.

The most likely cause of the excessive delta size is the complexity of the correlation between input and compressed output. Changes in the input files are likely to cause much larger changes in the SquashFS output, which the tested delta algorithms fail to express efficiently.

For example, in the LZ family of compression algorithms, a change in the input stream may affect the contents of the dictionary, and therefore the output stream following it. In block-based compressors such as bzip2, a change in the input may shift all the following data, moving it across block boundaries. As a result, the contents of all the subsequent blocks change, and with them the compressed output for each of those blocks.

Since SquashFS splits the input into multiple blocks that are compressed separately, the scope of this issue is much smaller than in plain tarballs. Nevertheless, small changes occurring in multiple blocks can grow the delta to two to four times the size it would have if the data were not compressed. In this paper, the author explores the possibility of introducing transparent decompression into the delta generation process in order to reduce the delta size.

Read on… [PDF]