A format that does one thing well or one-size-fits-all?

The Unix philosophy states that we ought to design programs that “do one thing well”. Nevertheless, the current trend is to design huge monoliths with multiple unrelated functions, with web browsers at the peak of that horrifying journey. However, let’s consider something else.

Does the same philosophy hold for algorithms and file formats? Is it better to design formats that suit a single use case well, and swap between different formats as need arises? Or perhaps it is a better solution to actually design them so they could fit different needs?

Let’s consider this by exploring three areas: hash algorithms, compressed file formats and image formats.

Hash algorithms

Hash, digest, checksum — they have many names, and many uses. To list a few uses of hash functions and their derivatives:

  • verifying file integrity
  • verifying file authenticity
  • generating derived keys
  • generating unique keys for fast data access and comparison

Different use cases imply different requirements. The simple CRC algorithms were good enough to check files for random damage but they aren’t suitable for cryptographic purposes. The SHA hashes provide good resistance to attacks but they are too slow to speed up data lookups. That role is much better served by dedicated fast hashes such as xxHash. In my opinion, these are all examples of “do one thing well”.

On the other hand, there is some overlap. More often than not, cryptographic hash functions are used to verify integrity. Then we have modern hashes like BLAKE2 that are both fast and secure (though not as fast as dedicated fast hashes). Argon2 key derivation function builds upon BLAKE2 to improve its security even further, rather than inventing a new hash. These are the examples how a single tool is used to serve different purposes.

Compressed file formats

The purpose of compression, of course, is to reduce file size. However, individual algorithms may be optimized for different kinds of data and different goals.

Probably the oldest category are “archiving” algorithms that focus on providing strong compression and reasonably fast decompression. Back in the day, there were used to compress files in “cold storage” and for transfer; nowadays, they can be used basically for anything that you don’t modify very frequently. The common algorithms from this category include deflate (used by gzip, zip) and LZMA (used by 7z, lzip, xz).

Then, we have very strong algorithms that achieve remarkable compression at the cost of very slow compression and decompression. These are sometimes (but rarely) used for data distribution. An example of such algorithms are the PAQ family.

Then, we have very fast algorithms such as LZ4. They provide worse compression ratios than other algorithms, but they are so fast that they can be used to compress data on the fly. They can be used to speed up data access and transmission by reducing its size with no noticeable overhead.

Of course, many algorithms have different presets. You can run lz4 -9 to get stronger compression with LZ4, or xz -1 to get faster compression with XZ. However, neither the former will excel at compression ratio, nor the latter at speed.

Again, we are seeing different algorithms that “do one thing well”. However, nowadays ZSTD is gaining popularity and it spans a wider spectrum, being capable of both providing very fast compression (but not as fast as LZ4) and quite strong compression. What’s really important is that it’s capable of providing adaptive compression — that is, dynamically adjusting the compression level to provide the best throughput. It switches to a faster preset if the current one is slowing the transmission down, and to a stronger one if there is a potential speedup in that.

Image formats

Let’s discuss image formats now. If we look back far enough, we’d arrive at a time when two image formats were dominating the web. On one hand, we had GIF — with lossless compression, limited color palette, transparency and animations, that made it a good choice for computer-generated images. On the other, we had JPEG — with efficient lossy compression and high color depth suitable for photography. We could see these two as “doing one thing well”.

Then came PNG. PNG is also lossless but provides much higher color depth and improved support for transparency via an alpha channel. While it’s still the format of choice for computer-generated images, it’s also somewhat suitable for photography (but with less efficient compression). With APNG around, it effectively replaces GIF but it also partially overlaps with the use cases for JPEG.

Modern image formats go even further. WebP, AVIF and JPEG XL all support both lossless and lossy compession, high color depths, alpha channel, animation. Therefore, they are suitable both for computer-generated images and for photography. Effectively, they can replace all their predecessors with a “one size fits all” format.

Conclusion

I’ve asked whether it is better to design formats that focus on one specific use case, or whether formats that try to cover a whole spectrum of use cases are better. I’m afraid there’s no easy answer to this question.

We can clearly see that “one-size-fits-all” solutions are gaining popularity — BLAKE2 among hashes, ZSTD in compressed file formats, WebP, AVIF and JPEG XL among image formats. They have a single clear advantage — you need just one tool, one implementation.

Your web browser needs to support only one format that covers both computer-generated graphics using lossless compression and photographs using lossy compression. Different tools can reuse the same BLAKE2 implementation that’s well tested and audited. A single ZSTD library can serve different applications in their distinct use cases.

However, there is still a clear edge to algorithms that are focused on a single use case. xxHash is still faster than any hashes that could be remotely suitable for cryptographic purposes. LZ4 is still faster than ZSTD can be in its lowest compression mode.

The only reasonable conclusion seems to be: there are use cases for both. There are use cases that are best satisfied by a dedicated algorithm, and there are use cases when a more generic solution is better. There are use cases when integrating two different hash algorithms, two different compression libraries into your program, with the overhead involved, is a better choice, than using just one algorithm that fits neither of your two distinct use cases well.

Once again, it feels that a reference to XKCD#927 is appropriate. However, in this particular instance this isn’t a bad thing.

Leave a Reply

Your email address will not be published.