It is fairly common knowledge that the ordering of members within an archive can affect the compression ratio. I’ve done some quick testing, and the results somewhat surprised me. Firstly, the simplest lexical sorting by name (path) gave the best result. Secondly, the difference between that and sorting by size was as large as 8%.
Note that this is a pretty specific source archive, so results may vary. Test details and commands follow in the remainder of the post.
Sort order | Size in bytes | Compared to best |
---|---|---|
name | 108 011 756 | 100.00% |
suffix | 108 573 612 | 100.52% |
size (smallest first) | 116 797 440 | 108.13% |
size (largest first) | 116 645 940 | 108.00% |
suffix + size | 111 709 128 | 103.42% |
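To make the orderings in the table concrete, here is a rough Python sketch of how each one could be expressed as a sort key. It is an illustration only, not the commands used for the test: the `order_members` helper, the order names, the interpretation of "suffix" as the file extension, and the tie-breaking by path are all my assumptions.

```python
import os


def suffix(path):
    # Assumption: "suffix" means the file extension (lowercased, '' if none).
    return os.path.splitext(path)[1].lower()


def order_members(members, order):
    """Return paths from a list of (path, size) tuples in the requested order.

    Ties are broken by path, which is an assumption made for this sketch.
    """
    keys = {
        "name": lambda m: m[0],
        "suffix": lambda m: (suffix(m[0]), m[0]),
        "size-asc": lambda m: (m[1], m[0]),
        "size-desc": lambda m: (-m[1], m[0]),
        "suffix+size": lambda m: (suffix(m[0]), m[1], m[0]),
    }
    return [path for path, _ in sorted(members, key=keys[order])]
```

The resulting list is what would be handed to the archiver, one member per line, in exactly that order.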
The conclusion? Sorting can affect the compression ratio more than I had anticipated. However, all the “obvious” optimizations made the result worse than plain lexical sorting. Perhaps it’s just a matter of well-organized source code keeping similar files in the same directories. Perhaps there is a way to optimize it even more (and beat sorting by name). One interesting option would be to group files into size buckets, and then sort by name within each bucket.
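The bucket idea has not been tested here, but a minimal sketch of it could look like the following. The power-of-two bucket boundaries and the `size_bucket` and `bucketed_order` names are assumptions picked for illustration; other bucket widths might well behave differently.

```python
def size_bucket(size):
    # Assumption: power-of-two buckets. 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ...
    return size.bit_length()


def bucketed_order(members):
    """Order (path, size) tuples by size bucket first, then by path."""
    ordered = sorted(members, key=lambda m: (size_bucket(m[1]), m[0]))
    return [path for path, _ in ordered]
```

Files of roughly similar size end up next to each other, while lexical order (and thus directory locality) is preserved inside each bucket. Whether that actually beats plain sorting by name would have to be measured.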
Special thanks to Adrien Nader and Lasse Collin from #tukaani for inspiring me to do this.