.tar sorting vs .xz compression ratio

It is fairly common knowledge that the ordering of members within an archive can affect the compression ratio. I did some quick testing, and the results surprised me in two ways. First, the simplest lexical sorting by name (path) gave the best result. Second, the difference between that and sorting by size was as large as 8%.

Note that this is a fairly specific source archive, so results may vary. Test details and commands are in the remainder of the post.

Compression results per sort order
Sort order              Size in bytes   Compared to best
name                    108 011 756     100.00%
suffix                  108 573 612     100.52%
size (smallest first)   116 797 440     108.13%
size (largest first)    116 645 940     107.99%
suffix + size           111 709 128     103.42%
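The "Compared to best" column is each compressed size divided by the smallest one (the name-sorted archive). A minimal sketch of that arithmetic follows; `pct_of_best` is a made-up helper name, not something used in the original test:

```shell
# Hypothetical helper: print SIZE as a percentage of BEST, with two decimals.
pct_of_best() {
    awk -v size="$1" -v best="$2" 'BEGIN { printf "%.2f%%\n", 100 * size / best }'
}

pct_of_best 108573612 108011756   # suffix vs. name
pct_of_best 111709128 108011756   # suffix + size vs. name
```

The two calls print 100.52% and 103.42%, matching the corresponding rows of the table.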

The conclusion? Sorting can affect the compression ratio more than I had anticipated. However, all the “obvious” optimizations made the result worse than plain lexical sorting. Perhaps it’s just a matter of well-organized source code keeping similar files in the same directories. Perhaps there is a way to optimize it even more (and beat sorting by name). One interesting option would be to group files into size buckets, and then sort by name within each bucket.
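The bucket idea has not been tested here, but it could be sketched roughly as follows. The power-of-two bucket boundaries, the helper name `list_by_size_bucket`, and the use of `LC_ALL=C` for a stable ordering are all assumptions made for illustration. It relies on GNU find for `-printf`, like the commands later in this post.

```shell
# Hypothetical helper: list non-directory entries under "$1", grouped into
# power-of-two size buckets, then sorted by path within each bucket.
list_by_size_bucket() {
    find "$1" '!' -type d -printf '%s %p\n' |
    awk '{
        size = $1
        bucket = 0
        # bucket = how many times the size can be halved before reaching 0,
        # so 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, and so on
        while (size >= 1) { size = int(size / 2); bucket++ }
        # drop the size field but keep the path intact (it may contain spaces)
        sub(/^[0-9]+ /, "")
        print bucket, $0
    }' |
    LC_ALL=C sort -k1,1n -k2 |
    cut -d' ' -f2-
}

# Possible use (the output file name is made up):
# list_by_size_bucket llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef |
#     tar -cf /var/tmp/llvm.bybucket.tar -T -
```

Whether this beats plain sorting by name is an open question; coarser or finer buckets may behave quite differently.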

Special thanks to Adrien Nader and Lasse Collin from #tukaani for inspiring me to do this.

Testing details

Testing was done on data from the llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef.tar.gz snapshot. The archive was unpacked, then repacked using sorted file lists. Directory entries were not included in the archive.

Sorting by name used a plain lexical sort, with the following command:

find llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef '!' -type d | sort  | tar -cvvf /var/tmp/llvm.byname.tar -T -

Sorting by size involved a numeric sort by file size, then by name. The first command sorts smallest first, the second largest first:

find llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef '!' -type d -printf '%s %p\n' | sort -n | cut -d' ' -f2- | tar -cf /var/tmp/llvm.bysize.tar -T -
find llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef '!' -type d -printf '%s %p\n' | sort -k 1nr,2 | cut -d' ' -f2- | tar -cvvf /var/tmp/llvm.bysize.rev.tar -T -

Sorting by suffix involved extracting the final file suffix (if any) and sorting by it, then by full name:

find llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef '!' -type d | sed -r -e 's:^:_ :' -e 's:.*\.([^/]*):\1&:' | sort | cut -d' ' -f2- | tar -cf /var/tmp/llvm.bysuffix.tar -T -
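To see what the two sed expressions in the command above do, they can be run on a few sample paths. The paths and the function name `suffix_key` are only for illustration:

```shell
# The same sed expressions as in the command above, wrapped in a function.
suffix_key() {
    sed -r -e 's:^:_ :' -e 's:.*\.([^/]*):\1&:'
}

printf '%s\n' 'llvm/lib/IR/Value.cpp' 'llvm/README' 'docs.d/notes' | suffix_key
```

This prints `cpp_ llvm/lib/IR/Value.cpp`, `_ llvm/README` and `d_ docs.d/notes`. Every line gets a sort key of the form `suffix_`, and `cut -d' ' -f2-` strips it again after sorting. The third line shows a quirk: if the file name has no dot but a parent directory does, the directory's suffix ends up as the key.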

Sorting by suffix + size involved sorting by suffix first (so that similar files are grouped together), then by size, then by full path. This took some find(1) hackery:

find llvm-project-f6f1fd443f48f417de9dfe23353055f1b20d87ef '!' -type d -printf '%s %p\n' | sed -r -e 's:^:_ :' -e 's:.*\.([^/]*):\1&:' | sort -n -k2 | sort -k 1,1 -s | cut -d' ' -f3- | tar -cf /var/tmp/llvm.bysuffix.size.tar -T -
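The commands above only build the tarballs; the compression step itself is not shown. A minimal sketch of how the compressed sizes could be collected, assuming plain `xz` with its default settings (the options actually used for the table are not stated, so the numbers need not match) and a made-up helper name `xz_sizes`:

```shell
# Hypothetical helper: compress each tarball with xz, keeping the input file,
# and print "<compressed size in bytes> <compressed file>" for each archive.
xz_sizes() {
    for tarball in "$@"; do
        # -k keeps the original .tar, -f overwrites an existing .tar.xz
        xz -k -f "$tarball" || return 1
        printf '%d %s\n' "$(wc -c < "$tarball.xz")" "$tarball.xz"
    done
}

# Example (archive names taken from the commands above):
# xz_sizes /var/tmp/llvm.byname.tar /var/tmp/llvm.bysize.tar | sort -n
```

Piping the output through `sort -n` lists the best-compressing order first.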
