Week 3 – Modernization of Portage
It is the third week of the coding period. It is mostly an uneventful week. Most part was spent on trying to understand the dependency resolution algorithm. In the second part of the week I also did some refactoring and some type hints.
Update on the blog posts
I lost my password to access this blog and also had troubles resetting the password. That is why I have not been able to post per week. With help from BlueKnight, I got my access back. So, I am dumping the blog posts I have written, all at once. From next week, I expect posts to be at regular intervals (one per week). Sorry about the bulk posting, hope you don’t mind.
Portage vs Pacman
The most significant part of portage is it’s dependency resolution system. It is very different from all other package managers due to the unique concept ot “USE” flags.
As many can guess, a graph data structure is used extensively to find the dependencies of a package. The process is somewhat trivial in most package managers. Arch Linux’s pacman for example constructs a graph with package names as nodes and it’s dependencies as it’s children. Now it’s a matter of a simple graph traversal. The detected dependencies are copied to a list and everything is downloaded from the a server. The implementation can be found here.
Portage uses a different algorithm called backtracking. Portage finds the package’s use flags and backtracks the packages related to that use flag. It is recursive and the process repeats. The time complexity is O(N!) and I understood that’s the reason portage is slow and not because of python. Even after hours of looking at the codebase it is hard to make sense of the exact workings. Due to the sheer number of features portage offers, the process is many folds complex than pacman. This can be understood by the fact that the dependency resolution part (not completely) of pacman is like 927 lines of c code at the moment of writing this and the portage equivalent is about 11814 lines of python code. I also had the code profiled and
depgraph.py alone takes like 95% of the total runtime. So it is useful to learn about the algorithm. The profiling results can be found here.
Here is the image version.
Portage also uses granular, file level locks, whereas pacman uses one lock for the whole runtime. This allows multiple instances of portage to run at the same time.
Studying portage’s codebase made it much easier for me to understand pacman’s codebase and in no time, even though I don’t have much experience in C. I am glad I got to work on portage, but also sad that I cannot understand the codebase enough (yet) to contribute in a deeper way. Portage is years of work of many smart minds and it is unreasonable of me to expect to understand things fast.
Refactoring / Tidying up portage
The second half of the week was towards finding something to improve on portage. I added a few type hints and refactored some big functions into smaller ones for better readability. Also found some unreachable code and removed them as well. I think they are nice improvements. I sent a pull request which can be found here. It is not merged yet, I think it will be soon.
To summarize, the week was spent delving more into the cProfile results of portage and comparing it with pacman. I also refactored the codebase a little bit, which makes it a tiny bit better.
I want to look at portage’s ability to resolve circular dependencies but I am not sure if I can do it (if it is a simple task). If it were simple, portage developers would have done it already, but I still want to give it a try. I’ll keep you posted. That marks the end of week three. See you in week four’s post!