Week 12 Report for Refining ROCm Packages in Gentoo

Although this is the final week, I would like to say that it is as exciting as the first week.

I kept polishing rocm.eclass with the help of Michał and my mentor, and it is now in good shape [1]. I must admit that the time to write an eclass for a beginner like me is much more than what I expected. In my proposal, I leave 4 weeks to finish it, 2-week implementation and 2-week polishing. In reality, I implemented within 2 weeks, but polished it for 4 weeks. I made a lot of QA issues and was not aware, which increases the number of review-modify cycles. During this process, I leant a lot:

1. Always re-read the eclass, especially comments and examples thoroughly after modification. Many times I forgot there is an example far from the change that should be updated because one functions changes its behavior.

2. Read the bash manual carefully, because properly usage of features like bash array can greatly simplify code.

3. Consider the maintenance difficulty of the eclass. I wrote a oddly specific `src_test`, which can cover all the cases of ROCm packages. But it’s not worth it, because specialized code should be placed into ebuilds, not one eclass. So instead, I remain the most common part, `check_amdgpu`, and get rid of phase functions, which made the eclass much cleaner.

I also find some bugs and their solutions. As I mentioned in week 10’s report, I observed many test failures in sci-libs/miopen based on vanilla clang. In this week, I figured out that they have 3 different reasons, and I’ve provided the two fixes for two failures ([2, 3]). The third issue, I’ve found it’s root cause [4]. I believe there would be a simple solution to this.

For gcc-12 issues, I also come to a brutal workaround [5]: undef the __noinline__ macro before including stdc++ headers and def it afterwards. I also observed that clang-15 does not fix this issue as expected, and provided a MWE at [6].

I’m also writing wiki pages, filling installation and developing guide.

In this 12-week project, I proposed to deliver rocm.eclass, and packages like pytorch, tensorflow with rocm enabled. Instead, I delivered rocm.eclass as proposed, but migrated the ROCm toolchain to vanilla clang. I thought porting ROCm toolchain to vanilla clang is closer to my project title “Refining ROCm Packages” 🙂

[1] https://github.com/gentoo/gentoo/pull/26784
[2] https://github.com/littlewu2508/gentoo/commit/2bfae2e26a23d78b634a87ef4a0b3f0cc242dbc4
[3] https://github.com/littlewu2508/gentoo/commit/cd11b542aec825338ec396bce5c63bbced534e27
[4] https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1731
[5] https://github.com/littlewu2508/gentoo/commit/2a49b4db336b075f2ac1fdfbc907f828105ea7e1
[6] https://github.com/llvm/llvm-project/issues/57544

This entry was posted in ROCm Packages. Bookmark the permalink.

Leave a Reply

Your email address will not be published.