Refining ROCm Packages in Gentoo — project summary

12 weeks quickly slips away, and I’m proud to say that the packaging quality of ROCm in Gentoo does gets improved in this project.

Two sets of major deliverables are achieved: New ebuilds of ROCm-5.1.3 tool-chain that purely depends on vanilla llvm/clang, and rocm.eclass along with ROCm-5.1.3 libraries utilizing them. Each brings one great QA improvement compare to the original ROCm packaging method.

Beyond these, I also maintained rocprofiler, rocm-opencl-runtimes, bumping their version with nontrivial changes. I discovered several bugs, and talked to upstream. I also wrote ROCm wiki pages, which starts my journey on Gentoo wiki.

By writing rocm.eclass, I learnt pretty much about eclass writing — how to design, how to balance needs and QA concerns, how to write comments and examples well, etc. I’m really grateful to those Gentoo developers who pointed out my mistakes and helped me polishing my eclass.

Since I’m working on top of Gentoo repo, my work is scattered around rather than having my own repo. My major products can be seen in [0], where all my PRs to ::gentoo located. My weekly report can be found on Gentoo GSoC blogs

[0] My finished PRs for gentoo during GSoC 2022

Details are as followed:

First, it’s about ROCm on vanilla llvm/clang

Originally, ROCm has its own llvm fork, which has some modifications not upstreamed yet. In the original Gentoo ROCm packaging roadmap, sys-devel/llvm-roc is introduced as the ROCm forked llvm/clang. This is the simple way, and worked well on ROCm-only packages [1]. But it brings troubles if a large project like blender pulls in dependencies using vanilla llvm, and results in symbol collision [2].

So, when I noticed [1] in week 1, I began my journey on porting ROCm on vanilla clang. I’m very lucky, because at that time clang-14.0.5 was just released, eliminating major obstacles for porting (previous versions more or less have bugs). After some quick hack I succeeded, which is recorded in the week 1 report [3]. In that week I successfully built blender with hip cycles (GPU-accelerated render code written in HIP), and rendered some example projects on a Radeon RX 6700XT.

While I was thrilled in porting ROCm tool-chain upon vanilla clang, my mentor pointed out that I have carelessly brought some serious bugs in ::gentoo. In week 2, I managed to fix bugs I created, and set up a reproducible test ground using docker, to make test more easy and clean and avoid such bugs from happening again. Details can be found in week 2’s report [4].

After that there weren’t non-trivial progresses in porting to vanilla clang, only bug fixes and ebuild polishing, until I met MIOpen in the last week.

The story of debugging MIOpen assemblies

In week 12 rocm.eclass is almost in its final shape, so I began to land ROCm libraries [1] including sci-libs/miopen. ROCm libraries are usually written in “high level” languages like HIP, while dev-util/hip is already ported to use vanilla clang in good shape, so there is no need to worry compilation problems. However, MIOpen have various hand-written assemblies for JIT, which causes several test failures [5]. It was frustrating because I’m unfamiliar with AMDGPU assemblies, so I was close to gave up (my mentor also suggest to give up working on it in GSoC). Thus, I reported my problem to upstream in [5], attached with my debugging attempts.

Thanks to my testing system mentioned previously, I have setup not only standard environments, but also one snapshot with full llvm/clang debug symbols. I quickly located the problem and reported to upstream via issue, but I still didn’t know why the error is happening.

In the second day, I decided to look at the assembly and debugging result once again. This time fortune is on my side, and I discovered the key issue is LLVM treating Y and N in metadata as boolean values, not strings (they should be kernel parameter names) [6]. I provided a fix in [7], and all tests passed on both Radeon VII and Radeon RX 6700XT. Amazing! I have also mentioned how excited I was in week 12’s report [8].

[1] For example, ROCm libraries in https://github.com/ROCmSoftwarePlatform
[2] https://bugs.gentoo.org/693200
[3] Week 1 Report for Refining ROCm Packages in Gentoo
[4] Week 4 Report for Refining ROCm Packages in Gentoo
[5] https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1731
[6] https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1731#issuecomment-1236913096
[7] https://github.com/littlewu2508/gentoo/commit/40eb81f151f43eb5d833dc7440b02f12dab04b89
[8] Week 12 Report for Refining ROCm Packages in Gentoo

The second deliverable is rocm.eclass

The most challenging part for me, is to write rocm.eclass. I started writing it in week 4 [9], and finished my design in week 8 [10] (including 10 days of temporary leave). In week 9-12, I posted 7 revisions of rocm.eclass in gentoo-dev mailing list [10,11], and received many helpful comments. Also, on Github PR [12], I also got lots of suggestions from Gentoo developers.

Eventually, I finished rocm.eclass, providing amdgpu_targets USE_EXPAND, ROCM_REQUIRED_USE, and ROCM_USE_DEP to control which gpu targets to compile, and coherency among dependencies. The eclass provides get_amdgpu_flags for src_configure and check_amdgpu for ensuring AMDGPU device accessibility in src_test. Finally, rocm.eclass is merged into ::gentoo in [13].

[9] Week 9 Report for Refining ROCm Packages in Gentoo
[10] https://archives.gentoo.org/gentoo-dev/threads/2022-08/
[11] https://archives.gentoo.org/gentoo-dev/threads/2022-09/
[12] https://github.com/gentoo/gentoo/pull/26784
[13] https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=cf8a6a845b68b578772f2ae0d2703f203c6dec33

Other coding products

Merged ebuilds

rocprofiler

I have bumped dev-util/rocprofiler and its dependencies to version 5.1.3, and fixed proprietary aql profiler lib loading, so ROCm stack on Gentoo stays fully open-sourced without losing most profiling functionalities [14].

[14] https://github.com/ROCm-Developer-Tools/rocprofiler/issues/38

Unmerged ebuilds

Due to limited time and long testing period, ebuilds of ROCm-5.1.3 libraries (ones using rocm.eclass) does not get merged. They can be found in this PR.
dev-libs/rocm-opencl-runtime is a critical package because it provides opencl, and many users still use opencl for GPGPU since HIP is a new stuff. I bumped it to 5.1.3 to match the vanilla clang tool-chain, and enabled its src_test, so users can make sure that vanilla clang isn’t breaking anything. The PR is located here.

Bug fixes

Existing bug fixing is also a part of my GSoC. I have created various PRs and closed corresponding bugs on Gentoo Bugzilla: #822828, #853718, #851795, #851792, #852236, #850937, #836248, #836274, #866839. Also, many bug fixing happens before new packages enter the gentoo main repo, or they are found by myself in the first place, so there is no record on Bugzilla.

Last but not least, the wiki page

I have created 3 pages [15-17], filling important information about ROCm. I also received a lot of help from the Gentoo community, mainly focused on refining my wiki to meet the standards.

[15] https://wiki.gentoo.org/wiki/ROCm
[16] https://wiki.gentoo.org/wiki/HIP
[17] https://wiki.gentoo.org/wiki/Rocprofiler

Comparison with original plan

The original plan in proposal also contained rocm.eclass. But it only allocated the last week for “investigation on vanilla clang”. In week 1, my mentor and I added “porting ROCm on vanilla clang” to the plan, and this became the new major deliverable. Due to the time limit, packaging high level frameworks like pytorch and tensorflow is abandoned. I only worked to get CuPy worked [18], showing rocm.eclass functionality on packages that depend on ROCm libraries.

I think the change of plan and deliverables better annotated the project title “Refining”, because what I did greatly improves the quality of existing ebuilds, rather than introducing more ebuilds.

[18] https://github.com/littlewu2508/gentoo/commit/3d142fa4b4ada560c053c2fd3c8c1501c82aace2

This entry was posted in ROCm Packages. Bookmark the permalink.

Leave a Reply

Your email address will not be published.