Week 3 Report for Refining ROCm Packages in Gentoo

This week I was quite busy with other work (school related: it is the end of the semester, and the official summer vacation starts next week), so there was not much progress on the plan described in the week 2 report. Another reason is that I was focusing on investigating and learning things such as ebuild writing.

AMD released ROCm-5.2 earlier this week, so I gave it a try at https://github.com/littlewu2508/gentoo/tree/rocm-5.2 using the llvm-14 backend (the same as for 5.1.3). I observed two bugs:
1. dev-libs/rocm-comgr-5.2.0 calls clang to compile for gfx1036 and uses `-mcode-object-version=5`, which is unsupported by clang-14. This can be patched out, after which all tests pass except for the existing issue [1].
2. dev-util/rocm-device-libs-5.2.0 causes lld to throw a linking error when compiling HIP programs: `lld: error: undefined symbol: __oclc_ABI_version`. Due to limited time I did not dig in to find the cause. I suspect it is caused by incompatibilities between the newest rocm-device-libs and llvm/clang-14. Currently rocm-device-libs-5.1.3 serves hip-5.2 well, so I decided to look at this carefully and consult upstream later.

Nevertheless, I installed hip-5.2 and it worked, but that did not make blender-3.2 HIP Cycles work on the Radeon VII. After reading the Blender developer forum and investigating HIP versioning, I found the answer in [2]: HIP Cycles on Vega devices needs a future release of HIP.

There is also a Gentoo user interested in installing the newest version of ROCm. I have been answering their questions about resolving the errors and warnings they hit when bumping to ROCm 5.1.3 and 5.2.0 [4][5], and the exchange has provided valuable information. However, I do not plan to bump to ROCm-5.2.0 quickly because of the two incompatibilities mentioned above, so I suggest waiting for the next version of clang (probably clang-15). Also, I have not seen any urgent need for ROCm-5.2.x.

Another interesting topic is ROCm on APUs. I have a Ryzen 4800U laptop. Two years ago, in the ROCm-3.5 era, the iGPU was reported as gfx902, but HIP programs compiled for gfx902 caused weird behaviour such as a dead screen. Now rocminfo shows it as gfx90c, and HIP programs run smoothly. So I wondered whether the full ROCm stack could be installed and run on this APU. Sadly, rocBLAS fails to compile: clang throws an internal error when compiling the gigantic Tensile kernel for gfx90c.

The important job this week was learning eclass syntax. I went through the eclass writing guide and read some example eclasses (mainly llvm.org.eclass, because it has a similar USE_EXPAND case). I started writing rocm.eclass and am currently working on handling the AMDGPU_TARGETS USE_EXPAND and determining the compilation architectures from it. I am developing it at [3]. I hope the core functionality can be finished by the end of week 4; I will then open a PR to get comments and mail it to the gentoo-dev mailing list for more suggestions.
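
To make the USE_EXPAND handling described above more concrete, here is a minimal sketch of the idea. It is not the actual rocm.eclass. The function name `get_amdgpu_flags`, the list of architectures, and the stubbed `use` helper are all assumptions made for illustration. In a real ebuild, Portage provides `use`, and the flags of an AMDGPU_TARGETS USE_EXPAND would appear as `amdgpu_targets_<arch>`.

```shell
# Hypothetical sketch: collect enabled amdgpu_targets_* USE flags into a
# semicolon-separated list (the separator CMake uses for list values).

# Assumed set of architectures; the real eclass may support a different set.
ALL_AMDGPU_TARGETS="gfx803 gfx900 gfx906 gfx908 gfx90a gfx1030"

# Stand-in for Portage's `use` helper so the sketch runs outside an ebuild.
# Succeeds if the flag appears in the space-separated USE variable.
use() {
	case " ${USE} " in
		*" $1 "*) return 0 ;;
		*) return 1 ;;
	esac
}

# Print the enabled architectures joined by ';', in the order of
# ALL_AMDGPU_TARGETS. Prints an empty line if none is enabled.
get_amdgpu_flags() {
	local target list=""
	for target in ${ALL_AMDGPU_TARGETS}; do
		if use "amdgpu_targets_${target}"; then
			list="${list}${target};"
		fi
	done
	# Strip the trailing separator.
	printf '%s\n' "${list%;}"
}

# Example: two targets enabled.
USE="amdgpu_targets_gfx906 amdgpu_targets_gfx1030"
get_amdgpu_flags   # prints gfx906;gfx1030
```

An ebuild could then pass the resulting string to the build system, for example as the value of a CMake variable. Which variable each ROCm package expects is something the eclass would have to handle per package.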

[1] https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/45
[2] https://bugs.gentoo.org/693200#c32
[3] https://github.com/littlewu2508/gentoo/blob/rocm-5.1.3/eclass/rocm.eclass
[4] https://bugs.gentoo.org/851702#c9
[5] https://bugs.gentoo.org/693200#c35
