Week 5 was mainly about utilizing and testing the rocm.eclass I wrote: packaging and testing the ROCm-5.1.3 libraries. I also began to land the ROCm-5.1.3 toolchain in ::gentoo. However, new problems emerged and I fell a bit behind schedule, so after negotiating with my mentor I decided to lower the priority of packaging tensorflow and jax with ROCm.
My newest progress is at https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3: sci-libs/roc{BLAS,FFT,PRIM,SPARSE,Thrust}-5.1.3 ebuilds utilizing rocm.eclass. I have written amdgpu_targets.desc and added it to profiles/desc, so each amdgpu_targets_* USE_EXPAND flag has its own description (the name and codename of the architecture, as well as the graphics cards it covers).
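Entries in profiles/desc/amdgpu_targets.desc follow the usual one-flag-per-line format; for example (the descriptions below are my own wording for the two cards I test on, not necessarily the final text in ::gentoo):

    gfx906 - Vega 20 GPUs (GCN 5.1), e.g. Radeon VII, Radeon Pro VII, Instinct MI50/MI60
    gfx1031 - Navi 22 GPUs (RDNA 2), e.g. Radeon RX 6700 XT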
It turned out that rocm.eclass simplified those ebuilds, especially src_test. I spent some time testing those libraries on a Radeon VII and a Radeon RX 6700 XT. By running the tests I found a critical bug in rocFFT-5.1.3 [1], which was confirmed by upstream. Caution is needed here: until the bug is fixed, amdgpu_targets_gfx906 should be masked for rocFFT-5.1.3 (see the sketch after this paragraph). On the other hand, the 6700 XT failed several tests in rocSPARSE, which is explained by upstream [2]. rocBLAS passes its tests on the Radeon VII but causes an amdgpu kernel module failure for some unknown reason (maybe the load is too high: when I rebooted and reran only the failed test suite, it worked normally; it is only running the entire test suite that crashes the GPU). The other packages passed all tests on both cards.
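About the gfx906 mask mentioned above: a package.use.mask entry along these lines (an illustrative sketch, the exact atom still needs to be settled when the mask is committed) would prevent users from building rocFFT-5.1.3 for gfx906 until the fix lands:

    # profiles/package.use.mask (illustrative)
    # rocFFT-5.1.3 has a critical bug on gfx906, see [1]
    ~sci-libs/rocFFT-5.1.3 amdgpu_targets_gfx906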
Meanwhile I’m also working on dev-libs/rccl and dev-libs/rocm-opencl-runtime. dev-libs/rccl, like sci-libs/roc-*, can utilize rocm.eclass and works well; however, there are build failures due to calling `chrpath -r` on a library without an rpath (rocm.eclass sets -DSKIP_RPATH=ON). I shall make it work in the coming week. For rocm-opencl-runtime, I managed to turn on USE=test, but there are test failures on the 6700 XT which need further investigation. Also, some of the tests in rocm-opencl-runtime need a DISPLAY. I tried virtualx.eclass as ionen suggested on #gentoo-soc IRC, but in my docker environment that didn't work. In Gentoo Prefix virtualx does not work, either.
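Back to the rccl chrpath failure: one likely fix is simply to drop the install-time chrpath call in src_prepare, so it is never run on a library that never got an rpath. This is only a sketch; the exact file and pattern in rccl's build system still need to be confirmed:

    src_prepare() {
        cmake_src_prepare
        # Hypothetical: remove the install-time "chrpath -r" invocation, which
        # fails because -DSKIP_RPATH=ON leaves the library without an rpath.
        sed -e '/chrpath/d' -i CMakeLists.txt || die
    }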
I came across another bug when compiling rccl-5.1.3 for gfx10xx [3]. After consulting the Gentoo LLVM maintainer, I opened an issue on llvm-project asking for acknowledgement on backporting a patch that fixes this problem to llvm-14 [4].
As I prepare to land the ROCm-5.1.3 toolchain in ::gentoo via this PR [5], I noticed another problem: hip and rocm-comgr have hard-coded clang include paths in their sources, so even a minor clang upgrade (e.g. 14.0.5 -> 14.0.6) would cause runtime problems. I consulted mgorny about this. He suggested that I try hacking into the clang Driver and see whether the include path can be extracted using the C++ API at runtime. I'll try this in the coming week; if I fail, adding a subslot to clang may be plan B. After this is fixed, I think hip-5.1.3 and rocm-comgr-5.1.3 will be ready to land in ::gentoo.
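For reference, plan B would just use Gentoo's regular slot-operator machinery: assuming sys-devel/clang gained a subslot that changes on every release, a dependency like the following (illustrative, not a committed ebuild) would force hip and rocm-comgr to be rebuilt on each clang bump, keeping the hard-coded path valid:

    # Hypothetical dependency in the hip/rocm-comgr ebuilds, assuming clang
    # carries its full version in a subslot; ":14=" triggers a rebuild
    # whenever the installed clang's subslot changes.
    DEPEND="sys-devel/clang:14="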
Due to limited time I have made little progress on rocm.eclass itself. I began reading how PYTHON_USEDEP works in the python eclasses, to prepare for ROCM_USEDEP. I plan to implement it in the coming week, completing the last piece of rocm.eclass.
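The intended usage is analogous to PYTHON_USEDEP: the eclass would export a ROCM_USEDEP variable expanding to the enabled amdgpu_targets_* USE dependencies, so a consumer can require its ROCm dependencies to be built for the same GPU targets. A sketch of the design, not the final API:

    # Hypothetical consumer ebuild snippet, mirroring how PYTHON_USEDEP is used
    inherit rocm
    DEPEND="sci-libs/rocBLAS[${ROCM_USEDEP}]"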
And here is a brief plan of future work for the following weeks, after lowering the priority of tensorflow and jax:
week 6: finish rocm.eclass, send for review; continue packaging ROCm libs;
week 7: modify rocm.eclass according to comments; packaging ROCm libs, including rocWMMA;
week 8: finalize rocm.eclass; start working on cupy;
week 9: cupy ebuild; start writing wiki;
week 10: get cupy landed in ::gentoo; bump dev-util/rocprofiler to 5.1.3;
week 11: continue wiki writing; consider ROCgdb;
week 12: finish wiki; summarize my GSoC.
[1] https://github.com/ROCmSoftwarePlatform/rocFFT/issues/369
[2] https://github.com/ROCmSoftwarePlatform/rocSPARSE/issues/258
[3] https://bugs.gentoo.org/851702#c15
[4] https://github.com/llvm/llvm-project/issues/56577
[5] https://github.com/gentoo/gentoo/pull/26441