{"id":129,"date":"2022-07-18T16:07:43","date_gmt":"2022-07-18T16:07:43","guid":{"rendered":"https:\/\/blogs.gentoo.org\/gsoc\/?p=129"},"modified":"2022-07-18T16:07:43","modified_gmt":"2022-07-18T16:07:43","slug":"week-5-report-for-refining-rocm-packages-in-gentoo","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/gsoc\/2022\/07\/18\/week-5-report-for-refining-rocm-packages-in-gentoo\/","title":{"rendered":"Week 5 Report for Refining ROCm Packages in Gentoo"},"content":{"rendered":"<p>Week 5 was mainly about using and testing the rocm.eclass I wrote: packaging and testing the ROCm-5.1.3 libraries. I also began to land ROCm-5.1.3 toolchains in ::gentoo. However, new problems emerged and I&#8217;m a bit behind schedule, so after discussing it with my mentor, I decided to make packaging tensorflow and jax with ROCm a low-priority job.<\/p>\n<p>My latest progress is at https:\/\/github.com\/littlewu2508\/gentoo\/tree\/rocm-5.1.3: sci-libs\/roc{BLAS,FFT,PRIM,SPARSE,Thrust}-5.1.3, all using rocm.eclass. 
I have written amdgpu_targets.desc and added it to profiles\/desc, so each amdgpu_targets_ USE_EXPAND flag has its own description (the name and codename of the architecture, as well as the graphics cards it covers).<\/p>\n<p><img loading=\"lazy\" class=\"alignnone size-medium wp-image-130\" src=\"http:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-300x56.png\" alt=\"\" width=\"300\" height=\"56\" srcset=\"https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-300x56.png 300w, https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-1024x193.png 1024w, https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-768x145.png 768w, https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-1536x289.png 1536w, https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606-882x165.png 882w, https:\/\/blogs.gentoo.org\/gsoc\/files\/2022\/07\/Screenshot_20220719_000606.png 1891w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p>It turned out that rocm.eclass simplifies those ebuilds, especially src_test. I have spent some time testing those libraries on a Radeon VII and a Radeon RX 6700XT. By running the tests I found a critical bug in rocFFT-5.1.3 [1], which upstream has confirmed. Caution is needed here: until the bug is fixed, amdgpu_targets_gfx906 should be masked for rocFFT-5.1.3. The 6700XT, on the other hand, failed several rocSPARSE tests, which upstream has explained [2]. rocBLAS passes its tests on the Radeon VII, but running the whole test suite in one go causes an amdgpu kernel module failure for an unknown reason. The load may be too high: after a restart, the test suite that had failed ran normally on its own, and only running the entire set of tests took the GPU down. Other packages passed all the tests on these two cards.<\/p>\n<p>Meanwhile I&#8217;m also working on dev-libs\/rccl and dev-libs\/rocm-opencl-runtime. 
dev-libs\/rccl, like sci-libs\/roc*, can use rocm.eclass and that works well; however, there are build failures caused by calling `chrpath -r` on a library without an rpath (rocm.eclass sets -DSKIP_RPATH=ON). I will make it work in the coming week. For rocm-opencl-runtime, I managed to turn on USE=test, but there are test failures on the 6700XT which need further investigation. Also, some of the tests in rocm-opencl-runtime need a DISPLAY. I tried virtualx.eclass as ionen suggested in the #gentoo-soc IRC channel, but that didn&#8217;t work in my docker environment. virtualx does not work in Gentoo Prefix, either.<\/p>\n<p>I came across another bug when compiling rccl-5.1.3 for gfx10xx [3]. After consulting the Gentoo llvm maintainer, I opened an issue on llvm-project to ask for approval to backport a patch to llvm-14 which fixes this problem [4].<\/p>\n<p>As I prepared to land the ROCm-5.1.3 toolchain in ::gentoo via this PR [5], I noticed another problem. hip and rocm-comgr have a hard-coded clang include path in their sources, so any clang upgrade (even a patch-level upgrade like 14.0.5 -&gt; 14.0.6) would cause runtime problems. I have consulted mgorny about this. He suggested that I try hacking into the clang Driver to see whether the include path can be extracted at runtime using the C++ API. I&#8217;ll try this in the coming week, and if that fails, adding a subslot to clang may be plan B. Once this is fixed, I think hip-5.1.3 and rocm-comgr-5.1.3 will be ready to land in ::gentoo.<\/p>\n<p>Due to limited time I have made little progress on rocm.eclass. I began reading about PYTHON_USEDEP in the python eclasses, to prepare for ROCM_USEDEP. 
I plan to implement this in the coming week, completing the last piece of rocm.eclass.<\/p>\n<p>Here is the brief plan of future work for the following weeks, after lowering the priority of tensorflow and jax:<\/p>\n<p>week 6: finish rocm.eclass, send for review; continue packaging ROCm libs;<br \/>\nweek 7: modify rocm.eclass according to comments; packaging ROCm libs, including rocWMMA;<br \/>\nweek 8: finalize rocm.eclass; start working on cupy;<br \/>\nweek 9: cupy ebuild; start writing wiki;<br \/>\nweek 10: get cupy landed in ::gentoo; bump dev-util\/rocprofiler to 5.1.3;<br \/>\nweek 11: continue wiki writing; consider ROCgdb;<br \/>\nweek 12: finish wiki; summarize my GSoC.<\/p>\n<p>[1] https:\/\/github.com\/ROCmSoftwarePlatform\/rocFFT\/issues\/369<br \/>\n[2] https:\/\/github.com\/ROCmSoftwarePlatform\/rocSPARSE\/issues\/258<br \/>\n[3] https:\/\/bugs.gentoo.org\/851702#c15<br \/>\n[4] https:\/\/github.com\/llvm\/llvm-project\/issues\/56577<br \/>\n[5] https:\/\/github.com\/gentoo\/gentoo\/pull\/26441<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Week 5 was mainly about using and testing the rocm.eclass I wrote: packaging and testing the ROCm-5.1.3 libraries. I also began to land ROCm-5.1.3 toolchains in ::gentoo. 
However, new problems emerged and I&#8217;m a bit behind schedule, so after &hellip; <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/07\/18\/week-5-report-for-refining-rocm-packages-in-gentoo\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":179,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/129"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/users\/179"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/comments?post=129"}],"version-history":[{"count":1,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/129\/revisions"}],"predecessor-version":[{"id":131,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/129\/revisions\/131"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/media?parent=129"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/categories?post=129"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/tags?post=129"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}