{"id":358,"date":"2022-09-12T13:07:22","date_gmt":"2022-09-12T13:07:22","guid":{"rendered":"https:\/\/blogs.gentoo.org\/gsoc\/?p=358"},"modified":"2023-05-05T03:04:16","modified_gmt":"2023-05-05T03:04:16","slug":"refining-rocm-packages-in-gentoo-project-summary","status":"publish","type":"post","link":"https:\/\/blogs.gentoo.org\/gsoc\/2022\/09\/12\/refining-rocm-packages-in-gentoo-project-summary\/","title":{"rendered":"Refining ROCm Packages in Gentoo &#8212; project summary"},"content":{"rendered":"<p>Twelve weeks slipped away quickly, and I&#8217;m proud to say that the packaging quality of ROCm in Gentoo has improved through this project.<\/p>\n<p>Two sets of major deliverables were achieved: new ebuilds of the ROCm-5.1.3 tool-chain that depend purely on vanilla llvm\/clang, and <code>rocm.eclass<\/code> along with the ROCm-5.1.3 libraries that use it. Each brings one significant QA improvement compared to the original ROCm packaging method.<\/p>\n<p>Beyond these, I also maintained rocprofiler and rocm-opencl-runtime, bumping their versions with nontrivial changes. I discovered several bugs and discussed them with upstream. I also wrote the ROCm wiki pages, which started my journey on the Gentoo wiki.<\/p>\n<p>By writing <code>rocm.eclass<\/code>, I learnt a lot about eclass writing: how to design, how to balance needs and QA concerns, how to write comments and examples well, etc. I&#8217;m really grateful to the Gentoo developers who pointed out my mistakes and helped me polish my eclass.<\/p>\n<p>Since I&#8217;m working on top of the Gentoo repo, my work is scattered around rather than collected in a repo of my own. My major products can be seen in [0], where all my PRs to <code>::gentoo<\/code> are located. 
My weekly reports can be found on the <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/?s=ROCm\" target=\"_blank\" rel=\"noopener\">Gentoo GSoC blog<\/a>.<\/p>\n<p>[0] <a href=\"https:\/\/github.com\/gentoo\/gentoo\/pulls?q=is%3Apr+author%3Alittlewu2508+is%3Aclosed+closed%3A2022-06-13..2022-09-05+\" target=\"_blank\" rel=\"noopener\">My finished PRs for gentoo during GSoC 2022<\/a><\/p>\n<p>Details are as follows:<\/p>\n<h2>First, it&#8217;s about ROCm on vanilla llvm\/clang<\/h2>\n<p>ROCm has its own llvm fork, with some modifications that are not upstreamed yet. In the original Gentoo ROCm packaging roadmap, <code>sys-devel\/llvm-roc<\/code> was introduced as the ROCm-forked llvm\/clang. This is the simple way, and it worked well for ROCm-only packages [1]. But it causes trouble if a large project like blender pulls in dependencies that use vanilla llvm, which results in symbol collisions [2].<\/p>\n<p>So, when I noticed [1] in week 1, I began my journey of porting ROCm to vanilla clang. I was very lucky, because clang-14.0.5 had just been released at that time, eliminating major obstacles to porting (previous versions all have bugs to some degree). After some quick hacks I succeeded, as recorded in the week 1 report [3]. In that week I successfully built blender with hip cycles (GPU-accelerated render code written in HIP), and rendered some example projects on a Radeon RX 6700XT.<\/p>\n<p>While I was thrilled about porting the ROCm tool-chain to vanilla clang, my mentor pointed out that I had carelessly introduced some serious bugs into ::gentoo. In week 2, I fixed the bugs I had created and set up a reproducible test ground using docker, to make testing easier and cleaner and to prevent such bugs from happening again. 
Details can be found in week 2&#8217;s report [4].<\/p>\n<p>After that there was no significant progress in porting to vanilla clang, only bug fixes and ebuild polishing, until I met MIOpen in the last week.<\/p>\n<h3>The story of debugging MIOpen assemblies<\/h3>\n<p>In week 12 rocm.eclass was almost in its final shape, so I began to land ROCm libraries [1] including <code>sci-libs\/miopen<\/code>. ROCm libraries are usually written in &#8220;high level&#8221; languages like HIP, and <code>dev-util\/hip<\/code> had already been ported to vanilla clang in good shape, so there was no need to worry about compilation problems. However, MIOpen has various hand-written assembly code for JIT, which caused several test failures [5]. It was frustrating because I&#8217;m unfamiliar with AMDGPU assembly, so I was close to giving up (my mentor also suggested that I stop working on it during GSoC). Thus, I reported my problem to upstream in [5], with my debugging attempts attached.<\/p>\n<p>Thanks to the testing system mentioned previously, I had set up not only standard environments, but also one snapshot with full llvm\/clang debug symbols. I quickly located the problem and reported it to upstream via the issue, but I still didn&#8217;t know why the error was happening.<\/p>\n<p>On the second day, I decided to look at the assembly and the debugging results once again. This time fortune was on my side, and I discovered that the key issue was LLVM treating <code>Y<\/code> and <code>N<\/code> in metadata as boolean values, not strings (they should be kernel parameter names) [6]. I provided a fix in [7], and all tests passed on both Radeon VII and Radeon RX 6700XT. Amazing! 
I also described how excited I was in the week 12 report [8].<\/p>\n<p>[1] For example, ROCm libraries in <a href=\"https:\/\/github.com\/ROCmSoftwarePlatform\">https:\/\/github.com\/ROCmSoftwarePlatform<\/a><br \/>\n[2] <a href=\"https:\/\/bugs.gentoo.org\/693200\">https:\/\/bugs.gentoo.org\/693200<\/a><br \/>\n[3] <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/07\/12\/week-1-report-for-refining-rocm-packages-in-gentoo\/\">Week 1 Report for Refining ROCm Packages in Gentoo<\/a><br \/>\n[4] <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/07\/12\/106\/\">Week 2 Report for Refining ROCm Packages in Gentoo<\/a><br \/>\n[5] <a href=\"https:\/\/github.com\/ROCmSoftwarePlatform\/MIOpen\/issues\/1731\">https:\/\/github.com\/ROCmSoftwarePlatform\/MIOpen\/issues\/1731<\/a><br \/>\n[6] <a href=\"https:\/\/github.com\/ROCmSoftwarePlatform\/MIOpen\/issues\/1731#issuecomment-1236913096\">https:\/\/github.com\/ROCmSoftwarePlatform\/MIOpen\/issues\/1731#issuecomment-1236913096<\/a><br \/>\n[7] <a href=\"https:\/\/github.com\/littlewu2508\/gentoo\/commit\/40eb81f151f43eb5d833dc7440b02f12dab04b89\">https:\/\/github.com\/littlewu2508\/gentoo\/commit\/40eb81f151f43eb5d833dc7440b02f12dab04b89<\/a><br \/>\n[8] <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/09\/11\/week-12-report-for-refining-rocm-packages-in-gentoo\/\">Week 12 Report for Refining ROCm Packages in Gentoo<\/a><\/p>\n<h2>The second deliverable is rocm.eclass<\/h2>\n<p>The most challenging part for me was writing <code>rocm.eclass<\/code>. I started writing it in week 4 [9], and finished my design in week 8 [10] (including 10 days of temporary leave). In weeks 9-12, I posted 7 revisions of rocm.eclass to the gentoo-dev mailing list [10,11] and received many helpful comments. 
On the GitHub PR [12], I also got lots of suggestions from Gentoo developers.<\/p>\n<p>Eventually, I finished <code>rocm.eclass<\/code>, providing <code>amdgpu_targets USE_EXPAND<\/code>, <code>ROCM_REQUIRED_USE<\/code>, and <code>ROCM_USE_DEP<\/code> to control which GPU targets to compile for and to keep them coherent among dependencies. The eclass provides <code>get_amdgpu_flags<\/code> for <code>src_configure<\/code> and <code>check_amdgpu<\/code> for ensuring AMDGPU device accessibility in <code>src_test<\/code>. Finally, <code>rocm.eclass<\/code> was merged into <code>::gentoo<\/code> in [13].<\/p>\n<p>[9] <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/09\/11\/week-9-report-for-refining-rocm-packages-in-gentoo\/\">Week 9 Report for Refining ROCm Packages in Gentoo<\/a><br \/>\n[10] <a href=\"https:\/\/archives.gentoo.org\/gentoo-dev\/threads\/2022-08\/\">https:\/\/archives.gentoo.org\/gentoo-dev\/threads\/2022-08\/<\/a><br \/>\n[11] <a href=\"https:\/\/archives.gentoo.org\/gentoo-dev\/threads\/2022-09\/\">https:\/\/archives.gentoo.org\/gentoo-dev\/threads\/2022-09\/<\/a><br \/>\n[12] <a href=\"https:\/\/github.com\/gentoo\/gentoo\/pull\/26784\">https:\/\/github.com\/gentoo\/gentoo\/pull\/26784<\/a><br \/>\n[13] <a href=\"https:\/\/gitweb.gentoo.org\/repo\/gentoo.git\/commit\/?id=cf8a6a845b68b578772f2ae0d2703f203c6dec33\">https:\/\/gitweb.gentoo.org\/repo\/gentoo.git\/commit\/?id=cf8a6a845b68b578772f2ae0d2703f203c6dec33<\/a><\/p>\n<h2>Other coding products<\/h2>\n<h3>Merged ebuilds<\/h3>\n<h4>rocprofiler<\/h4>\n<p>I bumped <code>dev-util\/rocprofiler<\/code> and its dependencies to version 5.1.3, and fixed the loading of the proprietary aql profiler lib, so the ROCm stack on Gentoo stays fully open-source without losing most profiling functionality [14].<\/p>\n<p>[14] <a href=\"https:\/\/github.com\/ROCm-Developer-Tools\/rocprofiler\/issues\/38\">https:\/\/github.com\/ROCm-Developer-Tools\/rocprofiler\/issues\/38<\/a><\/p>\n<h3>Unmerged ebuilds<\/h3>\n<p>Due to limited time and 
the long testing period, the ebuilds of the ROCm-5.1.3 libraries (the ones using rocm.eclass) have not been merged. They can be found in <a href=\"https:\/\/github.com\/gentoo\/gentoo\/pull\/27219\">this PR<\/a>.<br \/>\n<code>dev-libs\/rocm-opencl-runtime<\/code> is a critical package because it provides OpenCL, and many users still use OpenCL for GPGPU since HIP is still new. I bumped it to 5.1.3 to match the vanilla clang tool-chain, and enabled its <code>src_test<\/code>, so users can make sure that vanilla clang isn&#8217;t breaking anything. The PR is located <a href=\"https:\/\/github.com\/gentoo\/gentoo\/pull\/26870\">here<\/a>.<\/p>\n<h3>Bug fixes<\/h3>\n<p>Fixing existing bugs was also a part of my GSoC. I created various PRs and closed the corresponding bugs on <a href=\"https:\/\/bugs.gentoo.org\/\">Gentoo Bugzilla<\/a>: <a href=\"https:\/\/bugs.gentoo.org\/822828\">#822828<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/853718\">#853718<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/851795\">#851795<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/851792\">#851792<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/852236\">#852236<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/850937\">#850937<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/836248\">#836248<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/836274\">#836274<\/a>, <a href=\"https:\/\/bugs.gentoo.org\/866839\">#866839<\/a>. Also, many bugs were fixed before new packages entered the gentoo main repo, or were found by myself in the first place, so there is no record of them on Bugzilla.<\/p>\n<h2>Last but not least, the wiki pages<\/h2>\n<p>I have created 3 pages [15-17], filling in important information about ROCm. 
I also received a lot of help from the Gentoo community, mainly focused on refining my wiki pages to meet the wiki&#8217;s standards.<\/p>\n<p>[15] <a href=\"https:\/\/wiki.gentoo.org\/wiki\/ROCm\">https:\/\/wiki.gentoo.org\/wiki\/ROCm<\/a><br \/>\n[16] <a href=\"https:\/\/wiki.gentoo.org\/wiki\/HIP\">https:\/\/wiki.gentoo.org\/wiki\/HIP<\/a><br \/>\n[17] <a href=\"https:\/\/wiki.gentoo.org\/wiki\/Rocprofiler\">https:\/\/wiki.gentoo.org\/wiki\/Rocprofiler<\/a><\/p>\n<h2>Comparison with original plan<\/h2>\n<p>The original plan in the proposal also contained <code>rocm.eclass<\/code>, but it only allocated the last week for &#8220;investigation on vanilla clang&#8221;. In week 1, my mentor and I added &#8220;porting ROCm on vanilla clang&#8221; to the plan, and this became the new major deliverable. Due to the time limit, packaging high level frameworks like pytorch and tensorflow was abandoned. I only worked to get CuPy working [18], showing rocm.eclass functionality on packages that depend on ROCm libraries.<\/p>\n<p>I think the change of plan and deliverables better reflects the project title &#8220;Refining&#8221;, because what I did greatly improves the quality of existing ebuilds, rather than introducing more ebuilds.<\/p>\n<p>[18] <a href=\"https:\/\/github.com\/littlewu2508\/gentoo\/commit\/3d142fa4b4ada560c053c2fd3c8c1501c82aace2\">https:\/\/github.com\/littlewu2508\/gentoo\/commit\/3d142fa4b4ada560c053c2fd3c8c1501c82aace2<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Twelve weeks slipped away quickly, and I&#8217;m proud to say that the packaging quality of ROCm in Gentoo has improved through this project. 
Two sets of major deliverables were achieved: new ebuilds of the ROCm-5.1.3 tool-chain that depend purely on &hellip; <a href=\"https:\/\/blogs.gentoo.org\/gsoc\/2022\/09\/12\/refining-rocm-packages-in-gentoo-project-summary\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":179,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/358"}],"collection":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/users\/179"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/comments?post=358"}],"version-history":[{"count":41,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/358\/revisions"}],"predecessor-version":[{"id":431,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/posts\/358\/revisions\/431"}],"wp:attachment":[{"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/media?parent=358"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/categories?post=358"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.gentoo.org\/gsoc\/wp-json\/wp\/v2\/tags?post=358"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}