Week 11 Report for Refining ROCm Packages in Gentoo

This week I mainly worked on the wiki pages and on refining rocm.eclass.

Although the current eclass works with my new ebuilds [1], Michał Górny has pointed out various flaws on the GitHub PR [2]. He also questioned whether rocm.eclass is necessary at all, because it looks like a combination of two eclasses. In my opinion, rocm.eclass has its value, mainly for handling USE_EXPAND and common phase functions. The ugly part is mainly in rocm_src_test: because the packages in [3] use inconsistent test methods, I have to detect which method a package uses and act accordingly. So my plan is to split the one-size-fits-all rocm_src_test into two functions, corresponding to the two scenarios (cmake test or standalone binary), and let each ebuild decide which one to use. This avoids the detailed detection code that makes rocm_src_test bloated.
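A minimal sketch of what such a split could look like. The function names rocm_test_cmake and rocm_test_standalone are hypothetical and not the final eclass API; cmake_src_test is the phase function that cmake.eclass provides inside an ebuild, and a real eclass would call die instead of returning an error code.

```sh
# Hypothetical names; a sketch of the planned split, not the final eclass API.

# Scenario 1: the package registers its tests with CMake/CTest.
rocm_test_cmake() {
    # cmake_src_test comes from cmake.eclass when this runs inside an ebuild.
    cmake_src_test "$@"
}

# Scenario 2: the build only produces a standalone test binary.
# Usage: rocm_test_standalone <path-to-binary> [args...]
rocm_test_standalone() {
    local test_bin=$1
    shift
    if [ ! -x "${test_bin}" ]; then
        # An eclass would use `die` here; plain shell can only return an error.
        echo "test binary not found or not executable: ${test_bin}" >&2
        return 1
    fi
    "${test_bin}" "$@"
}
```

Each ebuild would then call exactly one of the two from its own src_test, so no detection logic is needed.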

Wiki writing: I think the main parts of the ROCm [4] and HIP [5] wiki pages are nearly finished. But because rocm.eclass is delayed, the related information (ROCm#Developing guide) has not been added yet. There is also a reserved section, ROCm#Installation guide. I have little idea how to write this part, because ROCm is a wide collection of packages. Maybe a meta package (there are users working on this) would be helpful.

To be honest, I’m a bit anxious: there is only one week left, but a lot still has to be decided and tested in rocm.eclass and the sci-libs/roc* ebuilds. I hope I can resolve these core issues in the last week.

[1] https://github.com/littlewu2508/gentoo/tree/rocm-5.1.3-scilibs
[2] https://github.com/gentoo/gentoo/pull/26784
[3] https://github.com/ROCmSoftwarePlatform
[4] https://wiki.gentoo.org/wiki/ROCm
[5] https://wiki.gentoo.org/wiki/HIP

Posted in ROCm Packages | Leave a comment

Week 10 Report for Refining ROCm Packages in Gentoo

This week I learned a lot from Ulrich’s comments on rocm.eclass. I polished the eclass to v3 and sent it to the gentoo-dev mailing list. However, I noticed another error introduced in v3, and I’ll include a fix for it in v4 in the following days.

The other half of my time was spent testing the sci-libs/roc* packages on various platforms using rocm.eclass. rocm.eclass did its job as expected, so I believe it can be merged after v4.

With src_test enabled, I found various test failures. rocBLAS-5.1.3 fails 3 tests on the Radeon RX 6700XT by slightly exceeding the tolerance, which does not seem to be a big issue. rocFFT-5.1.3 fails 16 suites on the Radeon VII [1], which is serious and confirmed by upstream, so I suggest masking the amdgpu_targets_gfx906 USE flag for rocFFT-5.1.3. Just today I observed that MIOpen fails many tests, probably due to vanilla clang. I’ll open issues and report these test failures upstream. Running the test suites takes a lot of time and often drains the GPU: testing rocBLAS may take more than 15 hours, even on a performant CPU like the Ryzen 5950X. If I use the GPU to render graphics (run a desktop environment) and run tests simultaneously, the amdgpu driver often fails. I hope one day we can have a testing farm for ROCm packages, but that would be expensive, because there are many GPU architectures and compilation takes a long time.
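For anyone who wants to apply the suggested mask locally before it is in the tree, a user-level equivalent could look like the sketch below. It assumes the package atom is sci-libs/rocFFT and that the flag is spelled exactly as above; add_rocfft_mask is a hypothetical helper name, and an in-tree mask would go into the repository's profiles/ directory instead.

```sh
# Sketch: append a USE mask for the gfx906 target of rocFFT 5.1.3.
# Assumptions: atom name sci-libs/rocFFT, flag name amdgpu_targets_gfx906.
# Usage: add_rocfft_mask <profile-dir>   (e.g. /etc/portage/profile)
add_rocfft_mask() {
    local profile_dir=$1
    mkdir -p "${profile_dir}" || return 1
    printf '%s\n' \
        '# rocFFT 5.1.3 fails test suites on gfx906 (Radeon VII), see rocFFT issue 369' \
        '=sci-libs/rocFFT-5.1.3 amdgpu_targets_gfx906' \
        >> "${profile_dir}/package.use.mask"
}
```

Taking the directory as an argument keeps the helper testable without touching /etc.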

I planned to finish the drafts of the wiki pages [2,3], but it turned out that I was running out of time. I’ll catch up in week 11. My mentor was also busy in week 10, so my PR for rocm-opencl-runtime is still pending review. We are now working on the dependency issues of ROCm packages: incompatibilities with gcc-12 and gcc-11.3.0. Due to two bugs, the current stable gcc, gcc-11.3.0, cannot compile some ROCm packages [4], and the current unstable gcc, gcc-12, is unable to compile nearly all ROCm packages [5].

I’ll continue with what was postponed in week 10: landing rocm.eclass and the sci-libs packages, preparing cupy, fixing bugs, and writing the wiki pages. I’ll investigate MIOpen’s situation as well.

[1] https://github.com/ROCmSoftwarePlatform/rocFFT/issues/369
[2] https://wiki.gentoo.org/wiki/ROCm
[3] https://wiki.gentoo.org/wiki/HIP
[4] https://bugs.gentoo.org/842405
[5] https://bugs.gentoo.org/857660


Week 9 Report for Refining ROCm Packages in Gentoo

This week I mainly focused on dev-libs/rocm-opencl-runtime.

I bumped dev-libs/rocm-opencl-runtime to 5.1.3, which was relatively easy. The difficult part was enabling its tests. I came across a major problem: the oclgl test requires an X server. I compiled with debug options and used gdb to dive into the code, but found no simple solution. Currently the test needs an X server whose OpenGL vendor is AMD. Xvfb only provides llvmpipe, which does not meet the requirement. I consulted some friends, who said NVIDIA recommends using EGL when there is no X [1], but apparently ROCm can only get OpenGL from X [2]. So my workaround is to let the user pass an X display to the ebuild through the environment variable OCLGL_DISPLAY (the DISPLAY variable is wiped when calling emerge, while this one survives). If no display is detected, or glxinfo shows that the OpenGL vendor is not AMD, src_test dies with a message telling the user to run an X server on the amdgpu driver.
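The check described above can be sketched roughly as follows. The function names are hypothetical and this is not the code from the PR; it only assumes that glxinfo prints an "OpenGL vendor string" line and that the vendor reads AMD on a suitable server. A real ebuild would call die where this sketch returns 1.

```sh
# Sketch of the display check; function names are hypothetical.

# Reads glxinfo output on stdin; succeeds only if the OpenGL vendor is AMD.
# Only the vendor line is inspected, so "AMD" in the renderer line is ignored.
oclgl_vendor_is_amd() {
    grep 'OpenGL vendor string' | grep -q 'AMD'
}

# Succeeds if OCLGL_DISPLAY names an X server whose OpenGL vendor is AMD.
oclgl_check_display() {
    if [ -z "${OCLGL_DISPLAY:-}" ]; then
        echo "OCLGL_DISPLAY is not set; run an X server on the amdgpu driver" >&2
        return 1
    fi
    if ! DISPLAY="${OCLGL_DISPLAY}" glxinfo 2>/dev/null | oclgl_vendor_is_amd; then
        echo "OpenGL vendor on ${OCLGL_DISPLAY} is not AMD" >&2
        return 1
    fi
}
```

With this approach a user would run something like `OCLGL_DISPLAY=:0 emerge ...` with tests enabled, since DISPLAY itself does not survive into the ebuild environment.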

I was also trapped by a CRLF problem in src_test of dev-libs/rocm-opencl-runtime. Tests listed in oclperf.exclude should be skipped for the oclperf test, but they were not. After numerous trials, I finally found that this file uses CRLF line endings instead of LF, which made the exclusion fail 🙁
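A quick way to spot and neutralize this kind of problem, as a minimal sketch: has_crlf and strip_crlf are hypothetical helper names, and the actual fix in the ebuild may be done differently. Note that tr removes every carriage return, not only those directly before a line feed, which is fine for a plain list of test names.

```sh
# Succeeds if the file contains at least one carriage return,
# detected by comparing the byte count with and without '\r'.
has_crlf() {
    [ "$(tr -d '\r' < "$1" | wc -c)" -ne "$(wc -c < "$1")" ]
}

# Prints the file with all carriage returns removed; the original is untouched.
strip_crlf() {
    tr -d '\r' < "$1"
}
```

For example, `has_crlf oclperf.exclude && strip_crlf oclperf.exclude > oclperf.exclude.lf` would produce an LF-only copy of the exclusion list.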

Nevertheless, the rocm-opencl-runtime tests passed on the Radeon RX 6700XT! That is a good thing, because I know many Gentoo users rely on this package to provide OpenCL for their computations, and correctness is vital. Previously src_test was not enabled for it. The PR is now at [6].

Other work included starting the wiki writing [3,4], refining rocm.eclass according to feedback (not much, see the gentoo-dev mailing list), and finding a bug in dev-util/hip: the FindHIP.cmake module is not in the correct place. A fix can be found in [5], but I need to polish the patch further before opening a PR.

If there are no further suggestions on rocm.eclass, I’ll land it in ::gentoo next week and start pushing the sci-libs version bumps that I have already prepared locally.

[1] https://developer.nvidia.com/blog/egl-eye-opengl-visualization-without-x-server/
[2] https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/blob/bbdc87e08b322d349f82bdd7575c8ce94d31d276/tests/ocltst/module/common/OCLGLCommonLinux.cpp
[3] https://wiki.gentoo.org/wiki/ROCm
[4] https://wiki.gentoo.org/wiki/HIP
[5] https://github.com/littlewu2508/gentoo/tree/hip-correct-cmake
[6] https://github.com/gentoo/gentoo/pull/26870


Week 8 Report for Refining ROCm Packages in Gentoo

This week I made progress on two major fronts: dev-util/rocprofiler and rocm.eclass.

I have implemented all the functions I think are necessary for rocm.eclass. I have just sent the rocm.eclass draft to the gentoo-dev mailing list (there is also a GitHub PR at [1]); please review it. In the following weeks, I will collect feedback and continue to polish it.

In summary, I have implemented the following features listed in my proposal:
USE_EXPAND of amdgpu_targets_, and ROCM_USEDEP to keep the USE flags coherent among dependencies;
rocm_src_configure, which contains the common arguments of src_configure;
rocm_src_test, which checks the permissions on /dev/kfd and /dev/dri/render*.
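The permission check in the last item can be sketched as below. check_gpu_access is a hypothetical helper name rather than the function in the draft eclass, and the draft may check access differently (for example through group membership); an eclass would also call die instead of returning 1.

```sh
# Hypothetical helper: succeeds only if every given device node is readable
# and writable by the current user; reports the first one that is not.
check_gpu_access() {
    local dev
    for dev in "$@"; do
        if [ ! -r "${dev}" ] || [ ! -w "${dev}" ]; then
            echo "no read/write access to ${dev}" >&2
            return 1
        fi
    done
}

# Intended call inside a test phase (not executed here):
#   check_gpu_access /dev/kfd /dev/dri/render*
```

Failing early with a clear message is more useful than letting hundreds of GPU tests fail one by one because the build user cannot open the device.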

There are also some items listed in the proposal that I decided not to implement for now:
rocm_src_prepare: although there are some similarities among ebuilds, src_prepare is highly customized for each ROCm component. Unifying it would take extra work.
SRC_URI: currently SRC_URI is already specified in each ebuild. It does not hurt to keep the status quo.

Moreover, during implementation I found another feature necessary:
rocm_src_test: it handles the different test scenarios correctly. ROCm packages may have a cmake test, which can be run using cmake_src_test, or may only compile some test binaries that have to be executed from the command line. I made rocm_src_test detect the method automatically, so ROCm packages just have to call this function directly without doing anything else.
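One way such detection could work, as a sketch under a stated assumption: CMake writes a CTestTestfile.cmake into the build directory when a project enables testing, so its presence is used here as the signal. detect_test_method is a hypothetical name, and the draft eclass may well detect the method in a different way.

```sh
# Hypothetical detection helper. Assumption: a CTestTestfile.cmake in the
# build directory means the package registered its tests with CTest.
# Prints "cmake" or "standalone".
detect_test_method() {
    local build_dir=$1
    if [ -f "${build_dir}/CTestTestfile.cmake" ]; then
        echo cmake
    else
        echo standalone
    fi
}
```

This is only a heuristic: a project can enable testing without registering any useful tests, in which case the standalone binaries would still have to be run by hand.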

Actually, I never imagined that rocm.eclass would eventually take this shape. Initially I thought it would just provide some utilities, mainly src_test and USE_EXPAND. But while implementing it I found that all these features require careful treatment. The comments (mainly examples) also take up half of the length. It ends up at 278 lines, which is mid-sized among current eclasses. Maybe it can be trimmed down after polishing, because there could be awkward implementations or reinventions in it.

Based on my draft rocm.eclass, I have prepared sci-libs/roc*-5.1.3, sci-libs/hip*-5.1.3 and dev-python/cupy, which make use of it. It feels great to simplify the ebuilds, and portage handles the USE_EXPAND and dependencies just as expected. Once rocm.eclass gets into the tree, I’ll push those ROCm-5.1.3 ebuilds.

Another thing to mention is that the ROCm-5.1.3 toolchains finally got merged [5], with the fixed dev-util/rocprofiler-{4.3.0,5.0.2,5.1.3}. rocprofiler was actually buggy before: I thought I had committed the patch that strips the libhsa-amd-aqlprofile.so loading (I even claimed so in the commit message), but it was never committed and got lost in history. So I reproduced the patch. I also did some research on this proprietary library. Supposedly, not loading it means that tracing hsa/hip is not possible: you only get basic information like the name and time of each GPU kernel execution, but do not know the pipeline of kernel execution (which one has spawned which kernel). AQL should be the HSA Architected Queuing Language (HSA AQL), documented at https://llvm.org/docs/AMDGPUUsage.html#hsa-aql-queue. It does sound related to the pipeline of kernel dispatching. According to its description, libhsa-amd-aqlprofile.so is an extension API of AQL Profile. But in practice, patching the source code so that rocprofiler does not load libhsa-amd-aqlprofile.so does not break the tracing of hsa/hip. So I’m not sure why libhsa-amd-aqlprofile.so is needed, and I raised a question at [2]. The fix is completed in [3,4].

According to the renewed proposal (I was away for two weeks, so the plan has changed), I should collect feedback and refine rocm.eclass, and prepare dev-python/cupy and sci-libs/rocWMMA. I’ll investigate ROCgdb, too. Also, rocm-device-libs is a major package, because many users rely on it to provide OpenCL; I’ll work on bumping its version as well. What’s more, with hip-5.1.3 against vanilla clang, ROCm for Blender can land in ::gentoo.

[1] https://github.com/gentoo/gentoo/pull/26784
[2] https://github.com/RadeonOpenCompute/ROCm/issues/1781
[3] https://github.com/gentoo/gentoo/pull/26755
[4] https://github.com/gentoo/gentoo/pull/26771
[5] https://github.com/gentoo/gentoo/pull/26441


Week 12 Report for RISC-V Support for Gentoo Prefix

Hello all,
I hope you are all doing well. This is my report for the 12th week of my Google Summer of Code project.

I got the documentation on porting Prefix reviewed and have incorporated the suggested changes.

My GSoC deliverables have been completed, so I played around with the compatibility layer and Ansible. I synced the latest changes to the bootstrap script from upstream and used it to install Prefix, and I am now updating main.yml [1] accordingly. The process has been smooth so far; within the next few weeks we might have a working compatibility layer for RISC-V.

I will now start working on the final report and update the posts on the Gentoo Blog site. Although the official period is over, I will continue working on the compatibility layer, and there are a few other things on my bucket list, like pkgcraft, that I will get my hands on.

The 12 weeks of GSoC have been super fun, thanks to mentors and the community.

[1] https://github.com/EESSI/compatibility-layer/blob/main/ansible/playbooks/roles/compatibility_layer/defaults/main.yml

Regards,
wiredhikari

Posted in RISC-V Prefix | Leave a comment

Gentoo musl Support Expansion for Qt/KDE Week 12

This week has mostly been spent on writing documentation and fixing up some leftover things.

I started with looking over the *-standalone libraries. It turns out that tree.h is provided by libbsd and because libbsd works just fine on musl I removed the standalone. The second thing I did was removing error.h because it caused issues with some builds, and we suspect it works on Void Linux because they build packages inside a clean chroot (without error.h). The only one left is now cdefs.h. This header is an internal glibc header, and using it is basically a bug, so upstreaming fixes should be very easy. Therefore I feel like this doesn’t need to be added either, so I closed the pull request for now.

Next I rewrote Sam’s musl porting notes, moving them from his personal page to a “real” wiki page (https://wiki.gentoo.org/wiki/Musl_porting_notes). It now reads more like a wiki page and less like a list of errors with attached fixes. I’ve also added several things of my own to it.

Another wiki page I’ve added to is Chroot (https://wiki.gentoo.org/wiki/Chroot#Sound_and_graphics). In my GSoC planning I wanted to write documentation about using Gentoo musl, including how to work around glibc programs that do not work on musl, e.g. proprietary programs. Instead, I documented how to run graphical applications with sound in a chroot and added that to the Chroot documentation, as it helps every Gentoo user. I don’t think Gentoo musl users should have any issues finding the Chroot wiki page. 🙂

I have also tested gettext-tiny on Gentoo musl. This is a smaller implementation of gettext with some functionality stubbed out. gettext-tiny is built for musl, and it makes use of the libintl provided by musl. For users that only want English this makes a lot of sense because it is much smaller than gettext but still allows most packages to be built. When replacing gettext Portage complained about two packages using uninstalled libraries from GNU gettext, those being bison and poxml. When reemerging bison it errored out and I was sure it was because of gettext, but after debugging bison I found out it was caused by error-standalone. After unmerging error-standalone bison detected that the library was not installed and it compiled correctly. Poxml on the other hand hard depends on libgettextpo, a library not provided by gettext-tiny. Running “equery d -a poxml” however we can see that nothing important actually depends on poxml, so gettext-tiny should for the most part be fine.

$ equery d -a poxml
* These packages depend on poxml:
kde-apps/kdesdk-meta-22.04.3-r1 (>=kde-apps/poxml-22.04.3:5)
kde-apps/kdesdk-meta-22.08.0 (>=kde-apps/poxml-22.08.0:5)

Next week I will write my final evaluation and then I am done with GSoC! I will however continue working with some things like ebuildshell and crossdev when I have time.

Posted in musl KDE | Leave a comment

Gentoo musl Support Expansion for Qt/KDE Week 11

This week has mostly been dedicated to fixing old and harder problems that I had previously put off. I spent a whole lot of time learning about the AccountsService codebase and setting up systems with LDAP authentication, but after reading a couple of issues on the GitLab page it turned out that it didn’t need a rewrite; more on that later.


Week 11 Report for RISC-V Support for Gentoo Prefix

Hello all,

I hope everyone is fine. This is my report for the 11th week of my GSoC project. This week I worked on documentation, closed dangling PRs, and looked into bootstrapping the EESSI compat layer for RISC-V. I spent some of my time learning Ansible as part of the process.

The documentation [1] is almost complete. I will work on the mentors’ feedback, run it through some review tools, and fix it accordingly. In the upcoming week I will look into the EESSI compat layer for RISC-V and write a blog post for the end-term evaluations.

[1] https://github.com/wiredhikari/prefix_on_riscv/blob/main/docs/porting.md

Regards,

wiredhikari


Week 10 Report for RISC-V Support for Gentoo Prefix

Hello all,
I hope everyone is doing well. This is my report for the 10th week of my GSoC project. This week I worked on testing packages with all keywords in app-portage [1] and app-arch [2]. I am also testing sci-physics/geant; it has several flags, so it is taking quite some time.

 

In the upcoming week I will do the same for sys-apps. My mentor has also shared the list of packages they use in Prefix, so we will test those on RISC-V as well. I continued working on the “Porting Prefix” documentation and will get it reviewed this week. I will also get reviews for dangling PRs like [4] and [5].

[1] https://github.com/gentoo/gentoo/pull/26679
[2] https://github.com/gentoo/gentoo/pull/26725
[3] https://github.com/gentoo/gentoo/pull/26902
[4] https://github.com/gentoo/gentoo/pull/26848
[5] https://github.com/gentoo/gentoo/pull/26853


Regards,
wiredhikari


Gentoo musl Support Expansion for Qt/KDE Week 10

This week I’ve finished testing the KDE applications, cleaned up the Mauikit ebuild commits, and made various fixes to Portage and Crossdev. I have also started writing a little on the PinePhone Pro wiki page, and have gotten Gentoo musl installed on riscv64 (VisionFive).

I’ll start with the KDE app tests. I made a simple application called gdepend that takes any number of package atoms as command-line arguments and prints a space-separated list of all their dependencies combined. This was useful because kde-apps-meta is a “meta-meta” package, meaning it depends on other meta packages, so this allowed me to run “FEATURES=test emerge $(gdepend $(gdepend kde-apps-meta))” to test every KDE application. It is basically equery depend, but prints the output in an emerge-friendly way.
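The idea can be illustrated with a small sketch. This is not the actual gdepend source: lookup_deps is a hypothetical helper that is expected to print the direct dependencies of one atom, one per line, and how it obtains them from Portage is left open here.

```sh
# Sketch of the gdepend idea, not the real tool.
# lookup_deps is a hypothetical helper: given one atom, it prints that
# atom's direct dependencies, one per line.
gdepend_sketch() {
    local atom
    for atom in "$@"; do
        lookup_deps "${atom}"
    done | sort -u | paste -sd ' ' -
}
```

Because the output is a single space-separated line with duplicates removed, it can be nested the same way as in the command above, e.g. `gdepend_sketch $(gdepend_sketch kde-apps-meta)`.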

