AMDGPU install errors

Hi, first time poster here… not that that means anything… in need of a little help.

Whilst building AMDGPU I am getting errors as below:

┌ Error: Error building AMDGPU:
│ WARNING: redefinition of constant config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: redefinition of constant previous_config_path. This may fail, cause incorrect answers, or produce other errors.
│ WARNING: replacing module Previous.
│ [ Info: Found useable ld.lld at /usr/bin/ld.lld
│ paths = ["/opt/rocm/hsa/lib"]
│ Could not find library 'rocblas'.
│ Could not find library 'rocsparse'.
│ Could not find library 'rocalution'.
│ Could not find library 'rocfft'.
│ Could not find library 'rocrand'.
│ Could not find library 'MIOpen'.

│ AMDGPU.jl has been built successfully, but there were warnings.
│ Some functionality may be unavailable.

I have rocm in /opt/ and the above libraries seem to be in there (except for “rocalution”) but I am a little unsure how to get AMDGPU to build with those libraries. Is there a way to softlink them or maybe specify their paths? As you can probably tell, I’m a bit new to this, so I am sorry if this is trivial (I hope so) or is the result of my own stupidity; I have spent a couple of days trawling for an answer.

I am on Fedora 33, if that helps.

Regards

Gregg

In case anybody else has similar problems… I ended up adding .conf files containing the path for each library to /etc/ld.so.conf.d/ and reloading ldconfig; it worked. However, AMDGPU fails on pretty much everything when tested, so maybe, you know, don’t do it this way.
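For anyone wanting to try the same thing, here is a minimal sketch of that ld.so.conf.d approach. The helper name `write_rocm_conf` and the `/opt/rocm/<lib>/lib` layout in the usage comment are assumptions for illustration, not something confirmed in this thread; check where the .so files actually live on your system first.

```shell
#!/bin/sh
# Sketch: collect ROCm library directories into one ld.so.conf.d file.
# ASSUMPTION: write_rocm_conf is a hypothetical helper, and the example
# directories below are guesses at a typical /opt/rocm layout.

# write_rocm_conf CONF_FILE DIR...
# Writes each DIR that exists on its own line of CONF_FILE; missing
# directories are skipped so the linker cache is not given dead paths.
write_rocm_conf() {
    conf_file=$1
    shift
    : > "$conf_file"                      # create or truncate the file
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir" >> "$conf_file"
        fi
    done
}

# Example use (as root), followed by a linker cache refresh:
#   write_rocm_conf /etc/ld.so.conf.d/rocm.conf \
#       /opt/rocm/rocblas/lib /opt/rocm/rocfft/lib
#   ldconfig
```

Putting everything in a single file makes it easy to undo: delete the file and run ldconfig again.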

Cool… I just got a mini-PC yesterday with a basic AMD Radeon GPU… if it works I can finally play with GPU computation…

Thanks for the info… :slight_smile: :slight_smile:

Edit:

Note: The integrated GPUs of Ryzen are not officially supported targets for ROCm.

Too bad :frowning:

Hi @saltchunkmary, I’m the author of AMDGPU.jl. What you see above is just a verbose way for AMDGPU to tell you that not all of the ROCm external libraries are available, so some features will be disabled. The build was actually completing successfully (as far as I can tell). This ugly error-ish message is made much nicer on the AMDGPU master branch, which I recommend upgrading to. Regardless of the build warnings, we don’t have support for rocALUTION right now; we only detect it, but that will change once someone adds support for it.

Regarding the “fails on pretty much everything when tested”, could you give me a snippet of one of those errors and the stacktrace for it? It’s probably something simple that I can help you fix.

@sylvaticus that’s technically true, but I’ve been doing basic tests of AMDGPU on my Raven Ridge APU, and it’s usually worked OK. The big issue, though, is that sometimes AMDGPU will somehow cause my display manager to die, and so I need to reboot. I suspect over time AMD will fix these issues so that APUs work perfectly fine with ROCm, but they’re definitely still a bit unstable.

Also, the ld.so approach should be OK, but I would recommend instead setting your LD_LIBRARY_PATH environment variable to the folder containing the .so libraries. The build step checks for ROCm libraries in those paths.
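A minimal sketch of the LD_LIBRARY_PATH route, for reference. The helper name `prepend_ld_path` and the directories in the usage comment are assumptions for illustration; substitute the folders that actually contain your ROCm .so files.

```shell
#!/bin/sh
# Sketch: add a directory to LD_LIBRARY_PATH without duplicating entries.
# ASSUMPTION: prepend_ld_path is a hypothetical helper name.

# prepend_ld_path DIR
# Prepends DIR to LD_LIBRARY_PATH unless it is already listed, then exports it.
prepend_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;   # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Example use (the directories are guesses at a typical layout), followed by
# a rebuild of the package from the same shell:
#   prepend_ld_path /opt/rocm/rocblas/lib
#   prepend_ld_path /opt/rocm/rocfft/lib
#   julia -e 'using Pkg; Pkg.build("AMDGPU")'
```

The variable only applies to processes started from that shell, so Julia has to be launched from it (or the export added to your shell profile).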

Yeah, support is a bit thin on the ground for AMD and GPU computing. I’m my own worst enemy; I have just started trying to learn how to use GPUs for computing and I had the opportunity to build up my rig with an NVIDIA card but went the open-source route… I’d probably be well up and running by now with a 1650 Super :slight_smile: Road less traveled and all that, I suppose. Good luck.

Hi @jpsamaroo, thanks for working on AMDGPU; if I can get it to work on my setup I’ll be stoked. I finally figured out the LD_LIBRARY_PATH option too, cheers for confirming it. As for the errors: I was on Fedora when I was doing this… Right, long story short, I am attempting to learn how to use GPU computing, mainly out of interest but with an element of maybe folding it into my work at some point. At the same time I figured I might as well learn Julia too (I tried it back in ca. 2013 but ended up staying in R due to my studies/work) as I, correctly or not, like the all-under-one-roof thing it’s got going on… Anyway, I have set up my box with Arch now and I am getting a completely new set of errors spat out at me. On Fedora, I managed to build AMDGPU with all the libraries, except rocALUTION of course, but it failed when I was attempting some fairly trivial tests (using Flux, I think). On Arch the errors are fairly long (I don’t want to drop the entire log as I don’t know how to paste it in a code block yet), so here is the top part…

Building AMDGPU → ~/.julia/packages/AMDGPU/lrlUy/deps/build.log
┌ Error: Error building AMDGPU:
│ /build/hip-rocclr/src/HIP-rocm-3.10.0/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

I am guessing it is not picking up my GPU!?

Also, I am currently messing about with my rig and I can throw whichever distro on it is best for the job, i.e. has the best support and is easiest to work with etc… Do you have any recommendations?

Thanks for any help and sorry for the rambling post.

PS. On Arch, I had to rebuild rocSPARSE with some changes to the CMake file, essentially having to :clank- each of the listed GPUs except for my own (gfx803) as (I think) it was iteratively replacing each one with the following ID during the build. Anyway, it seemed to work and I have passed all the tests with the rocSPARSE build.

PPS. If it is of any use… when I was on Fedora, my clinfo and rocminfo seemed fine but when I attempted to run Geekbench5 compute it failed with something about +fp16 denormals!?

Thanks again and even more sorrier for the length :slight_smile:

That’s odd, my server is also on a GFX803 (RX 480) and I have tested on GFX900 (Vega 56). It looks like it’s a HIP error, which is odd since we don’t usually need HIP except for some basic synchronization of devices for external library usage. What versions of HIP, ROCT, and ROCR are you using? And which Linux kernel?

I personally run Alpine Linux (musl), which people make fun of me for using since musl has poor support generally… but Julia support has improved greatly, and with a small patch to ROCT it’s easy to get it building. Still, Ubuntu, Fedora, Arch, or Gentoo even would be a good choice (and Ubuntu is officially supported by AMD, although I truly hate that distro).

Side note, I just upgraded ROCT and ROCR to ~3.10 and now am getting hangs while creating HSA executables… but 3.7 works, IIRC.

We don’t use FP16 for anything right now (although we will when I get AMDGPU working on Julia 1.6); are you sure it was an error, or just a warning? Right now we have a hack in GPUCompiler.jl which causes some warnings to print when compiling kernels, but they’re generally harmless.

For long logs, I use https://paste.sr.ht

Here is the full error:

Error: Error building AMDGPU:
│ /build/hip-rocclr/src/HIP-rocm-3.10.0/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

│ signal (6): Aborted
│ in expression starting at /home/spike/.julia/packages/AMDGPU/LKloO/deps/build.jl:213
│ gsignal at /usr/bin/…/lib/libc.so.6 (unknown line)
│ abort at /usr/bin/…/lib/libc.so.6 (unknown line)
│ unknown function (ip: 0x7f73be110cb0)
│ unknown function (ip: 0x7f73bdfda910)
│ unknown function (ip: 0x7f73be0062be)
│ unknown function (ip: 0x7f73bdfd9b4f)
│ unknown function (ip: 0x7f73be0a44d8)
│ unknown function (ip: 0x7f73bdfcc9ce)
│ __pthread_once_slow at /usr/bin/…/lib/libpthread.so.0 (unknown line)
│ __hipUnregisterFatBinary at /opt/rocm/hip/lib/libamdhip64.so.3 (unknown line)
│ unknown function (ip: 0x7f73be4aeec1)
│ __cxa_finalize at /usr/bin/…/lib/libc.so.6 (unknown line)
│ unknown function (ip: 0x7f73be369607)
│ call_destructors at /lib64/ld-linux-x86-64.so.2 (unknown line)
│ _dl_catch_exception at /usr/bin/…/lib/libc.so.6 (unknown line)
│ _dl_close_worker at /lib64/ld-linux-x86-64.so.2 (unknown line)
│ _dl_close at /lib64/ld-linux-x86-64.so.2 (unknown line)
│ _dl_catch_exception at /usr/bin/…/lib/libc.so.6 (unknown line)
│ _dl_catch_error at /usr/bin/…/lib/libc.so.6 (unknown line)
│ unknown function (ip: 0x7f7441266b88)
│ dlclose at /usr/bin/…/lib/libdl.so.2 (unknown line)
│ dlclose at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:157 [inlined]
│ find_library at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:200
│ find_library at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:206 [inlined]
│ find_library at /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/Libdl/src/Libdl.jl:206
│ find_roc_library at /home/spike/.julia/packages/AMDGPU/LKloO/deps/build.jl:104 [inlined]
│ main at /home/spike/.julia/packages/AMDGPU/LKloO/deps/build.jl:170
│ unknown function (ip: 0x7f742abe45dd)
│ unknown function (ip: 0x7f7441344735)
│ unknown function (ip: 0x7f74413443be)
│ unknown function (ip: 0x7f7441344f00)
│ unknown function (ip: 0x7f74413459b0)
│ unknown function (ip: 0x7f7441361cf1)
│ unknown function (ip: 0x7f7441337d32)
│ jl_load_rewrite at /usr/bin/…/lib/libjulia.so.1 (unknown line)
│ include at ./client.jl:457
│ unknown function (ip: 0x7f7441344735)
│ unknown function (ip: 0x7f74413443be)
│ unknown function (ip: 0x7f7441344f00)
│ unknown function (ip: 0x7f74413459b0)
│ unknown function (ip: 0x7f7441361cf1)
│ unknown function (ip: 0x7f7441362398)
│ jl_toplevel_eval_in at /usr/bin/…/lib/libjulia.so.1 (unknown line)
│ unknown function (ip: 0x7f74309782d1)
│ unknown function (ip: 0x7f74306cc007)
│ unknown function (ip: 0x7f74306cd97e)
│ unknown function (ip: 0x7f74306cdad5)
│ unknown function (ip: 0x5611f98974fe)
│ unknown function (ip: 0x5611f98970a7)
│ __libc_start_main at /usr/bin/…/lib/libc.so.6 (unknown line)
│ unknown function (ip: 0x5611f989715d)
│ Allocations: 842469 (Pool: 842123; Big: 346); GC: 1
└ @ Pkg.Operations /build/julia/src/julia-1.5.3/usr/share/julia/stdlib/v1.5/Pkg/src/Operations.jl:949

Sorry if it is too long.

As for the versions, ROCT, HIP and ROCR are all 3.10, although miopen-hip is 3.8. My kernel is 5.9.13-arch-1-1. I’ll have a crack at it again using 3.7 and see how that goes. Yeah, I’m pretty new to GNU/Linux, I have migrated from my 15 year-old Mac and built up my own PC. I’m with you, from my little experience in all-things-linux, I really don’t dig Ubuntu. I’d prefer to struggle through stuff like this than go down that route; I might as well have just upgraded my Mac with the new silicon and all that. I really like Fedora but I found it quite glitchy with the ROCm build, whilst Arch was a learning experience to say the least. I almost want to keep it on here just so I never need to go through it again! Anyway, I’ve got a lot to learn before I can start comparing them all from the point of view of the internals and the like, at the moment it is just arbitrary “flavour” that I’m going with. Again, thanks a lot for this.

Right, me again… I have managed to get it built and working (from what I can tell). Long-story-short, I think it was a case of my stupidity; I checked all of my ROCm packages and found that I had rocalution at version 3.7, whilst rocm-dev, rocm-dkms and rocm-libs all at 3.8. I reinstalled them all from AUR (I think my originals were from the arch4edu repository) and I’ll be damned, AMDGPU built up no probs. I’ll try and get Flux built up now (does it work with AMD?) and crack on with the learning. Thanks again for your time and help @jpsamaroo, it’s boss getting this up and running!
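A mismatch like this can be spotted mechanically. Below is a rough sketch, assuming input lines of the form "name version" as printed by `pacman -Q`; the helper name `report_version_mismatch` is made up for illustration, and it only compares the major.minor part of each version against the first line.

```shell
#!/bin/sh
# Sketch: flag packages whose major.minor version differs from the first one.
# ASSUMPTION: report_version_mismatch is a hypothetical helper; input is
# "name version" per line, e.g. "rocalution 3.7.0-1".

# report_version_mismatch
# Reads stdin, prints one line per mismatching package, and returns 1 if
# anything was printed (0 if every package agrees).
report_version_mismatch() {
    awk '
        BEGIN { bad = 0 }
        {
            split($2, parts, ".")            # "3.7.0-1" -> "3", "7", "0-1"
            minor = parts[1] "." parts[2]
            if (NR == 1) {
                expected = minor             # first package sets the baseline
            } else if (minor != expected) {
                printf "%s is %s, expected %s\n", $1, minor, expected
                bad = 1
            }
        }
        END { exit bad }
    '
}

# Example use on Arch, with the package names mentioned in the post:
#   pacman -Q rocm-dev rocm-dkms rocm-libs rocalution | report_version_mismatch
```

Whichever package is listed first acts as the reference, so put the one you trust at the front.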

It looks like, when Libdl tries to open and then close one of the ROCm external libs, they open and close HIP, and HIP does something in its library destructor that isn’t guaranteed to succeed when we haven’t done anything more than dlopen the library… definitely a bug in HIP, IMO. I would recommend we report this upstream.

In the meantime, you can change deps/build.jl line 170 from config[lib] = find_roc_library("lib$name") to config[lib] = nothing, which will disable searching for ROCm external libs.
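If you would rather not edit the file by hand, the same substitution can be sketched with sed. The helper name `disable_roc_lib_search` is hypothetical, and this assumes the line in your copy of deps/build.jl matches the text quoted above exactly; it matches on the text rather than on line 170, so a shifted line number does not matter, but changed wording would make it a no-op.

```shell
#!/bin/sh
# Sketch: apply the workaround quoted above to a copy of deps/build.jl.
# ASSUMPTION: disable_roc_lib_search is a hypothetical helper name.

# disable_roc_lib_search FILE
# Keeps a backup as FILE.bak, then rewrites FILE with the find_roc_library
# call replaced by `nothing`. Lines that do not match are left untouched.
disable_roc_lib_search() {
    file=$1
    cp "$file" "$file.bak" || return 1
    sed 's/config\[lib\] = find_roc_library("lib\$name")/config[lib] = nothing/' \
        "$file.bak" > "$file"
}

# Example use; the package directory hash (LKloO in the log above) may differ
# on your machine, and installed package files are often read-only:
#   chmod u+w ~/.julia/packages/AMDGPU/LKloO/deps/build.jl
#   disable_roc_lib_search ~/.julia/packages/AMDGPU/LKloO/deps/build.jl
#   julia -e 'using Pkg; Pkg.build("AMDGPU")'
```

To undo the change, copy build.jl.bak back over build.jl and rebuild.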

Oh cool! I wrote the above before reading your last post, so that’s great to hear!

We don’t have Flux support right now, unfortunately, because broadcasting and mapreduce aren’t working for the ROCArray. It might not be hard to put back together (I had a Flux PR open for this originally), but I’ll need to get a working GPUArrays.mapreducedim! implementation put together.

Thanks for your help with it all; it had become something I wanted to figure out even if it broke all my other dependencies to get it built up! I wish I knew a little, well, a lot more about it all, as I would have a crack at contributing myself, but it is way over my head. I might start having a look through it all and try and see if I can understand what’s going on, but I’m way off being able to do a thing to help. Anyways, I’m just stoked to have got it built up and working; now to attempt to try and migrate some of my work from R and Stan to Julia for a kickoff is more within my grasp (maybe). Thanks again for your work and help with this.
