RCCL wrapping

Hi!
I’m trying to wrap the Radeon collective communication library (RCCL), the ROCm counterpart to NCCL. I’m taking heavy inspiration from NCCL.jl for the wrapping strategy, since RCCL itself closely mimics NCCL. I do have questions about the artifacts, though: NCCL.jl depends on NCCL_jll.jl, and I can see that CUDA.jl also lets you choose between downloading artifacts or using the local toolkit. However, ROCm usually comes with RCCL bundled, and AMDGPU.jl basically only uses local (non-JLL) libraries. There is an RCCL repo on GitHub, though. I guess my questions are:

  • Should I go the JLL route, i.e. write a build script for Yggdrasil that builds RCCL from their GitHub repo and wrap that, or follow the AMDGPU.jl route and do pure local discovery?
  • If the latter, can someone explain to me how exactly the AMDGPU.jl discovery process and wrapping work? My understanding is that the discovery.jl script has functions to locate all the libraries (rocBLAS etc.), and that when the module is loaded, the code inside __init__() exports the paths as global variables called librocblas, librocfft and so on. But how do the various librocFFT.jl, librocBLAS.jl etc. know how to call those particular .so files? I don’t see any Libdl.dlopen() calls anywhere.

Sorry if these questions are basic; this is definitely harder than anything I’ve done in Julia so far, and I’m hoping to learn a lot while doing it!

Welcome! This would definitely be nice to have, thanks for taking the initiative!

This will probably be a rather large undertaking. I looked into building ROCm on Yggdrasil before, but they have a very specific build setup and AFAIK, upstream doesn’t support cross-compilation. I’d go with local discovery for now.

Yes, that’s correct.
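
For illustration, the general pattern looks roughly like this (a sketch only; the function name, search path, and librccl global here are assumptions, and AMDGPU.jl’s real discovery.jl also handles ROCm path detection, versioned sonames, and so on):

```julia
using Libdl

# Sketch: look up a ROCm library by name in a typical install location.
# Libdl.find_library returns "" when nothing matches.
function find_rocm_library(name::AbstractString)
    return Libdl.find_library([name], ["/opt/rocm/lib"])
end

# Typically done from __init__(), storing the path in a global that the
# generated wrappers then reference:
librccl = find_rocm_library("librccl")
```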

They are simply passed directly to the ccall, for example here. You don’t usually need to call dlopen yourself.
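
Concretely, with a discovered path like librccl in a global, a wrapper boils down to something like this (a sketch; ncclGetVersion is a real NCCL/RCCL entry point, and its ncclResult_t return type is abbreviated to Cint here):

```julia
# `librccl` is the library path String found at load time. @ccall dlopens the
# library and resolves the symbol lazily on first use, so the wrapper itself
# never calls Libdl.dlopen explicitly.
function version()
    ver = Ref{Cint}(0)
    @ccall librccl.ncclGetVersion(ver::Ptr{Cint})::Cint  # returns ncclResult_t
    return ver[]
end
```
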
Note that these wrappers are automatically generated using Clang.jl via the scripts located in AMDGPU.jl/gen at master · JuliaGPU/AMDGPU.jl · GitHub, so I would recommend a similar approach for RCCL.
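
For reference, such a generator script is usually only a few lines (a sketch; the header location and a generator.toml holding module/output options are assumptions):

```julia
using Clang.Generators

include_dir = "/opt/rocm/include"  # assumed ROCm header location
options = load_options(joinpath(@__DIR__, "generator.toml"))

args = get_default_args()
push!(args, "-I$include_dir")

# rccl.h sits under include/rccl/ in recent ROCm releases.
headers = [joinpath(include_dir, "rccl", "rccl.h")]

ctx = create_context(headers, args, options)
build!(ctx)
```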

Yes, I can imagine. What I was referring to was this repo, which seems to contain only RCCL. But it’s not clear how decoupled it actually is from the rest of ROCm, so it makes a lot of sense to go with local discovery.

Ah, I see: so when I prefix the @ccall invocation with e.g. librocfft, it is the macro that takes care of “opening” that path and finding the function inside it. Thanks, this was the step I was missing.

Yes, I am using Clang.jl, and I tried to get something as close to NCCL.jl as possible, since even the function names are the same. Thank you very much for your pointers. I’ll keep you posted on my progress!

Hi!
Bit of an update: I managed to automatically generate the bindings with Clang.jl and build higher-level abstractions around them. Local discovery works as well. I pretty much copied the tests from NCCL.jl and ran them on an HPC cluster with 4 MI250X GPUs per node. Some of the tests fail and I can’t figure out why. The full output is the following:

 Info: RCCL info: 2.22.3
sum: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:31
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclGroupEnd
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] groupEnd
      @ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
    [4] group(f::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:26
    [5] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:40 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:32 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
   [10] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [11] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [12] include(fname::String)
      @ Main ./sysimg.jl:38
   [13] top-level scope
      @ none:6
  
  caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclAllReduce(sendbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, count::Int64, datatype::RCCL.LibRCCL.ncclDataType_t, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator, stream::Ptr{Nothing})
      @ RCCL.LibRCCL ~/.julia/dev/RCCL/src/librccl.jl:42
    [3] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator; stream::HIPStream)
      @ RCCL ~/.julia/dev/RCCL/src/collective.jl:16
    [4] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator)
      @ RCCL ~/.julia/dev/RCCL/src/collective.jl:10
    [5] (::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})()
      @ Main ~/.julia/dev/RCCL/test/runtests.jl:42
    [6] group(f::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:24
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:40 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:32 [inlined]
   [10] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [11] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [13] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [14] include(fname::String)
      @ Main ./sysimg.jl:38
   [15] top-level scope
      @ none:6
RCCL.avg: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:53
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclGroupEnd
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] groupEnd
      @ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
    [4] group(f::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:26
    [5] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:62 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:54 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
   [10] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [11] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [12] include(fname::String)
      @ Main ./sysimg.jl:38
   [13] top-level scope
      @ none:6
  
  caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclAllReduce
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] #Allreduce!#8
      @ ~/.julia/dev/RCCL/src/collective.jl:16 [inlined]
    [4] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator)
      @ RCCL ~/.julia/dev/RCCL/src/collective.jl:10
    [5] (::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})()
      @ Main ~/.julia/dev/RCCL/test/runtests.jl:64
    [6] group(f::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:24
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:62 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:54 [inlined]
   [10] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [11] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
   [12] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [13] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [14] include(fname::String)
      @ Main ./sysimg.jl:38
   [15] top-level scope
      @ none:6
Broadcast!: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:76
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclGroupEnd
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] groupEnd
      @ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
    [4] group(f::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:26
    [5] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:87 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:77 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [10] include(fname::String)
      @ Main ./sysimg.jl:38
   [11] top-level scope
      @ none:6
  
  caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclBroadcast
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] #Broadcast!#10
      @ ~/.julia/dev/RCCL/src/collective.jl:46 [inlined]
    [4] (::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})()
      @ Main ~/.julia/dev/RCCL/test/runtests.jl:89
    [5] group(f::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:24
    [6] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:87 [inlined]
    [7] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [8] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:77 [inlined]
    [9] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [10] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [11] include(fname::String)
      @ Main ./sysimg.jl:38
   [12] top-level scope
      @ none:6
ReduceScatter!: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:147
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclGroupEnd
      @ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
    [3] groupEnd
      @ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
    [4] group(f::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:26
    [5] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:157 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:148 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [10] include(fname::String)
      @ Main ./sysimg.jl:38
   [11] top-level scope
      @ none:6
  
  caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
    [2] ncclReduceScatter(sendbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvcount::Int64, datatype::RCCL.LibRCCL.ncclDataType_t, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator, stream::Ptr{Nothing})
      @ RCCL.LibRCCL ~/.julia/dev/RCCL/src/librccl.jl:42
    [3] ReduceScatter!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator; stream::HIPStream)
      @ RCCL ~/.julia/dev/RCCL/src/collective.jl:128
    [4] ReduceScatter!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator)
      @ RCCL ~/.julia/dev/RCCL/src/collective.jl:122
    [5] (::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})()
      @ Main ~/.julia/dev/RCCL/test/runtests.jl:159
    [6] group(f::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
      @ RCCL ~/.julia/dev/RCCL/src/group.jl:24
    [7] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:157 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
    [9] macro expansion
      @ ~/.julia/dev/RCCL/test/runtests.jl:148 [inlined]
   [10] macro expansion
      @ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
   [11] top-level scope
      @ ~/.julia/dev/RCCL/test/runtests.jl:13
   [12] include(fname::String)
      @ Main ./sysimg.jl:38
   [13] top-level scope
      @ none:6
Test Summary:    | Pass  Error  Total   Time
RCCL.jl          |   11      4     15  16.0s
  Communicator   |    7             7   0.8s
  Allreduce!     |           2      2  13.7s
    sum          |           1      1  13.5s
    RCCL.avg     |           1      1   0.1s
  Broadcast!     |           1      1   0.2s
  Reduce!        |    2             2   0.4s
  Allgather!     |    2             2   0.3s
  ReduceScatter! |           1      1   0.4s
ERROR: LoadError: Some tests did not pass: 11 passed, 0 failed, 4 errored, 0 broken.
in expression starting at /home/fra/.julia/dev/RCCL/test/runtests.jl:11
error in running finalizer: ErrorException("Trying to free active handle that is not managed by cache.\n- Key: HIPContext(ptr=0x000000002b75fe60)\n- Handle: Ptr{AMDGPU.rocSPARSE._rocsparse_handle} @0x00000000441200e0\n")
error at ./error.jl:35
macro expansion at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:58 [inlined]
macro expansion at ./lock.jl:273 [inlined]
push! at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:57
#6 at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:118
unknown function (ip: 0x7fb8270c8342)
run_finalizer at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:303
jl_gc_run_finalizers_in_list at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:393
run_finalizers at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:439
ijl_atexit_hook at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/init.c:299
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7fb883a27b0a)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
ERROR: Package RCCL errored during testing

I shared the code in this repository: GitHub - ffrancesco94/RCCL.jl: Julia wrapper for the Radeon collective communication library (RCCL). A couple of things that might be worth mentioning:

  1. I tried the pure-C tests from GitHub - ROCm/rccl-tests: RCCL Performance Benchmark Tests, and they seem to work.
  2. If I create a small reproducer of e.g. the Broadcast! test (roughly the sketch after this list) and run it with NCCL_DEBUG=INFO, I get the classic NCCL WARN Cuda failure 'invalid argument' error, specifically in the enqueue.cc file.
  3. I find it quite strange that the finalizer fails with something related to rocSPARSE, but it might be that when it tries to close the context on the GPU, it just happens to fail there…

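The reproducer from point 2 above looks roughly like this (a sketch assuming RCCL.jl mirrors NCCL.jl’s API, as intended; Communicators, group, and the Broadcast! signature follow that convention and are not verbatim from the repo):

```julia
# Run with: NCCL_DEBUG=INFO julia --project repro.jl
using AMDGPU, RCCL

devs  = AMDGPU.devices()
comms = RCCL.Communicators(devs)   # one Communicator per local GPU
N, root = 1024, 0

# One buffer pair per device, allocated on that device.
send = map(devs) do d
    AMDGPU.device!(d)
    AMDGPU.ones(Float64, N)
end
recv = map(devs) do d
    AMDGPU.device!(d)
    AMDGPU.zeros(Float64, N)
end

# Group the per-device calls so RCCL can launch them together.
RCCL.group() do
    for (d, comm, s, r) in zip(devs, comms, send, recv)
        AMDGPU.device!(d)
        RCCL.Broadcast!(s, r, comm; root)
    end
end
```
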
I’d be happy to get any feedback if you spot anything strange, or if someone who was involved in wrapping NCCL could chime in. It looks like some sort of internal RCCL error, but the fact that the pure-C AMD tests work makes me suspect the problem is on my side. Thank you in advance for any feedback, and I hope we can ship this!

More updates: it works!!!
Now it passes the same test suite as NCCL.jl, so I guess it is somewhat usable downstream. The issue was with how I was passing the HIPStream to the C side of things (see the sketch below); I’ll write something about this in the repo. Next I will polish it a bit and write some documentation, but I’m quite happy with how it turned out, and it was a very good learning experience. The code is available at GitHub - ffrancesco94/RCCL.jl: Julia wrapper for the Radeon collective communication library (RCCL). Should I announce it somewhere and/or add it to the General package registry (is there some kind of review process?)? Sorry for the somewhat basic questions, but it’s my first time contributing.
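
For the curious, the bug class was roughly the following (a sketch, not verbatim code from the repo: the Allreduce! shape follows the stacktraces above, and stream.stream as the raw handle field is an assumption about AMDGPU.jl’s HIPStream internals):

```julia
using AMDGPU, RCCL

# The generated bindings (RCCL.LibRCCL in the stacktraces above) take a raw
# hipStream_t, i.e. a plain Ptr{Cvoid}, as their last argument. Passing the
# HIPStream wrapper itself (or a pointer to the Julia object) hands HIP a
# bogus handle, which surfaces as ncclUnhandledCudaError. Corrected shape:
function Allreduce!(sendbuf, recvbuf, op, comm; stream::HIPStream = AMDGPU.stream())
    RCCL.LibRCCL.ncclAllReduce(sendbuf, recvbuf, length(sendbuf),
                               RCCL.LibRCCL.ncclFloat64,  # assuming Float64 buffers
                               op, comm,
                               stream.stream)  # unwrap to the raw hipStream_t
    return recvbuf
end
```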
