Hi!
Bit of an update. I managed to automatically generate the bindings with Clang.jl and built higher-level abstractions around them. Local discovery works as well. I pretty much copied the tests from NCCL.jl and ran them on an HPC cluster with 4 MI250X GPUs per node. Some of the tests fail and I can’t figure out why. The full output is the following:
Info: RCCL info: 2.22.3
sum: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:31
Got exception outside of a @test
NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclGroupEnd
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] groupEnd
@ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
[4] group(f::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:26
[5] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:40 [inlined]
[6] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:32 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
[10] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[11] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[12] include(fname::String)
@ Main ./sysimg.jl:38
[13] top-level scope
@ none:6
caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclAllReduce(sendbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, count::Int64, datatype::RCCL.LibRCCL.ncclDataType_t, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator, stream::Ptr{Nothing})
@ RCCL.LibRCCL ~/.julia/dev/RCCL/src/librccl.jl:42
[3] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator; stream::HIPStream)
@ RCCL ~/.julia/dev/RCCL/src/collective.jl:16
[4] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator)
@ RCCL ~/.julia/dev/RCCL/src/collective.jl:10
[5] (::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})()
@ Main ~/.julia/dev/RCCL/test/runtests.jl:42
[6] group(f::var"#1#7"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:24
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:40 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:32 [inlined]
[10] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[11] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
[12] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[13] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[14] include(fname::String)
@ Main ./sysimg.jl:38
[15] top-level scope
@ none:6
RCCL.avg: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:53
Got exception outside of a @test
NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclGroupEnd
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] groupEnd
@ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
[4] group(f::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:26
[5] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:62 [inlined]
[6] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:54 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
[10] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[11] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[12] include(fname::String)
@ Main ./sysimg.jl:38
[13] top-level scope
@ none:6
caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclAllReduce
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] #Allreduce!#8
@ ~/.julia/dev/RCCL/src/collective.jl:16 [inlined]
[4] Allreduce!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator)
@ RCCL ~/.julia/dev/RCCL/src/collective.jl:10
[5] (::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})()
@ Main ~/.julia/dev/RCCL/test/runtests.jl:64
[6] group(f::var"#2#8"{Vector{HIPDevice}, Vector{Communicator}, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:24
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:62 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:54 [inlined]
[10] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[11] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:28 [inlined]
[12] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[13] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[14] include(fname::String)
@ Main ./sysimg.jl:38
[15] top-level scope
@ none:6
Broadcast!: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:76
Got exception outside of a @test
NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclGroupEnd
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] groupEnd
@ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
[4] group(f::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:26
[5] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:87 [inlined]
[6] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:77 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[10] include(fname::String)
@ Main ./sysimg.jl:38
[11] top-level scope
@ none:6
caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclBroadcast
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] #Broadcast!#10
@ ~/.julia/dev/RCCL/src/collective.jl:46 [inlined]
[4] (::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})()
@ Main ~/.julia/dev/RCCL/test/runtests.jl:89
[5] group(f::var"#3#9"{Int64, Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:24
[6] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:87 [inlined]
[7] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[8] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:77 [inlined]
[9] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[10] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[11] include(fname::String)
@ Main ./sysimg.jl:38
[12] top-level scope
@ none:6
ReduceScatter!: Error During Test at /home/fra/.julia/dev/RCCL/test/runtests.jl:147
Got exception outside of a @test
NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclGroupEnd
@ ~/.julia/dev/RCCL/src/librccl.jl:42 [inlined]
[3] groupEnd
@ ~/.julia/dev/RCCL/src/group.jl:14 [inlined]
[4] group(f::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:26
[5] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:157 [inlined]
[6] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:148 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[10] include(fname::String)
@ Main ./sysimg.jl:38
[11] top-level scope
@ none:6
caused by: NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
Stacktrace:
[1] check
@ ~/.julia/dev/RCCL/src/librccl.jl:27 [inlined]
[2] ncclReduceScatter(sendbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuff::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvcount::Int64, datatype::RCCL.LibRCCL.ncclDataType_t, op::RCCL.LibRCCL.ncclRedOp_t, comm::Communicator, stream::Ptr{Nothing})
@ RCCL.LibRCCL ~/.julia/dev/RCCL/src/librccl.jl:42
[3] ReduceScatter!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator; stream::HIPStream)
@ RCCL ~/.julia/dev/RCCL/src/collective.jl:128
[4] ReduceScatter!(sendbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, recvbuf::ROCArray{Float64, 1, AMDGPU.Runtime.Mem.HIPBuffer}, op::Function, comm::Communicator)
@ RCCL ~/.julia/dev/RCCL/src/collective.jl:122
[5] (::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})()
@ Main ~/.julia/dev/RCCL/test/runtests.jl:159
[6] group(f::var"#6#12"{Vector{ROCArray{Float64, 1}}, Vector{ROCArray{Float64, 1}}, Vector{Communicator}, Vector{HIPDevice}})
@ RCCL ~/.julia/dev/RCCL/src/group.jl:24
[7] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:157 [inlined]
[8] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[9] macro expansion
@ ~/.julia/dev/RCCL/test/runtests.jl:148 [inlined]
[10] macro expansion
@ ~/.julia/juliaup/julia-1.11.6+0.x64.linux.gnu/share/julia/stdlib/v1.11/Test/src/Test.jl:1709 [inlined]
[11] top-level scope
@ ~/.julia/dev/RCCL/test/runtests.jl:13
[12] include(fname::String)
@ Main ./sysimg.jl:38
[13] top-level scope
@ none:6
Test Summary:     | Pass  Error  Total   Time
RCCL.jl           |   11      4     15  16.0s
  Communicator    |    7             7   0.8s
  Allreduce!      |           2      2  13.7s
    sum           |           1      1  13.5s
    RCCL.avg      |           1      1   0.1s
  Broadcast!      |           1      1   0.2s
  Reduce!         |    2             2   0.4s
  Allgather!      |    2             2   0.3s
  ReduceScatter!  |           1      1   0.4s
ERROR: LoadError: Some tests did not pass: 11 passed, 0 failed, 4 errored, 0 broken.
in expression starting at /home/fra/.julia/dev/RCCL/test/runtests.jl:11
error in running finalizer: ErrorException("Trying to free active handle that is not managed by cache.\n- Key: HIPContext(ptr=0x000000002b75fe60)\n- Handle: Ptr{AMDGPU.rocSPARSE._rocsparse_handle} @0x00000000441200e0\n")
error at ./error.jl:35
macro expansion at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:58 [inlined]
macro expansion at ./lock.jl:273 [inlined]
push! at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:57
#6 at /home/fra/.julia/packages/AMDGPU/JCgBw/src/cache.jl:118
unknown function (ip: 0x7fb8270c8342)
run_finalizer at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:303
jl_gc_run_finalizers_in_list at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:393
run_finalizers at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/gc.c:439
ijl_atexit_hook at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/init.c:299
jl_repl_entrypoint at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/src/jlapi.c:1060
main at /cache/build/tester-amdci4-12/julialang/julia-release-1-dot-11/cli/loader_exe.c:58
unknown function (ip: 0x7fb883a27b0a)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
ERROR: Package RCCL errored during testing
I shared the code in this GitHub repository: ffrancesco94/RCCL.jl, a Julia wrapper for the Radeon collective communication library (RCCL). A couple of things that might be worth mentioning:
- I tried the pure-C tests from ROCm/rccl-tests (RCCL Performance Benchmark Tests, on GitHub) and they seem to work
- If I create a small reproducer of e.g. the `Broadcast!` test and run it with `NCCL_DEBUG=INFO`, I get the classic `NCCL WARN Cuda failure 'invalid argument'` error, specifically in the `enqueue.cc` file
- I find it quite strange that the finalizer fails with something related to `rocSPARSE`, but it might be that when it tries to close the context on the GPU it just happens to fail there…
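For anyone who wants to repeat the debug run mentioned above, here is a minimal sketch of one way to launch a reproducer with `NCCL_DEBUG=INFO` set. Only the variable and its value come from my runs; the helper names and the script name `repro.jl` are placeholders I made up for illustration, and it assumes the reproducer lives in the project found from the current directory.

```julia
# Hypothetical helper names; only NCCL_DEBUG=INFO comes from the post above.

# Environment overrides that switch on RCCL/NCCL debug logging.
debug_env(level::AbstractString = "INFO") = ["NCCL_DEBUG" => level]

# Command that runs `script` in a fresh Julia process, so the library
# reads the variable when it initialises.
reproducer_cmd(script::AbstractString) = `$(Base.julia_cmd()) --project $script`

# Run the reproducer with the variable set only for that child process.
function run_with_debug(script::AbstractString; level::AbstractString = "INFO")
    withenv(debug_env(level)...) do
        run(reproducer_cmd(script))
    end
end

# Usage (the script name is a placeholder):
# run_with_debug("repro.jl")
```

Starting a fresh process keeps the variable out of the current session and makes sure it is already set before the library is loaded.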
I’d be happy with any feedback if you spot anything strange, or if someone who was involved in wrapping NCCL could chime in. It looked like some sort of internal RCCL error, but the fact that the pure-C AMD tests worked makes me a bit suspicious. Thank you in advance for any feedback and I hope we can ship this!