Allocating different arrays on multiple GPUs

I’ve read the docs on unified memory, but I think that’s overkill for my use case. Is it really not possible to simply switch devices, define another CuArray, and then launch a kernel on it? I get ERROR_INVALID_VALUE:

julia> using CUDA

julia> ndevices()
4

julia> device()
CuDevice(0): GeForce GTX 1080 Ti

julia> A0=CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 1
 2
 3

julia> device!(1)
CuDevice(1): GeForce GTX 1080 Ti

julia> A1=CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
Error showing value of type CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/error.jl:105
  [2] macro expansion
    @ ~/.julia/dev/CUDA/lib/cudadrv/error.jl:115 [inlined]
  [3] cuMemcpyDtoHAsync_v2(dstHost::Ptr{Int64}, srcDevice::CuPtr{Int64}, ByteCount::Int64, hStream::CuStream)
    @ CUDA ~/.julia/dev/CUDA/lib/utils/call.jl:26
  [4] #unsafe_copyto!#8
    @ ~/.julia/dev/CUDA/lib/cudadrv/memory.jl:394 [inlined]
  [5] unsafe_copyto!(dest::Vector{Int64}, doffs::Int64, src::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/dev/CUDA/src/array.jl:392
  [6] copyto!
    @ ~/.julia/dev/CUDA/src/array.jl:349 [inlined]
  [7] copyto!
    @ ~/.julia/dev/CUDA/src/array.jl:353 [inlined]
  [8] copyto_axcheck!
    @ ./abstractarray.jl:1104 [inlined]
  [9] Vector{Int64}(x::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
    @ Base ./array.jl:563
 [10] Array
    @ ./boot.jl:481 [inlined]
 [11] convert
    @ ./array.jl:554 [inlined]
 [12] adapt_storage
    @ ~/.julia/dev/GPUArrays/src/host/abstractarray.jl:45 [inlined]
 [13] adapt_structure
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:42 [inlined]
 [14] adapt
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [15] print_array
    @ ~/.julia/dev/GPUArrays/src/host/abstractarray.jl:48 [inlined]
 [16] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, X::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
    @ Base ./arrayshow.jl:396
 [17] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:264
 [18] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:508
 [19] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:257
 [20] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:269
 [21] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [22] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [23] invokelatest
    @ ./essentials.jl:714 [inlined]
 [24] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:291
 [25] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:275
 [26] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:508
 [27] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:273
 [28] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:844
 [29] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [30] invokelatest
    @ ./essentials.jl:714 [inlined]
 [31] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/LineEdit.jl:2493
 [32] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:1230
 [33] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:411

I think this is just a printing bug, i.e. if you suppress the output it works: `A1 = CuArray([1,2,3]);`. I just checked and it’s present on master, so you could consider filing an issue if there’s not one already.
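To illustrate what “suppress the output” means here, a minimal sketch (assuming the standard CUDA.jl `device!`/`CuArray` API; only REPL display of the array triggers the error in the trace above):

```julia
using CUDA

device!(1)                  # switch to the second GPU
A1 = CuArray([1, 2, 3]);    # trailing semicolon suppresses REPL display, avoiding the error
sin.(A1);                   # broadcasting / kernel launches on the new device still work
```

The failure in the stack trace happens inside `show`, when the REPL copies the array back to the host for printing, not in the allocation itself.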

Note that unified memory is actually pretty easy to use, e.g.:

julia> using CUDA

julia> device!(0)
CuDevice(0): A100-SXM4-40GB

julia> A = cu([1,2,3], unified=true)
3-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1
 2
 3

julia> device!(1)
CuDevice(1): A100-SXM4-40GB

julia> sin.(A); # can broadcast on GPU1 even though array created on GPU0

(although the same printing error seems to be there)

Wow, where’d you find `cu`? Is that a stable API? It’s much more succinct than the documented example of `Mem.alloc(Mem.Unified...)`; `unsafe_wrap()`; `finalizer() do Mem.free(); end`; `copyto!()`.
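For comparison, the more laborious manual approach alluded to above looks roughly like this (a sketch from memory of the docs’ pattern; exact signatures may differ across CUDA.jl versions, so treat the details as assumptions):

```julia
using CUDA

# Manually allocate a unified-memory buffer large enough for three Ints...
buf = Mem.alloc(Mem.Unified, sizeof(Int) * 3)

# ...wrap it as a CuArray without copying...
a = unsafe_wrap(CuArray{Int,1}, convert(CuPtr{Int}, buf), 3)

# ...arrange for the buffer to be freed when the array is collected...
finalizer(a) do _
    Mem.free(buf)
end

# ...and finally fill it with data.
copyto!(a, [1, 2, 3])
```

`cu([1,2,3], unified=true)` collapses all of this into one call.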

It’s not just a printing bug: CuArrays allocated in device memory are implicitly tied to the current device, see Multiple GPUs · CUDA.jl. I plan to make that explicit in the future so that you get a nice error, but for now you have to manage it manually.
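A minimal sketch of that manual management (assuming the standard `device!`/`ndevices` API; the `arrays` dictionary is just an illustrative pattern, not a library feature): track which device owns each array, and switch back to that device before touching it.

```julia
using CUDA

# Allocate one array per device, remembering which device owns which array.
arrays = Dict{Int,CuArray}()
for dev in 0:ndevices()-1
    device!(dev)
    arrays[dev] = CuArray([1, 2, 3])
end

# Before operating on an array, switch back to the device that owns it.
for (dev, A) in arrays
    device!(dev)
    A .*= 2    # safe: the current device matches the array's device
end
```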

Are you sure you read the MWE above right? It switches devices and then allocates a new array, and printing that array errors before switching to any other device. Perhaps I’m misunderstanding something, but that seems like it should work (and I thought it in fact did in previous versions).


Oops, yeah, I didn’t read that carefully. That error is totally unexpected, but maybe it’s because we used to synchronize the task-local stream instead of the entire device. @bjarthur could you try the master branch, including Native device synchronization by maleadt · Pull Request #1147 · JuliaGPU/CUDA.jl · GitHub?

Sorry, I should’ve mentioned that I already tried it on master, yesterday, and with Julia 1.7-rc1. Same error.

What about `cu`? Is that a stable API for unified memory? The docs give a much more laborious example.

[EDIT: same error with Julia 1.6 and CUDA 3.4.2, so I don’t think it’s due to any recent changes]

@maleadt should I file an issue on CUDA.jl about switching devices and not being able to allocate a new CuArray, as shown above?

Yes, but please add as much relevant information as possible, because I can’t reproduce this issue.

The `cu` unified memory interface is experimental; I’m not sure about the API yet, which is why it isn’t documented.