Allocating different arrays on multiple GPUs

I’ve read the docs on unified memory, but I think that’s overkill for my use case. Is it really not possible to simply switch devices, define another CuArray, and then launch a kernel on it? I get ERROR_INVALID_VALUE:

julia> using CUDA

julia> ndevices()
4

julia> device()
CuDevice(0): GeForce GTX 1080 Ti

julia> A0=CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 1
 2
 3

julia> device!(1)
CuDevice(1): GeForce GTX 1080 Ti

julia> A1=CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
Error showing value of type CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/dev/CUDA/lib/cudadrv/error.jl:105
  [2] macro expansion
    @ ~/.julia/dev/CUDA/lib/cudadrv/error.jl:115 [inlined]
  [3] cuMemcpyDtoHAsync_v2(dstHost::Ptr{Int64}, srcDevice::CuPtr{Int64}, ByteCount::Int64, hStream::CuStream)
    @ CUDA ~/.julia/dev/CUDA/lib/utils/call.jl:26
  [4] #unsafe_copyto!#8
    @ ~/.julia/dev/CUDA/lib/cudadrv/memory.jl:394 [inlined]
  [5] unsafe_copyto!(dest::Vector{Int64}, doffs::Int64, src::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/dev/CUDA/src/array.jl:392
  [6] copyto!
    @ ~/.julia/dev/CUDA/src/array.jl:349 [inlined]
  [7] copyto!
    @ ~/.julia/dev/CUDA/src/array.jl:353 [inlined]
  [8] copyto_axcheck!
    @ ./abstractarray.jl:1104 [inlined]
  [9] Vector{Int64}(x::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
    @ Base ./array.jl:563
 [10] Array
    @ ./boot.jl:481 [inlined]
 [11] convert
    @ ./array.jl:554 [inlined]
 [12] adapt_storage
    @ ~/.julia/dev/GPUArrays/src/host/abstractarray.jl:45 [inlined]
 [13] adapt_structure
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:42 [inlined]
 [14] adapt
    @ ~/.julia/packages/Adapt/RGNRk/src/Adapt.jl:40 [inlined]
 [15] print_array
    @ ~/.julia/dev/GPUArrays/src/host/abstractarray.jl:48 [inlined]
 [16] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, X::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer})
    @ Base ./arrayshow.jl:396
 [17] (::REPL.var"#43#44"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:264
 [18] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:508
 [19] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:257
 [20] display(d::REPL.REPLDisplay, x::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:269
 [21] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:328
 [22] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [23] invokelatest
    @ ./essentials.jl:714 [inlined]
 [24] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:291
 [25] (::REPL.var"#45#46"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:275
 [26] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:508
 [27] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:273
 [28] (::REPL.var"#do_respond#66"{Bool, Bool, REPL.var"#77#87"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:844
 [29] #invokelatest#2
    @ ./essentials.jl:716 [inlined]
 [30] invokelatest
    @ ./essentials.jl:714 [inlined]
 [31] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/LineEdit.jl:2493
 [32] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/REPL/src/REPL.jl:1230
 [33] (::REPL.var"#49#54"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL ./task.jl:411

I think this is just a printing bug, i.e. if you suppress the output it works: `A1 = CuArray([1,2,3]);`. I just checked and it’s present on master, so you could consider filing an issue if there’s not one already.
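To illustrate what “suppress the output” means here, a minimal sketch (assuming the standard CUDA.jl `device!`/`CuArray` API; only REPL display of the array triggers the error in the trace above):

```julia
using CUDA

device!(1)                  # switch to the second GPU
A1 = CuArray([1, 2, 3]);    # trailing semicolon suppresses REPL display, avoiding the error
sin.(A1);                   # broadcasting / kernel launches on the new device still work
```

The failure in the stack trace happens inside `show`, when the REPL copies the array back to the host for printing, not in the allocation itself.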

Note that unified memory is actually pretty easy to use, e.g.:

julia> using CUDA

julia> device!(0)
CuDevice(0): A100-SXM4-40GB

julia> A = cu([1,2,3], unified=true)
3-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1
 2
 3

julia> device!(1)
CuDevice(1): A100-SXM4-40GB

julia> sin.(A); # can broadcast on GPU1 even though array created on GPU0

(although the same printing error seems to be there)

Wow, where’d you find `cu`? Is that a stable API? It’s much more succinct than the documented example of `Mem.alloc(Mem.Unified...)`; `unsafe_wrap()`; `finalizer() do Mem.free(); end`; `copyto!()`.
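For comparison, the more laborious manual approach alluded to above looks roughly like this (a sketch from memory of the docs’ pattern; exact signatures may differ across CUDA.jl versions, so treat the details as assumptions):

```julia
using CUDA

# Manually allocate a unified-memory buffer large enough for three Ints...
buf = Mem.alloc(Mem.Unified, sizeof(Int) * 3)

# ...wrap it as a CuArray without copying...
a = unsafe_wrap(CuArray{Int,1}, convert(CuPtr{Int}, buf), 3)

# ...arrange for the buffer to be freed when the array is collected...
finalizer(a) do _
    Mem.free(buf)
end

# ...and finally fill it with data.
copyto!(a, [1, 2, 3])
```

`cu([1,2,3], unified=true)` collapses all of this into one call.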

It’s not just a printing bug: CuArrays allocated in device memory are implicitly tied to the current device, see Multiple GPUs · CUDA.jl. I plan to make that explicit in the future so that you get a nice error, but for now you have to manage it manually.
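A minimal sketch of that manual management (assuming the standard `device!`/`ndevices` API; the `arrays` dictionary is just an illustrative pattern, not a library feature): track which device owns each array, and switch back to that device before touching it.

```julia
using CUDA

# Allocate one array per device, remembering which device owns which array.
arrays = Dict{Int,CuArray}()
for dev in 0:ndevices()-1
    device!(dev)
    arrays[dev] = CuArray([1, 2, 3])
end

# Before operating on an array, switch back to the device that owns it.
for (dev, A) in arrays
    device!(dev)
    A .*= 2    # safe: the current device matches the array's device
end
```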

Are you sure you read the MWE above right? It switches devices and then allocates a new array, and printing that array errors before switching to any other device. Perhaps I’m misunderstanding something, but that seems like it should work (and I thought it in fact did in previous versions).


Oops, yeah, I didn’t read that carefully. That error is totally unexpected, but maybe it’s because we used to synchronize the task-local stream instead of the entire device. @bjarthur could you try the master branch, including Native device synchronization by maleadt · Pull Request #1147 · JuliaGPU/CUDA.jl · GitHub?

Sorry, I should’ve mentioned that I already tried it on master, yesterday, and with Julia 1.7-rc1. Same error.

What about `cu`? Is that a stable API for unified memory? The docs give a much more laborious example.

[EDIT: same error with Julia 1.6 and CUDA 3.4.2, so I don’t think it’s due to any recent changes]

@maleadt should I file an issue on CUDA.jl about switching devices and not being able to allocate a new CuArray, as shown above?

Yes, but please add as much relevant information as possible, because I can’t reproduce this issue.

The `cu` unified memory interface is experimental; I’m not sure about the API yet, which is why it isn’t documented.