Distributing an LLM over multiple GPUs

Hi all,

I am experimenting with gradient-based optimization of prompts for large language models using the excellent Transformers.jl library. I know that most people would say I should switch to PyTorch (or JAX), and they might be right, but I like using these things to probe the boundaries of Julia. So far, I have been playing with the “tiny” GPT-2 model, which easily fits on a single GPU (I have big GPUs). Now I would like to use a bigger model, GPT-J with 6B parameters, and when I ask for the gradient, I run out of memory. So I want to try to distribute the model over multiple GPUs (for fun).
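For context, the prompt-optimization loop is roughly the following minimal sketch. The `loss` here is only a placeholder standing in for running the frozen language model on the prompt embeddings; the real code uses Transformers.jl, which I omit:

using Zygote

# Placeholder objective: in reality this runs the frozen LM on the prompt
# embeddings and scores the generated continuation.
loss(prompt) = sum(abs2, prompt)

prompt = randn(Float32, 768, 10)        # 10 soft-prompt vectors (GPT-2 width)

for step in 1:100
    g, = Zygote.gradient(loss, prompt)  # gradient w.r.t. the prompt only
    prompt .-= 0.01f0 .* g              # plain gradient-descent step
end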

My original idea was to use Dagger.jl, but I do not really know how to set up proper scopes around the parts of the model. So I decided to try things manually, as follows:

using Transformers
using Transformers.HuggingFace
using Flux
using CUDA


model = cpu(hgf"gpt2:lmheadmodel")
decoder = model.model.decoder

device!(0) 
m₁ = gpu(decoder.layers[1][1:6])
device!(1) 
m₂ = gpu(decoder.layers[1][7:12])

which seems reasonable to me: the first six layers live on the first GPU and the remaining six on the second (for GPT-2 this is really not needed, but it is good for experimenting). The issue is that the second call always errors with

CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)

and I am completely clueless as to why this happens. Therefore I am asking @maleadt or @jpsamaroo for help.

Just for the sake of discussion, I then wanted to add a layer that forwards hidden representations between GPUs, as follows:

using ChainRulesCore

# Move an array to another GPU by staging it through host memory.
function move_between_gpu(x::AbstractArray, from, to)
    cx = cpu(x)
    device!(to)
    gpu(cx)
end

# Custom rrule: the pullback moves the cotangent back in the opposite direction.
function ChainRulesCore.rrule(::typeof(move_between_gpu), x::AbstractArray, from, to)
    y = move_between_gpu(x, from, to)
    function move_between_gpu_pullback(ȳ)
        return (NoTangent(), move_between_gpu(ȳ, to, from), NoTangent(), NoTangent())
    end
    return (y, move_between_gpu_pullback)
end

struct MoveGPU
    from::Int
    to::Int
end

(p::MoveGPU)(x) = move_between_gpu(x, p.from, p.to)

This is naive, as I should use direct GPU-to-GPU communication, but I consider it a good first try.
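To show what I have in mind, here is a toy wiring of two layers that live on different devices (assuming the MoveGPU definitions above; the real halves m₁ and m₂ would take the place of the Dense layers):

using Flux, CUDA

device!(0); d₁ = gpu(Dense(768 => 768))   # lives on GPU 0
device!(1); d₂ = gpu(Dense(768 => 768))   # lives on GPU 1

pipe = Chain(d₁, MoveGPU(0, 1), d₂)

device!(0)
x = CUDA.rand(Float32, 768, 4)            # dummy batch on GPU 0
y = pipe(x)                               # hidden state hops to GPU 1

# gradients flow back through the custom rrule
gs = gradient(m -> sum(m(x)), pipe)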

All comments are very much appreciated.

I forgot to include my environment.

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, icelake-server)
  Threads: 1 on 64 virtual cores


Status `/var/tmp/Project.toml`
  [052768ef] CUDA v4.4.0
⌅ [587475ba] Flux v0.13.17
  [21ca0261] Transformers v0.2.6
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

and I use

I’m not really familiar with Flux or Transformers, but this would likely be caused by accessing data that’s located on device 0 after having switched to device 1. You could work around this by using unified memory, which can be accessed from another device, but that’s only a workaround.
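A minimal sketch of that workaround, assuming a CUDA.jl version where you can request a unified buffer via the CuArray type parameter (the exact spelling may differ between versions):

using CUDA

device!(0)
a = CuArray{Float32,2,CUDA.Mem.UnifiedBuffer}(undef, 4, 4)  # unified (managed) memory
a .= 1f0

device!(1)
sum(a)   # still accessible after switching devices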

Hi Tim,

thanks for the answer. This is what I suspected, but I cannot figure out how it can happen in the code above, since it should “divide” the model into two independent parts. I will keep searching.

Thanks
Tomas

Hey Tomas, do you have a stack trace for that illegal memory access error? I agree it seems like it should just work.

While generating a fresh stack trace, I realized that the problem is in printing (show). When I add semicolons, as in

using Transformers
using Transformers.HuggingFace
using Flux
using CUDA


model = cpu(hgf"gpt2:lmheadmodel")
decoder = model.model.decoder

device!(0) 
m₁ = gpu(decoder.layers[1][1:6]);
device!(1) 
m₂ = gpu(decoder.layers[1][7:12]);

it runs without issues. Nice. I do not know whether I should mark this as solved, because the bug is still there, just somewhere else.

The stacktrace:

julia> m₂ = gpu(decoder.layers[1][7:12])
Transformer<6>(
  PreNormTransformerBlock(
    DropoutLayer<nothing>(
      SelfAttention(
        CausalMultiheadQKVAttenOp(head = 12, p = nothing),
        NSplit<3>(Dense(W = (768, 2304), b = true)),  # 1_771_776 parameters
Error showing value of type Transformer{NTuple{6, Transformers.Layers.PreNormTransformerBlock{Transformers.Layers.DropoutLayer{Transformers.Layers.SelfAttention{NeuralAttentionlib.CausalMultiheadQKVAttenOp{Nothing}, Transformers.Layers.NSplit{Static.StaticInt{3}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}, Transformers.Layers.DropoutLayer{Transformers.Layers.Chain{Tuple{Transformers.Layers.Dense{typeof(gelu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}}}, Nothing}:
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/libcuda.jl:27
  [2] check
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/libcuda.jl:34 [inlined]
  [3] cuMemcpyDtoHAsync_v2
    @ ~/.julia/packages/CUDA/tVtYo/lib/utils/call.jl:26 [inlined]
  [4] #unsafe_copyto!#8
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/memory.jl:397 [inlined]
  [5] (::CUDA.var"#1014#1015"{Bool, Vector{Bool}, Int64, CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/array.jl:482
  [6] #context!#887
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
  [7] context!
    @ ~/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
  [8] unsafe_copyto!(dest::Vector{Bool}, doffs::Int64, src::CuArray{Bool, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/tVtYo/src/array.jl:475
  [9] copyto!
    @ ~/.julia/packages/CUDA/tVtYo/src/array.jl:429 [inlined]
 [10] getindex
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:12 [inlined]
 [11] macro expansion
    @ ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:136 [inlined]
 [12] _mapreduce(f::ComposedFunction{typeof(!), typeof(iszero)}, op::typeof(|), As::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; dims::Colon, init::Nothing)
    @ GPUArrays ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:73
 [13] _mapreduce
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:35 [inlined]
 [14] #mapreduce#29
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:31 [inlined]
 [15] mapreduce
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:31 [inlined]
 [16] any
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:82 [inlined]
 [17] _any
    @ ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:129 [inlined]
 [18] (::Flux.var"#337#338"{ComposedFunction{typeof(!), typeof(iszero)}})(x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ Flux ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:131
 [19] _any(f::Flux.var"#337#338"{ComposedFunction{typeof(!), typeof(iszero)}}, itr::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, #unused#::Colon)
    @ Base ./reduce.jl:1215
 [20] any
    @ ./reduce.jl:1210 [inlined]
 [21] _any
    @ ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:131 [inlined]
 [22] _all
    @ ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:135 [inlined]
 [23] _nan_show(io::IOContext{Base.TTY}, x::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Flux ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:120
 [24] _layer_show(io::IOContext{Base.TTY}, layer::Transformers.Layers.NSplit{Static.StaticInt{3}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, indent::Int64, name::Nothing)
    @ Flux ~/.julia/packages/Flux/n3cOc/src/layers/show.jl:86
 [25] _big_show (repeats 2 times)
    @ ~/.julia/packages/Transformers/694He/src/layers/utils.jl:117 [inlined]
 [26] _big_show(io::IOContext{Base.TTY}, layer::Transformers.Layers.SelfAttention{NeuralAttentionlib.CausalMultiheadQKVAttenOp{Nothing}, Transformers.Layers.NSplit{Static.StaticInt{3}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, indent::Int64, name::Nothing)
    @ Transformers.Layers ~/.julia/packages/Transformers/694He/src/layers/architecture.jl:363
 [27] _big_show
    @ ~/.julia/packages/Transformers/694He/src/layers/architecture.jl:361 [inlined]
--- the last 2 lines are repeated 2 more times ---
 [32] _big_show(io::IOContext{Base.TTY}, t::Transformer{NTuple{6, Transformers.Layers.PreNormTransformerBlock{Transformers.Layers.DropoutLayer{Transformers.Layers.SelfAttention{NeuralAttentionlib.CausalMultiheadQKVAttenOp{Nothing}, Transformers.Layers.NSplit{Static.StaticInt{3}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}, Transformers.Layers.DropoutLayer{Transformers.Layers.Chain{Tuple{Transformers.Layers.Dense{typeof(gelu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}}}, Nothing}, indent::Int64, name::Nothing)
    @ Transformers.Layers ~/.julia/packages/Transformers/694He/src/layers/layer.jl:254
 [33] _big_show
    @ ~/.julia/packages/Transformers/694He/src/layers/layer.jl:252 [inlined]
 [34] show(io::IOContext{Base.TTY}, m::MIME{Symbol("text/plain")}, x::Transformer{NTuple{6, Transformers.Layers.PreNormTransformerBlock{Transformers.Layers.DropoutLayer{Transformers.Layers.SelfAttention{NeuralAttentionlib.CausalMultiheadQKVAttenOp{Nothing}, Transformers.Layers.NSplit{Static.StaticInt{3}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}, Transformers.Layers.DropoutLayer{Transformers.Layers.Chain{Tuple{Transformers.Layers.Dense{typeof(gelu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Transformers.Layers.Dense{Nothing, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Nothing}, Transformers.Layers.LayerNorm{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32}}}, Nothing})
    @ Transformers.Layers ~/.julia/packages/Transformers/694He/src/layers/utils.jl:97
 [35] (::REPL.var"#55#56"{REPL.REPLDisplay{REPL.LineEditREPL}, MIME{Symbol("text/plain")}, Base.RefValue{Any}})(io::Any)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:276
 [36] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:557
 [37] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:262
 [38] display
    @ /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:281 [inlined]
 [39] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [40] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [41] invokelatest
    @ ./essentials.jl:813 [inlined]
 [42] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:305
 [43] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:287
 [44] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:557
 [45] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:285
 [46] (::REPL.var"#do_respond#80"{Bool, Bool, REPL.var"#93#103"{REPL.LineEditREPL, REPL.REPLHistoryProvider}, REPL.LineEditREPL, REPL.LineEdit.Prompt})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:899
 [47] (::REPL.var"#98#108"{Regex, Regex, Int64, Int64, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt})(::REPL.LineEdit.MIState, ::Any, ::Vararg{Any})
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:1236
 [48] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [49] invokelatest
    @ ./essentials.jl:813 [inlined]
 [50] (::REPL.LineEdit.var"#27#28"{REPL.var"#98#108"{Regex, Regex, Int64, Int64, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt, REPL.LineEdit.Prompt}, String})(s::Any, p::Any)
    @ REPL.LineEdit /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/LineEdit.jl:1603
 [51] prompt!(term::REPL.Terminals.TextTerminal, prompt::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/LineEdit.jl:2740
 [52] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/LineEdit.jl:2642
 [53] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /opt/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:1300

Interesting, that’s not actually the show code for CuArrays. Does trying to show any parameter array in m₂ raise an error? If not, how about running a host-side operation like adding 1 or taking a sum? Also, I presume this is the case, but are you running things from the default REPL and not, say, the VS Code one?

I run the stuff in the REPL, not in VS Code.
Which commands exactly would you like me to run?

Too difficult to type out whole code blocks on a phone, but I was thinking of just pulling out random parameter arrays from the model and running arbitrary GPU operations on them to see if the same error is thrown. If so, the problem is not in the Flux printing code (and then I’d recommend testing whether transferring arrays to multiple devices without Flux has the same issue).
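Something along these lines, written from memory and untested (I am assuming Flux.params works on Transformers.jl layers, which the stack trace above suggests it does):

device!(1)
ps = collect(Flux.params(m₂))   # parameter arrays of the second half
w = first(ps)                   # grab one CuArray
sum(w)                          # host-side reduction
w .+ 1                          # simple GPU kernel

# and without Flux/Transformers in the picture:
device!(0); a = CUDA.rand(Float32, 8)
device!(1); b = CUDA.rand(Float32, 8)
device!(0); sum(a)
device!(1); sum(b)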

OK, makes sense. I have to work now, but I will try it.
