Data Science lessons: Making "10 - Neural Networks" run on GPU?

Hi, I’ve been working through Dr. Nassar’s (@nassarhuda) excellent Data Science tutorial series on Julia Academy and just finished lesson 10 on Neural Nets (Neural Nets | JuliaAcademy).

I’m now trying to edit the notebook from the course so it will run on GPU (I have CUDA installed and an NVIDIA card), but so far no success. Can anyone advise?

After reading the GPU Support documentation, I made the following key changes to the lesson notebook:

using CUDA
...
m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu    # move the model to the GPU
...
datasetx = repeated((X, Y), 200) |> gpu    # intended to move the data to the GPU

…But then when I run the Flux.train! line, I get a long error trace:

opt = ADAM()
Flux.train!(loss, ps, datasetx, opt, cb = throttle(evalcb, 10))
ArgumentError: cannot take the CPU address of a CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}

“CPU address”? Huh? I thought I moved everything – the model & the data – over to the GPU. (With PyTorch, that would be sufficient. But I’m “New to Julia”.)

Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/sCev8/src/array.jl:315
  [2] gemm!(transA::Char, transB::Char, alpha::Float32, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float32}, beta::Float32, C::Matrix{Float32})
    @ LinearAlgebra.BLAS /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/blas.jl:1421
  [3] gemm_wrapper!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:671
  [4] mul!
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:169 [inlined]
  [5] mul!
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:275 [inlined]
  [6] *
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:160 [inlined]
  [7] rrule
    @ ~/.julia/packages/ChainRules/2moSB/src/rulesets/Base/arraymath.jl:60 [inlined]
  [8] rrule
    @ ~/.julia/packages/ChainRulesCore/IFusD/src/rules.jl:134 [inlined]
  [9] chain_rrule
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/chainrules.jl:216 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0 [inlined]
 [11] _pullback
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:9 [inlined]
 [12] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:147 [inlined]
 [13] _pullback(ctx::Zygote.Context, f::Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, args::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [14] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [15] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(softmax)}, ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [16] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:38 [inlined]
 [17] _pullback(ctx::Zygote.Context, f::Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(softmax)}}, args::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [18] _pullback
    @ ./In[21]:1 [inlined]
 [19] _pullback(::Zygote.Context, ::typeof(loss), ::Matrix{Float32}, ::Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [20] _apply
    @ ./boot.jl:814 [inlined]
 [21] adjoint
    @ ~/.julia/packages/Zygote/umM0L/src/lib/lib.jl:200 [inlined]
 [22] _pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
 [23] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:102 [inlined]
 [24] _pullback(::Zygote.Context, ::Flux.Optimise.var"#39#45"{typeof(loss), Tuple{Matrix{Float32}, Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}}}})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [25] pullback(f::Function, ps::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface.jl:352
 [26] gradient(f::Function, args::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface.jl:75
 [27] macro expansion
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:101 [inlined]
 [28] macro expansion
    @ ~/.julia/packages/Juno/n6wyj/src/progress.jl:134 [inlined]
 [29] train!(loss::Function, ps::Zygote.Params, data::Base.Iterators.Take{Base.Iterators.Repeated{Tuple{Matrix{Float32}, Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}}}}}, opt::ADAM; cb::Flux.var"#throttled#73"{Flux.var"#throttled#69#74"{Bool, Bool, var"#1#2", Int64}})
    @ Flux.Optimise ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:99
 [30] top-level scope
    @ In[26]:2
 [31] eval
    @ ./boot.jl:373 [inlined]
 [32] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196

I find it odd that, even after moving the model to the GPU, the model parameters still print as plain Float32 rather than as CuArrays… i.e. I was expecting ps = Flux.params(m) to yield not Params([Float32... but rather something indicating the parameters are CUDA arrays. :man_shrugging: (And running m = fmap(cu, m) made no difference there, BTW.)
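Maybe checking the concrete array type directly would be more telling than the printed output, since the element type shows as Float32 either way; e.g. something like:

ps = Flux.params(m)
typeof(first(ps))   # Matrix{Float32} on the CPU vs CuArray{Float32, ...} on the GPU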

Also, after moving datasetx via |> gpu (or alternatively, wrapping it in cu()), printing it only shows regular Float32-style types, nothing about CUDA or CuArray, etc.

…?

UPDATES:

  • Applying cu to datasetx via fmap also has no effect.
  • A regular map of cu over datasetx causes a CUDA Out Of Memory error.
  • The following two lines seem to be sufficient for success (consolidated sketch below):
    datasetx = repeated((X|>gpu, Y|>gpu), 200) (not what I had before) and evalcb = () -> @show(loss(X|>gpu, Y|>gpu))
  • Alternatively, moving X and Y individually to the GPU first, as in X = X |> gpu; Y = Y |> gpu, before the definitions of datasetx and evalcb, also seems sufficient.
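Putting the updates together, here is a minimal sketch of the setup that worked for me (untested beyond my notebook; X, Y, and the loss are the ones from the lesson):

using Flux, CUDA
using Flux: throttle
using Base.Iterators: repeated

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu           # model on the GPU

X = X |> gpu                # move the arrays themselves, not the iterator
Y = Y |> gpu
datasetx = repeated((X, Y), 200)

loss(x, y) = Flux.Losses.crossentropy(m(x), y)
ps = Flux.params(m)         # collect parameters AFTER moving the model
evalcb = () -> @show(loss(X, Y))
opt = ADAM()
Flux.train!(loss, ps, datasetx, opt, cb = throttle(evalcb, 10))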

Hi,

I do not see the full code, but I think you have some incorrect assumptions about how things work.

  1. If you move the model to the GPU, you get a new model (essentially a newly allocated struct). If your ps were collected from the CPU model, you need to re-collect them from the GPU model, again via Flux.params(m).
  2. When you run repeated((X, Y), 200), you are creating an iterator which will yield (X, Y) 200 times. When you pass that iterator through the gpu function, I do not think gpu dives into the iterator. This is why you need to do repeated(gpu((X, Y)), 200) instead; see the sketch below.
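Roughly, I mean something like this (just a sketch, with X and Y from your notebook):

using Flux, CUDA
using Base.Iterators: repeated

datasetx = repeated(gpu((X, Y)), 200)  # move the tuple's arrays to the GPU first, then repeat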

These are just my guesses at what might be wrong. Without an MWE, it is difficult to find all the errors.


You can try teaching Functors to look inside these iterators. Maybe nobody has thought to try this before, but it seems natural that it ought to work:

julia> using Flux, Functors

julia> Iterators.repeated([1,2,3],4) |> f64 |> collect
4-element Vector{Vector{Int64}}:
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]
 [1, 2, 3]

julia> Functors.@functor Iterators.Take

julia> Functors.@functor Iterators.Repeated

julia> Iterators.repeated([1,2,3],4) |> f64 |> collect
4-element Vector{Vector{Float64}}:
 [1.0, 2.0, 3.0]
 [1.0, 2.0, 3.0]
 [1.0, 2.0, 3.0]
 [1.0, 2.0, 3.0]
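With those two @functor definitions in place, I’d expect your original pipe to now move the iterator’s contents as well (untested):

julia> datasetx = Iterators.repeated((X, Y), 200) |> gpu;  # gpu/fmap can now recurse into Take & Repeated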

Thanks for your reply. I did include a hyperlink to the full code at the Julia Academy GitHub repo (see “the notebook from the course” above), and then just posted the few lines I changed.

  1. If you move the model to the GPU, you get a new model (essentially a newly allocated struct). If your ps were collected from the CPU model, you need to re-collect them from the GPU model, again via Flux.params(m).

Not sure I understand, because I thought that’s what I was already doing: even after the model was moved to the GPU via |> gpu, running Flux.params(m) still only returned non-GPU-style parameters.
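If I follow, the pitfall you’re describing would be this ordering (a sketch, with m as in the lesson):

ps = Flux.params(m)   # collected while m is still on the CPU
m  = m |> gpu         # rebinds m to a new GPU model; ps still holds the old CPU arrays
ps = Flux.params(m)   # one would need to re-collect here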

  2. When you run repeated((X, Y), 200), you are creating an iterator which will yield (X, Y) 200 times. When you pass that iterator through the gpu function, I do not think gpu dives into the iterator. This is why you need to do repeated(gpu((X, Y)), 200) instead.

Ahhhh… I didn’t realize it was an iterator!! That makes sense. If it had actually created a higher-dimensional array (or a view that behaved like one, e.g. akin to numpy’s tile operation), then presumably what I’d written would have worked.
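i.e. my mental model was an eager, tile-like expansion rather than a lazy one; roughly (with a toy stand-in):

using Base.Iterators: repeated

A = rand(Float32, 4, 3)   # toy stand-in for X

it  = repeated(A, 200)    # lazy: the same array, yielded 200 times; nothing copied
arr = repeat(A, 1, 200)   # eager, numpy-tile-style: a real 4×600 Matrix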

These are just my guesses at what might be wrong. Without an MWE, it is difficult to find all the errors.

Again, it’s just the original course notebook with the “diff” of the three changes posted above. Thanks for your input though, this helps!

Oooo, this is the first I’ve heard of Functors. I will look into that. Thank you for your helpful suggestion!