Hi, I’ve been working through Dr. Nassar’s (@nassarhuda) excellent Data Science tutorial series on Julia Academy and just finished lesson 10 on Neural Nets (Neural Nets | JuliaAcademy).
I’m now trying to edit the notebook from the course so it will run on GPU (I have CUDA installed and an NVIDIA card), but so far no success. Can anyone advise?
After reading the GPU Support documentation, here are the key changes I made to the lesson notebook:
using CUDA
...
m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu
...
datasetx = repeated((X, Y), 200) |> gpu
…But then when I run the Flux.train! line, I get a big long error trace:
opt = ADAM()
Flux.train!(loss, ps, datasetx, opt, cb = throttle(evalcb, 10))
ArgumentError: cannot take the CPU address of a CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}
“CPU address”? Huh? I thought I moved everything – the model & the data – over to the GPU. (With PyTorch, that would be sufficient. But I’m “New to Julia”.)
Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float32}}, x::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/sCev8/src/array.jl:315
  [2] gemm!(transA::Char, transB::Char, alpha::Float32, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float32}, beta::Float32, C::Matrix{Float32})
    @ LinearAlgebra.BLAS /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/blas.jl:1421
  [3] gemm_wrapper!(C::Matrix{Float32}, tA::Char, tB::Char, A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float32}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:671
  [4] mul!
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:169 [inlined]
  [5] mul!
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:275 [inlined]
  [6] *
    @ /opt/julia-1.7.1/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:160 [inlined]
  [7] rrule
    @ ~/.julia/packages/ChainRules/2moSB/src/rulesets/Base/arraymath.jl:60 [inlined]
  [8] rrule
    @ ~/.julia/packages/ChainRulesCore/IFusD/src/rules.jl:134 [inlined]
  [9] chain_rrule
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/chainrules.jl:216 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0 [inlined]
 [11] _pullback
    @ ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:9 [inlined]
 [12] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:147 [inlined]
 [13] _pullback(ctx::Zygote.Context, f::Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, args::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [14] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:36 [inlined]
 [15] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(softmax)}, ::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [16] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/layers/basic.jl:38 [inlined]
 [17] _pullback(ctx::Zygote.Context, f::Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(identity), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, typeof(softmax)}}, args::Matrix{Float32})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [18] _pullback
    @ ./In[21]:1 [inlined]
 [19] _pullback(::Zygote.Context, ::typeof(loss), ::Matrix{Float32}, ::Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [20] _apply
    @ ./boot.jl:814 [inlined]
 [21] adjoint
    @ ~/.julia/packages/Zygote/umM0L/src/lib/lib.jl:200 [inlined]
 [22] _pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
 [23] _pullback
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:102 [inlined]
 [24] _pullback(::Zygote.Context, ::Flux.Optimise.var"#39#45"{typeof(loss), Tuple{Matrix{Float32}, Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}}}})
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface2.jl:0
 [25] pullback(f::Function, ps::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface.jl:352
 [26] gradient(f::Function, args::Zygote.Params)
    @ Zygote ~/.julia/packages/Zygote/umM0L/src/compiler/interface.jl:75
 [27] macro expansion
    @ ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:101 [inlined]
 [28] macro expansion
    @ ~/.julia/packages/Juno/n6wyj/src/progress.jl:134 [inlined]
 [29] train!(loss::Function, ps::Zygote.Params, data::Base.Iterators.Take{Base.Iterators.Repeated{Tuple{Matrix{Float32}, Flux.OneHotArray{UInt32, 10, 1, 2, Vector{UInt32}}}}}, opt::ADAM; cb::Flux.var"#throttled#73"{Flux.var"#throttled#69#74"{Bool, Bool, var"#1#2", Int64}})
    @ Flux.Optimise ~/.julia/packages/Flux/0c9kI/src/optimise/train.jl:99
 [30] top-level scope
    @ In[26]:2
 [31] eval
    @ ./boot.jl:373 [inlined]
 [32] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
I find it odd that, even after converting the model to GPU, the datatype of the model parameters is still just Float32 instead of CuArray… i.e. I was expecting ps = Flux.params(m) to yield not Params([Float32... but rather something that mentions the model parameters being CUDA variables. (And running m = fmap(cu, m) made no change to that, BTW.)
Also, after moving datasetx to the GPU with |> gpu (or alternatively, wrapping it in cu()), printing it out only shows regular Float32-style variable types, nothing about CUDA or CuArray, etc.
…?
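For anyone checking the same thing: a rough way to see where the arrays actually live is to ask for their types rather than rely on the printed values, since the printed form may just show the Float32 contents. This is only a sketch. The batch size of 8 and the random stand-in X and Y are assumptions for illustration, not the lesson's MNIST data, and it assumes the same Flux-era API used in the post (Flux.params, Dense(in, out, σ)).

```julia
using Flux, CUDA
using Base.Iterators: repeated

# Same model definition as in the post, moved to the GPU
m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

# Hypothetical stand-in data with the same shapes as the lesson's inputs and labels
X = rand(Float32, 28^2, 8)
Y = Flux.onehotbatch(rand(0:9, 8), 0:9)

# Container type of each parameter array (weights and biases of both Dense layers)
@show [typeof(p) for p in Flux.params(m)]

# Types of the first (x, y) pair that the training loop would actually receive
datasetx = repeated((X, Y), 200) |> gpu
x1, y1 = first(datasetx)
@show typeof(x1) typeof(y1)
```

If the parameter types mention CuArray while typeof(x1) is still a plain Matrix, that would match the stack trace above, where gemm! is called with a CuArray weight and a Matrix{Float32} input.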
UPDATES:
- fmap applying cu on datasetx also has no effect
- regular map of cu on datasetx causes a CUDA Out Of Memory error.
- the following two lines seem to be sufficient for success:
 datasetx = repeated((X|>gpu, Y|>gpu), 200) (not what I had before) and evalcb = () -> @show(loss(X|>gpu, Y|>gpu))
- Alternatively, moving X and Y individually to GPU first, as in X = X |> gpu; Y = Y |> gpu, before the definitions of datasetx and evalcb, also seems sufficient.
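Putting the last update together, here is a minimal sketch of the arrangement that seems to work. The loss definition, the batch size of 64, and the random stand-in X and Y are assumptions for illustration (the lesson uses MNIST and its own loss), and it assumes the same Flux-era API as above (ADAM, Flux.params, implicit-parameter Flux.train!).

```julia
using Flux, CUDA
using Flux: throttle
using Base.Iterators: repeated

# Hypothetical stand-in data; the lesson uses MNIST images and one-hot labels
X = rand(Float32, 28^2, 64)
Y = Flux.onehotbatch(rand(0:9, 64), 0:9)

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

# Assumed loss; substitute the definition from the lesson notebook
loss(x, y) = Flux.crossentropy(m(x), y)

# Move the arrays themselves to the GPU *before* building the iterator and callback
X = X |> gpu
Y = Y |> gpu
datasetx = repeated((X, Y), 200)
evalcb = () -> @show(loss(X, Y))

ps = Flux.params(m)
opt = ADAM()
Flux.train!(loss, ps, datasetx, opt, cb = throttle(evalcb, 10))
```

The only real difference from my first attempt is that gpu is applied to X and Y directly instead of to the repeated(...) iterator that wraps them, so every batch handed to the model is already on the same device as the weights.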