Training with Flux.jl on the GPU causes ArgumentError: cannot take the CPU address of a CuArray

lfenzo · May 26, 2022, 11:12pm

Hello!

I’m trying to train a neural network using Flux.jl on the GPU with CUDA.jl. My issue (see error below) is that there seems to be some problem during the data transfer process between the GPU and CPU which causes the references to arrays to be lost. Even though I have no experience with GPU programming, it seems logical that both data and the model should be transferred to GPU memory in order to perform training and inference, but I think I must be missing something. This is the code I’m using to perform the training:

using Flux
using JLD2
using CUDA
using FileIO

function train(xtrain, ytrain, epochs::Integer) 

    input_shape = size(xtrain, 1)
    dataloader = Flux.DataLoader((xtrain, ytrain),
                                 batchsize = 64,
                                 shuffle = true) |> gpu

    model = Chain(
        Dense(input_shape => 256, relu),
        Dense(256 => 32, relu),
        Dense(32 => 1, sigmoid),
    ) |> gpu

    loss(x, y) = Flux.Losses.mse(model(x), y)
    optimizer = Flux.ADAM()
    trainable_params = Flux.params(model)

    for epoch in 1:epochs
        Flux.train!(loss, trainable_params, dataloader, optimizer)
    end
end

function main()
    (xtrain, ytrain) = FileIO.load("../data/tmp/nn_ready.jld2", "xtrain", "ytrain")
    (xtest, ytest) = FileIO.load("../data/tmp/nn_ready.jld2", "xtest", "ytest")
    train(xtrain, ytrain, 3)
end

main()

Which produces the following error:

ERROR: LoadError: ArgumentError: cannot take the CPU address of a CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}

Full error stacktrace

ERROR: LoadError: ArgumentError: cannot take the CPU address of a CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}
Stacktrace:
  [1] unsafe_convert(#unused#::Type{Ptr{Float64}}, x::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/fAEDi/src/array.jl:319
  [2] gemm!(transA::Char, transB::Char, alpha::Float64, A::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float64}, beta::Float64, C::Matrix{Float64})
    @ LinearAlgebra.BLAS /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/blas.jl:1421
  [3] gemm_wrapper!(C::Matrix{Float64}, tA::Char, tB::Char, A::CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float64}, _add::LinearAlgebra.MulAddMul{true, true, Bool, Bool})
    @ LinearAlgebra /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:671
  [4] mul!
    @ /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:169 [inlined]
  [5] mul!
    @ /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:275 [inlined]
  [6] *(A::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, B::Matrix{Float64})
    @ LinearAlgebra /usr/share/julia/stdlib/v1.7/LinearAlgebra/src/matmul.jl:160
  [7] rrule
    @ ~/.julia/packages/ChainRules/MsNLy/src/rulesets/Base/arraymath.jl:64 [inlined]
  [8] rrule
    @ ~/.julia/packages/ChainRulesCore/RbX5a/src/rules.jl:134 [inlined]
  [9] chain_rrule
    @ ~/.julia/packages/Zygote/DkIUK/src/compiler/chainrules.jl:217 [inlined]
 [10] macro expansion
    @ ~/.julia/packages/Zygote/DkIUK/src/compiler/interface2.jl:0 [inlined]
 [11] _pullback
    @ ~/.julia/packages/Zygote/DkIUK/src/compiler/interface2.jl:9 [inlined]
 [12] _pullback
    @ ~/.julia/packages/Flux/6Q5r4/src/layers/basic.jl:159 [inlined]
 [13] macro expansion
    @ ~/.julia/packages/Flux/6Q5r4/src/layers/basic.jl:53 [inlined]
 [14] _pullback
    @ ~/.julia/packages/Flux/6Q5r4/src/layers/basic.jl:53 [inlined]
 [15] _pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(σ), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, ::Matrix{Float64})
    @ Zygote ~/.julia/packages/Zygote/DkIUK/src/compiler/interface2.jl:0
 [16] _pullback
    @ ~/.julia/packages/Flux/6Q5r4/src/layers/basic.jl:51 [inlined]
 [17] _pullback
    @ ~/Documents/projetos/dl-solar-sao-paulo/src/train_model.jl:19 [inlined]
 [18] _pullback(::Zygote.Context, ::var"#loss#19"{Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(σ), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}, ::Matrix{Float64}, ::LinearAlgebra.Adjoint{Float64, Vector{Float64}})
    @ Zygote ~/.julia/packages/Zygote/DkIUK/src/compiler/interface2.jl:0
 [19] _apply
    @ ./boot.jl:814 [inlined]
 [20] adjoint
    @ ~/.julia/packages/Zygote/DkIUK/src/lib/lib.jl:204 [inlined]
 [21] _pullback
    @ ~/.julia/packages/ZygoteRules/AIbCs/src/adjoint.jl:65 [inlined]
 [22] _pullback
    @ ~/.julia/packages/Flux/6Q5r4/src/optimise/train.jl:120 [inlined]
 [23] _pullback(::Zygote.Context, ::Flux.Optimise.var"#37#40"{var"#loss#19"{Chain{Tuple{Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(relu), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Dense{typeof(σ), CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}, Tuple{Matrix{Float64}, LinearAlgebra.Adjoint{Float64, Vector{Float64}}}})
    @ Zygote ~/.julia/packages/Zygote/DkIUK/src/compiler/interface2.jl:0
 [24] pullback(f::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/DkIUK/src/compiler/interface.jl:352
 [25] gradient(f::Function, args::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}})
    @ Zygote ~/.julia/packages/Zygote/DkIUK/src/compiler/interface.jl:75
 [26] macro expansion
    @ ~/.julia/packages/Flux/6Q5r4/src/optimise/train.jl:119 [inlined]
 [27] macro expansion
    @ ~/.julia/packages/ProgressLogging/6KXlp/src/ProgressLogging.jl:328 [inlined]
 [28] train!(loss::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, data::MLUtils.DataLoader{Tuple{LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, LinearAlgebra.Adjoint{Float64, Vector{Float64}}}, Random._GLOBAL_RNG}, opt::ADAM; cb::Flux.Optimise.var"#38#41")
    @ Flux.Optimise ~/.julia/packages/Flux/6Q5r4/src/optimise/train.jl:117
 [29] train!(loss::Function, ps::Zygote.Params{Zygote.Buffer{Any, Vector{Any}}}, data::MLUtils.DataLoader{Tuple{LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, LinearAlgebra.Adjoint{Float64, Vector{Float64}}}, Random._GLOBAL_RNG}, opt::ADAM)
    @ Flux.Optimise ~/.julia/packages/Flux/6Q5r4/src/optimise/train.jl:114
 [30] train(xtrain::LinearAlgebra.Adjoint{Float64, Matrix{Float64}}, ytrain::LinearAlgebra.Adjoint{Float64, Vector{Float64}}, epochs::Int64)
    @ Main ~/Documents/projetos/dl-solar-sao-paulo/src/train_model.jl:24
 [31] main()
    @ Main ~/Documents/projetos/dl-solar-sao-paulo/src/train_model.jl:31
 [32] top-level scope
    @ ~/Documents/projetos/dl-solar-sao-paulo/src/train_model.jl:35
 [33] include(fname::String)
    @ Base.MainInclude ./client.jl:451
 [34] top-level scope
    @ REPL[7]:1
 [35] top-level scope
    @ ~/.julia/packages/CUDA/fAEDi/src/initialization.jl:52
in expression starting at /var/home/enzo/Documents/projetos/dl-solar-sao-paulo/src/train_model.jl:35

I thought I was possibly forgetting some |> gpu function in some of the other variables such as trainable_params, optimizer and the loss() function so I put it virtually everywhere but still the error persisted.

Any ideas on that I may be missing (or how to properly train neural nets on GPU)?

Thanks in advance!

Packages and environment info

Julia

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

Packages

(@v1.7) pkg> status
      Status `~/.julia/environments/v1.7/Project.toml`
  [c7e460c6] ArgParse v1.1.4
  [6e4b80f9] BenchmarkTools v1.3.1
  [336ed68f] CSV v0.10.4
  [052768ef] CUDA v3.10.0
  [13f3f980] CairoMakie v0.8.1
  [a93c6f00] DataFrames v1.3.4
  [1313f7d8] DataFramesMeta v0.11.0
  [864edb3b] DataStructures v0.18.12
  [e30172f5] Documenter v0.27.16
  [5789e2e9] FileIO v1.14.0
  [587475ba] Flux v0.13.1
  [f7bf1975] Impute v0.6.8
  [033835bb] JLD2 v0.4.22
  [2b0e0bc5] LanguageServer v4.2.0
  [add582a8] MLJ v0.18.2
  [14b8a8f1] PkgTemplates v0.7.26
  [295af30f] Revise v3.3.3
  [1277b4bf] ShiftedArrays v1.0.0
  [b3cc710f] StaticLint v8.1.0
  [69024149] StringEncodings v0.3.5
  [cf896787] SymbolServer v7.2.0
  [a5390f91] ZipFile v0.9.4

CUDA.jl

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 510.68.2, for CUDA 11.6
CUDA driver 11.6

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+510.68.2
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.2
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce MX150 (sm_61, 3.885 GiB / 4.000 GiB available)

contradict · May 27, 2022, 12:07am

It doesn’t look like your data is one the GPU. Maybe you need to load the data on the GPU before constructing the DataLoader?

dataloader = Flux.DataLoader((xtrain |> gpu, ytrain |> gpu),
                             batchsize = 64,
                             shuffle = true)

Karthik-d-k · May 27, 2022, 2:58am

Yes, I also think the same and i always do something like below while creating DataLoaders (iff, whole dataset fits onto gpu completely)–>

dataloader = DataLoader(gpu.(collect.((xtrain, ytrain))), batchsize=64, shuffle=true)

wherein, collect. creates vector of (x, y) and gpu. will load them onto the gpu…!! (. after function names is imp in both cases)

lfenzo · May 28, 2022, 2:22pm

Thanks @contradict and @Karthik-d-k !

As you pointed, it turned out that the data was not properly loaded into the GPU. Using both ((xtrain |> gpu, ytrain |> gpu)) and gou.(collect.((xtrain, ytrain))) approaches inside the DataLoader worked for me and had virtually the same performance.

The curious thing though (or ironic to say the least) is that my GPU is outperformed by the CPU when using 4 threads. Benchmark shows ~1.0K batch iterations per second in the GPU against ~1.5K on the CPU. I guess this makes sense for a notebook GPU (Nvidia MX150 4GB with 384 CUDA Cores)

Karthik-d-k · May 28, 2022, 4:30pm

That’s interesting !!
Would this be because of using batchsize = 64 and if we increase it to max out GPU memory, we would be able to squeeze more performance out of GPU

Topic		Replies	Views
Flux.jl: training fails at GPU but works on CPU Machine Learning gpu , flux	1	642	September 19, 2019
Simple Flux NN + GPU error New to Julia question	2	2218	March 21, 2019
MNIST GPU CuArrays error GPU	23	3158	January 22, 2019
Multi-headed Network on GPU with Flux Machine Learning flux	0	700	June 24, 2019
Data Science lessons: Making "10 - Neural Networks" run on GPU? New to Julia gpu , flux	4	762	January 14, 2022

Training with Flux.jl on the GPU causes ArgumentError: cannot take the CPU address of a CuArray

Related topics