CuArrays Question (cannot take the CPU address of a CuArray{Float32,2,Nothing})

I’m new to CuArrays and there’s something I’m not quite understanding. I have a working example that pipes the data and the model to the GPU and trains there. But when I try to call the loss function outside of the training loop as loss(X, Y), I get the error

ArgumentError: cannot take the CPU address of a CuArray{Float32,2,Nothing}

even though I’ve previously moved X and Y to the GPU (with X |> gpu; Y |> gpu).

I found that the call loss(X |> gpu, Y |> gpu) works… but why should I have to do that if I’ve already moved X and Y to the GPU earlier in the code? Isn’t this horribly inefficient? The data doesn’t change, so I should only have to move it to the GPU once. It’s likely I’m misunderstanding something here, so any help would be appreciated! Minimum working example below.

using Flux
using CuArrays

num_samples, Ny, n = 50, 3, 5
X = rand(1,num_samples)
Y = rand(Ny, num_samples)
data = [(X[:,i], Y[:,i]) for i in 1:size(X,2)] |> gpu 
X |> gpu
Y |> gpu
m = Chain(Dense(size(X,1),n,relu),Dense(n,n,tanh),Dense(n,size(Y,1))) |> gpu
loss(x, y) = Flux.mse(m(x), y)
ps = Flux.params(m)

for i = 1:3
    println("Epoch "*string(i))
    Flux.train!(loss, ps, data, ADAM()) #Later: , cb=()->@show loss(X, Y)
end
@show loss(X |> gpu, Y |> gpu) # works
@show loss(X, Y) # does not work, even though X |> gpu before training

This produces the output below, with the error coming from the final line.

Epoch 1
Epoch 2
Epoch 3
loss(X |> gpu, Y |> gpu) = 0.13247906f0
ArgumentError: cannot take the CPU address of a CuArray{Float32,2,Nothing}

 [1] unsafe_convert(::Type{Ptr{Float32}}, ::CuArray{Float32,2,Nothing}) at C:\Users\username\.julia\packages\CuArrays\9n5uC\src\array.jl:226
 [2] gemm!(::Char, ::Char, ::Float32, ::CuArray{Float32,2,Nothing}, ::Array{Float32,2}, ::Float32, ::Array{Float32,2}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\blas.jl:1167
 [3] gemm_wrapper!(::Array{Float32,2}, ::Char, ::Char, ::CuArray{Float32,2,Nothing}, ::Array{Float32,2}, ::LinearAlgebra.MulAddMul{true,true,Bool,Bool}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\matmul.jl:597
 [4] mul! at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\matmul.jl:169 [inlined]
 [5] mul! at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\matmul.jl:208 [inlined]
 [6] * at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\LinearAlgebra\src\matmul.jl:160 [inlined]
 [7] (::Dense{typeof(relu),CuArray{Float32,2,Nothing},CuArray{Float32,1,Nothing}})(::Array{Float32,2}) at C:\Users\username\.julia\packages\Flux\Fj3bt\src\layers\basic.jl:122
 [8] Dense at C:\Users\username\.julia\packages\Flux\Fj3bt\src\layers\basic.jl:133 [inlined]
 [9] (::Dense{typeof(relu),CuArray{Float32,2,Nothing},CuArray{Float32,1,Nothing}})(::Array{Float64,2}) at C:\Users\username\.julia\packages\Flux\Fj3bt\src\layers\basic.jl:136
 [10] applychain at C:\Users\username\.julia\packages\Flux\Fj3bt\src\layers\basic.jl:36 [inlined]
 [11] (::Chain{Tuple{Dense{typeof(relu),CuArray{Float32,2,Nothing},CuArray{Float32,1,Nothing}},Dense{typeof(tanh),CuArray{Float32,2,Nothing},CuArray{Float32,1,Nothing}},Dense{typeof(identity),CuArray{Float32,2,Nothing},CuArray{Float32,1,Nothing}}}})(::Array{Float64,2}) at C:\Users\username\.julia\packages\Flux\Fj3bt\src\layers\basic.jl:38
 [12] loss(::Array{Float64,2}, ::Array{Float64,2}) at .\In[7]:11
 [13] top-level scope at show.jl:613
 [14] top-level scope at In[7]:19

Thanks to @rajnrao for a suggestion: rather than declaring X = rand(1,num_samples) and then calling X |> gpu on its own line, if I write X = rand(1,num_samples) |> gpu in one step (and do likewise for Y), then there is no need to pipe to the GPU again when calling the loss function!

Conceptually, piping to gpu just calls the gpu() function on the input, so maybe gpu(X) tells Julia to evaluate X on the GPU (but otherwise do nothing) whereas X = gpu(rand(1,num_samples)) tells Julia to create a 1 x num_samples array on the GPU, and then have X point to this. Does that sound right?
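
Not quite, as far as I can tell: x |> f is just another way of writing f(x), and a function like gpu returns a new array rather than changing what the name X refers to. So a bare X |> gpu builds a GPU copy and throws it away, while X = X |> gpu (or X = rand(1,num_samples) |> gpu) keeps it. A minimal sketch in plain Julia, with no GPU needed; fake_gpu is a hypothetical stand-in for gpu that converts to Float32 instead of moving data:

```julia
# `fake_gpu` is a hypothetical stand-in for Flux's `gpu`: like `gpu`, it returns
# a new array and leaves its argument alone. Here it only converts to Float32.
fake_gpu(x) = Float32.(x)

num_samples = 50
X = rand(1, num_samples)

X |> fake_gpu                       # identical to fake_gpu(X); the result is discarded
x_still_cpu = eltype(X) == Float64  # true: X still names the original array

X = X |> fake_gpu                   # rebind X to the converted array
x_converted = eltype(X) == Float32  # true: later uses of X see the new array
```

The same pattern applies to the example in the question: assigning the result (X = X |> gpu) is what makes the move stick.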

Interestingly, if I benchmark calling loss() with X, Y vs. with X |> gpu, Y |> gpu, it’s about the same speed. (Actually, it’s about 2% faster with the |> gpu pipes, strangely: the number of allocations is slightly higher but the speed is also slightly higher, across several tests.)

You are probably getting the same speed because, once X and Y are created on the GPU, both calls do the same work. In Flux there is no separate step that puts the loss function on the GPU: loss is an ordinary Julia function, and each operation runs wherever its arrays live. With the model and the data both on the GPU, m(x) and Flux.mse execute there in either case. Calling X |> gpu on an array that is already a CuArray should leave nothing to transfer, which would fit the slightly higher allocation count with essentially unchanged timing; a 2% difference is likely within benchmark noise.
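
One way to picture why an extra |> gpu in the call can be nearly free: a conversion applied to a value that is already in the target form can hand back the very same object. The sketch below shows this with base Julia's convert. Treating gpu on an existing CuArray as behaving the same way is my assumption, not something checked against the Flux or CuArrays source.

```julia
# Base-Julia analogy only: `convert` stands in for `gpu` here (an assumption,
# not Flux's actual implementation).
A = rand(Float32, 3, 50)           # already in the "target" form
B = convert(Array{Float32,2}, A)   # nothing to do, so no copy is made
same_object = B === A              # true: B is the very same array as A

C = rand(3, 50)                    # Float64, so real conversion work is needed
D = convert(Array{Float32,2}, C)   # allocates a new Float32 array
new_object = D !== C               # true: D is a fresh array
```

If gpu does behave like this, the only remaining cost per call is a little bookkeeping, which would be hard to distinguish from timing noise.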