MNIST GPU CuArrays error

Hi, I’m trying to train MNIST on the GPU using Julia, but I run into the following error.

Here is my code:

X = hcat(float.(reshape.(imgs, :))...) |> gpu;
Y = onehotbatch(labels, 0:9) |> gpu;
batches = [(X[:, :, :, i], Y[:, i]) for i in partition(1:size(X, 4), 100)];

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
evalcb = throttle(() -> @show(accuracy(X, Y)), 600)
opt = ADAM(params(m))
@time @epochs 45 Flux.train!(loss, batches, opt, cb = evalcb)

Here is the error:

MethodError: no method matching *(::TrackedArray{…,CuArray{Float32,2}}, ::CuArray{Float32,4})
Closest candidates are:
  *(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:502
  *(::TrackedArray{T,2,A} where A where T, !Matched::TrackedArray{T,2,A} where A where T) at /home/fadi/.julia/packages/Flux/jsf3Y/src/tracker/array.jl:320
  *(::TrackedArray{T,2,A} where A where T, !Matched::TrackedArray{T,1,A} where A where T) at /home/fadi/.julia/packages/Flux/jsf3Y/src/tracker/array.jl:324
  ...

Stacktrace:
 [1] (::Dense{typeof(relu),TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,1}}})(::CuArray{Float32,4}) at /home/fadi/.julia/packages/Flux/jsf3Y/src/layers/basic.jl:80
 [2] (::getfield(Flux, Symbol("##60#61")))(::CuArray{Float32,4}, ::Dense{typeof(relu),TrackedArray{…,CuArray{Float32,2}},TrackedArray{…,CuArray{Float32,1}}}) at /home/fadi/.julia/packages/Flux/jsf3Y/src/layers/basic.jl:31
 [3] mapfoldl_impl(::typeof(identity), ::getfield(Flux, Symbol("##60#61")), ::NamedTuple{(:init,),Tuple{CuArray{Float32,4}}}, ::Array{Any,1}) at ./reduce.jl:43
 [4] #mapfoldl#170 at ./reduce.jl:70 [inlined]
 [5] #mapfoldl at ./none:0 [inlined]
 [6] #foldl#171 at ./reduce.jl:88 [inlined]
 [7] #foldl at ./none:0 [inlined]
 [8] (::Chain)(::CuArray{Float32,4}) at /home/fadi/.julia/packages/Flux/jsf3Y/src/layers/basic.jl:31
 [9] loss(::CuArray{Float32,4}, ::Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}) at ./In[18]:6
 [10] #train!#121(::getfield(Flux, Symbol("#throttled#18")){getfield(Flux, Symbol("##throttled#10#14")){Bool,Bool,getfield(Main, Symbol("##12#13")),Int64}}, ::Function, ::Function, ::Array{Tuple{CuArray{Float32,4},Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}},1}, ::getfield(Flux.Optimise, Symbol("##43#47"))) at /home/fadi/.julia/packages/Juno/46C8i/src/progress.jl:109
 [11] (::getfield(Flux.Optimise, Symbol("#kw##train!")))(::NamedTuple{(:cb,),Tuple{getfield(Flux, Symbol("#throttled#18")){getfield(Flux, Symbol("##throttled#10#14")){Bool,Bool,getfield(Main, Symbol("##12#13")),Int64}}}}, ::typeof(Flux.Optimise.train!), ::Function, ::Array{Tuple{CuArray{Float32,4},Flux.OneHotMatrix{CuArray{Flux.OneHotVector,1}}},1}, ::Function) at ./none:0
 [12] top-level scope at /home/fadi/.julia/packages/Juno/46C8i/src/progress.jl:109

This looks like a Flux issue; it would be a better fit for the Machine Learning category.

Try following this guide and ignore the bits that are Windows-only:

Got the MNIST to work

Hi again,

I was able to fix the error; it was a simple dimension issue, since X is 2D, not 4D.
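
For reference, here is the corrected batching as a minimal sketch: each column of the 2D matrix X is one flattened image, so the batches slice the second dimension rather than a (nonexistent) fourth one:

# X is 784×60000, one flattened image per column
batches = [(X[:, i], Y[:, i]) for i in partition(1:size(X, 2), 100)];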

However, the processing time on the GPU is much, much longer than on the CPU!

My GPU is a modest one, an NVIDIA Quadro K420. Is this a known issue, or something else?

Please provide fully runnable examples, as well as the times you get on CPU and GPU.

Also see onecold is very slow · Issue #556 · FluxML/Flux.jl · GitHub, which can be a big bottleneck. Try moving the arrays to the CPU before calling onecold.

Here is the full code:

imgs = MNIST.images();
labels = MNIST.labels();

ON CPU:

X = hcat(float.(reshape.(imgs, :))...);
Y = onehotbatch(labels, 0:9);
batches = [(X[:, i], Y[:, i]) for i in partition(1:size(X, 2), 100)];

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax)

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
evalcb = throttle(() -> @show(accuracy(X, Y)), 1)
opt = ADAM(params(m))
@epochs 10 Flux.train!(loss, batches, opt, cb = evalcb)

It goes from epoch 1 to 10 in 15.583189 seconds:

┌ Info: Epoch 10
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9737666666666667
15.583189 seconds (39.12 M allocations: 7.897 GiB, 8.83% gc time)

ON GPU:

X = hcat(float.(reshape.(imgs, :))...) |> gpu;
Y = onehotbatch(labels, 0:9) |> gpu;
batches = [(X[:, i], Y[:, i]) for i in partition(1:size(X, 2), 100)];

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
evalcb = throttle(() -> @show(accuracy(X, Y)), 1)
opt = ADAM(params(m))

@epochs 10 Flux.train!(loss, batches, opt, cb = evalcb)

It takes more than 10 minutes to finish one epoch, and I also get the warning below:

┌ Info: Epoch 1
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception = (CUDAnative.MethodSubstitutionWarning(exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, exp(x::Float32) in CUDAnative at /home/fadi/.julia/packages/CUDAnative/AGfq2/src/device/libdevice.jl:90), Base.StackTraces.StackFrame[exp at exp.jl:75, mapreducedim_kernel_parallel at mapreduce.jl:29])
└ @ CUDAnative /home/fadi/.julia/packages/CUDAnative/AGfq2/src/compiler/irgen.jl:111

This is still not the full code, because copy-pasting it gives errors.

Do you have cuDNN installed?

What happens if you do:

julia> using CuArrays

julia> CuArrays.libcudnn
"C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.0\\bin\\cudnn64_7.DLL"

Also, try removing the evalcb callback as a test (or at least move the arrays to the CPU before computing onecold, to rule out onecold is very slow · Issue #556 · FluxML/Flux.jl · GitHub).

Hi kristoffer,

The forum wouldn’t let me post the full code, since there were a lot of “@” characters and it treats them as mentions, but anyway.

These are the libraries I used, and I got no errors while importing CuArrays:

using Flux, Flux.Data.MNIST, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
using CuArrays, CUDAnative
using Images # Not so important here
using Flux: @epochs

If you could guide me with a snippet of code showing how to “move the arrays to the CPU before computing onecold”, that would be very helpful, as I don’t understand how to do it.

Please see PSA: how to quote code with backticks on how to quote your code to make it more readable to others.

For me on the GPU, running without the callback takes 7 seconds for 10 epochs on a 2080 Ti. With the callback, it takes longer than I have the patience to wait. To run onecold on the CPU, do:

accuracy(x, y) = mean(onecold(cpu(m(x))) .== onecold(cpu(y)))
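
(This helps because onecold on a GPU array ends up doing slow element-by-element indexing; m(x) and y are small, so copying them back to host memory first is cheap. See the issue linked above.)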

Hi kristoffer,

I removed the callback and applied the CPU trick. It now runs faster than before (at least I can wait for it to finish), but it is still much slower than on the CPU.
And the warning is still there, which I believe is the problem.

GPU result:

X = hcat(float.(reshape.(imgs, :))...) |> gpu;
Y = onehotbatch(labels, 0:9) |> gpu;
batches = [(X[:, i], Y[:, i]) for i in partition(1:size(X, 2), 100)];


m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax) |> gpu

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(cpu(m(x))) .== onecold(cpu(y)))
#accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
opt = ADAM(params(m))

@time @epochs 10 Flux.train!(loss, batches, opt)


┌ Info: Epoch 1
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception = (CUDAnative.MethodSubstitutionWarning(exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, exp(x::Float32) in CUDAnative at /home/fadi/.julia/packages/CUDAnative/AGfq2/src/device/libdevice.jl:90), Base.StackTraces.StackFrame[exp at exp.jl:75, mapreducedim_kernel_parallel at mapreduce.jl:29])
└ @ CUDAnative /home/fadi/.julia/packages/CUDAnative/AGfq2/src/compiler/irgen.jl:111
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception = (CUDAnative.MethodSubstitutionWarning(exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, exp(x::Float32) in CUDAnative at /home/fadi/.julia/packages/CUDAnative/AGfq2/src/device/libdevice.jl:90), Base.StackTraces.StackFrame[exp at exp.jl:75, mapreducedim_kernel_parallel at mapreduce.jl:29])
└ @ CUDAnative /home/fadi/.julia/packages/CUDAnative/AGfq2/src/compiler/irgen.jl:111
┌ Info: Epoch 2
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 3
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 4
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 5
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 6
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 7
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 8
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 9
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 10
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

506.988604 seconds (329.84 M allocations: 14.589 GiB, 0.63% gc time)

CPU result:

X = hcat(float.(reshape.(imgs, :))...);
Y = onehotbatch(labels, 0:9); 
batches=[(X[:,i],Y[:,i]) for i in partition(1:size(X,2),100)];

m = Chain(
  Dense(28^2, 32, relu),
  Dense(32, 10),
  softmax)

loss(x, y) = crossentropy(m(x), y)
accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))
evalcb = throttle(() -> @show(accuracy(X, Y)), 1)
opt = ADAM(params(m))

@time @epochs 10 Flux.train!(loss, batches, opt, cb = evalcb)


┌ Info: Epoch 1
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.11203333333333333

┌ Info: Epoch 2
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9235166666666667

┌ Info: Epoch 3
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.94425

┌ Info: Epoch 4
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9515833333333333

┌ Info: Epoch 5
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9593833333333334

┌ Info: Epoch 6
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Info: Epoch 7
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9611333333333333

┌ Info: Epoch 8
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9678333333333333

┌ Info: Epoch 9
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.97085

┌ Info: Epoch 10
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93

accuracy(X, Y) = 0.9727
 10.177217 seconds (17.96 M allocations: 6.846 GiB, 10.69% gc time)

Again, you didn’t say whether you have cuDNN installed.

Hi Dear,

I’m sorry, I missed this. Yes, I installed cuDNN as per this guide, and nothing changed:

https://stackoverflow.com/questions/31326015/how-to-verify-cudnn-installation/51202754

If there is a way to confirm a proper installation of cuDNN, please let me know.

I still get this warning:

┌ Info: Epoch 1
└ @ Main /home/fadi/.julia/packages/Flux/jsf3Y/src/optimise/train.jl:93
┌ Warning: calls to Base intrinsics might be GPU incompatible
│   exception = (CUDAnative.MethodSubstitutionWarning(exp(x::T) where T<:Union{Float32, Float64} in Base.Math at special/exp.jl:75, exp(x::Float32) in CUDAnative at /home/fadi/.julia/packages/CUDAnative/AGfq2/src/device/libdevice.jl:90), Base.StackTraces.StackFrame[exp at exp.jl:75, mapreducedim_kernel_parallel at mapreduce.jl:29])
└ @ CUDAnative /home/fadi/.julia/packages/CUDAnative/AGfq2/src/compiler/irgen.jl:111

If you don’t recall having signed up for the cuDNN developer program, downloading the cuDNN files, and extracting them to a particular location, then you probably don’t have cuDNN.

Have you tried following the guide step by step?

Hi Xiaodai,

  • I did that: I installed the cuDNN binaries and copied the libraries into the CUDA folders as described, then restarted the server. It didn’t work.

  • Then I tried to download and install the .deb package, restarted the server again, and that didn’t work either.

  • In the guide above, he says I need to rebuild CuArrays and Flux after installing everything, so I’ll try that (see the snippet below) and update you tomorrow, as I can’t access the machine right now.
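
For reference, a minimal sketch of that rebuild step using the Julia package manager:

using Pkg
Pkg.build("CuArrays")   # re-runs the build script so it picks up the newly installed cuDNN
Pkg.build("Flux")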

Yes, it is needed, and that is why I asked you to check it:
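
julia> using CuArrays

julia> CuArrays.libcudnn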

Hi Dear,

Thanks a lot!
I think the problem is solved after the rebuild.

Before, this line returned nothing, but now it prints the following:

CuArrays.libcudnn
"/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so"

Now the weird thing is that, without the callback function, 10 epochs take:

CPU:

7.249866 seconds (3.60 M allocations: 6.050 GiB, 13.91% gc time)

GPU:

12.337841 seconds (15.14 M allocations: 599.245 MiB, 2.29% gc time)

Could this be due to the GPU model?

I don’t think that’s 10 epochs. The code prints every 10 seconds, and I believe the mini-batch size was 1000. The GPU was faster, so it finished within 20 seconds, but only one epoch. I am guessing.

It took 7 seconds for me on a 2080 Ti, so maybe.
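
One way to sanity-check the raw throughput of the card independently of Flux is to time a plain matrix multiply. A minimal sketch (the 2000×2000 size is an arbitrary choice; collecting the result back to the host forces the GPU work to finish before @time stops):

using CuArrays

A  = rand(Float32, 2000, 2000)
dA = cu(A)               # copy the matrix to the GPU

A * A; dA * dA;          # warm up: the first call includes compilation overhead
@time A * A;             # CPU matmul
@time collect(dA * dA);  # GPU matmul plus copy back, which forces synchronization

A Dense(28^2, 32) network with batch size 100 barely occupies a GPU, so on an entry-level card like the K420 the per-kernel launch and transfer overhead can easily outweigh any speedup.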

Hi Dear,

The last result was without the callback for both CPU and GPU, to make sure there was no overhead, so the code was like this:

@time @epochs 10 Flux.train!(loss, batches, opt)