The mini-batches are about 6000, and each one takes about 15 seconds to train with ADAM or RMSProp, which seems painfully slow.
My colleague wrote a PyTorch version which runs nearly 100 times faster on the same machine.
There must be a bottleneck somewhere, but I can’t seem to find it.
I am using a cross-entropy loss function (slightly modified, but compatible with Zygote).
I timed the evaluation of the loss function on the entire dataset, and it took only about 0.3 seconds.
Does anyone have any idea why it should be running so slowly?
No, I’m not using a GPU. I know that when I first loaded Flux there was an error relating to the GPU, but I tried again and it seemed to work. Maybe it didn’t load properly.
If I restart and type using Flux, I get a complaint about CUDA and NVIDIA drivers:
┌ Warning: CUDA.jl only supports NVIDIA drivers for CUDA 9.0 or higher (yours is for CUDA 6.5.0)
└ @ CUDA /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:107
InitError: Could not find a suitable CUDA installation
during initialization of module Flux
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] runtime_init() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:110
[3] (::CUDA.var"#609#610"{Bool})() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:32
[4] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at ./lock.jl:161
[5] _functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:26
[6] functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:19
[7] functional at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:18 [inlined]
[8] init() at /home/davide/.julia/packages/Flux/05b38/src/Flux.jl:53
[9] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
[10] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
[11] _require(::Base.PkgId) at ./loading.jl:1007
[12] require(::Base.PkgId) at ./loading.jl:928
[13] require(::Module
But if I type using Flux again, there is no complaint. I ignored this because I don’t use a GPU, but could it have other ramifications? Any idea about how to do a clean install?
I think it should be fine, but this is a performance problem. If your colleague is using a GPU, that could explain the difference.
Edit: sorry, I didn’t see that it’s the same machine.
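If you do want to try a clean reinstall anyway, here is a rough sketch using the standard Pkg commands from the Julia REPL (this only reinstalls Flux; it won’t fix the driver warning, which comes from the system’s NVIDIA driver):

    using Pkg
    Pkg.rm("Flux")      # drop Flux from the current environment
    Pkg.gc()            # optionally clear unused package versions from the depot
    Pkg.add("Flux")     # add it back (CUDA is pulled in again as a dependency)
    Pkg.build("Flux")   # rerun the package build step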
It’s kind of a fishing expedition without the code.
I can clean up the code and include it here in a few minutes, but there is only the model (given above), the loss function (which I have benchmarked), and the call to train! with ADAM or RMSProp:
function xent_loss(x0, y0)
    mc = m1(x0) .- mean(m1(x0))
    pden = exp.(mc) * y0[2:end, :]
    p = exp.(mc) ./ pden
    lossc = -sum(y0[1, :]' .* log.(p))
    return lossc
end
for epoch_idx in 1:maxiters
    Flux.train!(xent_loss, Flux.params(m1), tr_data, opt; cb=cb)
    cb()
end
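To isolate the slow part, here is a minimal sketch of how the forward pass and a single gradient evaluation could be timed separately (assuming x0, y0 is one mini-batch taken from tr_data):

    x0, y0 = first(tr_data)                                    # one mini-batch
    @time xent_loss(x0, y0)                                    # forward pass only
    @time gradient(() -> xent_loss(x0, y0), Flux.params(m1))   # forward + backward pass, as in train!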
Thanks, but I’m sorry, I didn’t follow any of what you said, and I couldn’t see any reference to ‘views’ in the link.
Also, I checked, and updating the CUDA driver in Ubuntu is very difficult (7 steps involving patches, setting environment variables, etc.). If I had a system administrator he or she could do it, but I don’t.
But in any case, the whole evaluation of the loss function even on the entire training set takes only 0.3 seconds. Why should one gradient step take 15 seconds?
BTW, I would feel better if Flux loaded properly. Any idea how I might achieve that? [Barring upgrading the CUDA drivers, which is too hard for me.]
As the author of BetterExp: don’t use it here (at least not yet). The speedup over Base exp is at most about 10x, so that isn’t the main problem. The main problem is almost certainly CPU vs. GPU. A secondary problem is memory allocation, which should be fixed with @views. Furthermore, with a little bit of luck, BetterExp’s improvements will be merged into Base by 1.6, at which point the library will be obsolete.
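To make the @views point concrete, here is a rough sketch of what the loss might look like with the slices of y0 taken as views rather than copies, and with the model and exp.(mc) evaluated only once per call (xent_loss_views is just a placeholder name; the logic is otherwise the same as the original):

    using Statistics: mean   # mean, as in the original loss

    function xent_loss_views(x0, y0)
        out = m1(x0)                # evaluate the model once
        mc  = out .- mean(out)
        pe  = exp.(mc)              # compute exp.(mc) once and reuse it
        @views begin
            pden  = pe * y0[2:end, :]           # y0[2:end, :] is a view, not a copy
            p     = pe ./ pden
            lossc = -sum(y0[1, :]' .* log.(p))  # y0[1, :] is also a view
        end
        return lossc
    end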