My Flux application is painfully slow

My Flux application runs painfully slowly.

My training set has about 250,000 samples, each with about 100 inputs, with the following setup:

Chain(Dense(100,40,relu), Dense(40,40,relu), Dense(40,1,identity))

The mini-batches are about 6,000 samples each, and each one takes about 15 seconds to train using ADAM or RMSProp, which seems painfully slow.

My colleague wrote a PyTorch version which runs nearly 100 times faster on the same machine.

There must be a bottleneck somewhere, but I can’t seem to find it.

I am using a cross-entropy loss function (slightly modified, but compatible with Zygote).
I timed the evaluation of the loss function over the entire dataset and it was only about 0.3 seconds.

Does anyone have any idea why it should be running so slowly?

Thanks for any hints!


Could you post some more code so others could test it?
Are you using a GPU?

No, I’m not using a GPU. I know that when I first loaded Flux there was an error relating to the GPU, but I tried again and it seemed to work. Maybe it didn’t load properly.

If I restart and type using Flux, I get a complaint about CUDA and NVidia drivers:

┌ Warning: CUDA.jl only supports NVIDIA drivers for CUDA 9.0 or higher (yours is for CUDA 6.5.0)
└ @ CUDA /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:107

InitError: Could not find a suitable CUDA installation
during initialization of module Flux

[1] error(::String) at ./error.jl:33
[2] runtime_init() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:110
[3] (::CUDA.var"#609#610"{Bool})() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:32
[4] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at ./lock.jl:161
[5] _functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:26
[6] functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:19
[7] functional at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:18 [inlined]
[8] init() at /home/davide/.julia/packages/Flux/05b38/src/Flux.jl:53
[9] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
[10] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
[11] _require(::Base.PkgId) at ./loading.jl:1007
[12] require(::Base.PkgId) at ./loading.jl:928
[13] require(::Module

But if I type using Flux again, there is no complaint. I ignored this because I don’t use a GPU, but could it have other ramifications? Any idea how to do a clean install?

I think it should be fine, but this is a performance problem. If your colleague is using a GPU, that could explain the difference.
Edit: sorry, I didn’t see that it’s the same machine.

It’s kind of a fishing expedition without the code.

I can clean up the code and include it here in a few minutes, but there is only the model (given above), the loss function (which I have benchmarked), and the call to train! with ADAM or RMSProp:

function xent_loss(x0, y0)
    # ... loss computation (elided) ...
    return lossc
end

for epoch_idx in 1:maxiters
    Flux.train!(xent_loss, Flux.params(m1), tr_data, opt; cb = cb)
end

You’ll need to upgrade your NVIDIA driver to something more recent.

OK - thanks. I didn’t think that would matter.

I didn’t think I was using GPUs, but am I?

Maybe PyTorch (in Python) was using a compatible system.

I don’t know how to upgrade on Ubuntu, but I can take some time and figure it out.


I think the slices in the loss allocate, so maybe using views would be faster. Check out the section on allocation here:

Thanks, but I’m sorry, I didn’t follow any of what you said, and I couldn’t see any reference to ‘views’ in the link.

Also, I checked, and updating the CUDA driver in Ubuntu is very difficult (7 steps involving patches, setting environment variables, etc.). If I had a system administrator he/she could do it, but I don’t.

Will that slow me down?

Sorry. This, I think, will allocate new memory, i.e. be slow.


Alternatively you can use @view, which does not allocate but references the original memory.
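To make the difference concrete, here is a minimal sketch (not the original loss code) of slicing vs. views in plain Julia:

```julia
# A plain slice copies the data into a new array; @view wraps the same memory.
A = rand(1000, 1000)

slice = A[1:500, 1:500]      # allocates a brand-new 500x500 array
v = @view A[1:500, 1:500]    # no copy; a SubArray referencing A's memory

v[1, 1] = 0.0                # writes through the view into A
A[1, 1] == 0.0               # true: A itself was modified
```

Inside a loss that is called once per gradient step, those slice copies add up, which is why @view (or the @views macro over a whole expression) can help.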



This is the example from the blog, so it seems the parentheses are unnecessary:

function sum_neighborhoods(A, n::Int)
    return [ sum(@view A[i:i+n-1, j:j+n-1]) for i = 1:n:size(A,1), j = 1:n:size(A,2) ]
end

Unfortunately Zygote didn’t like it :frowning:

I have found it to be very temperamental (it doesn’t even allow element-wise assignment!)

Ah, not good. I have no idea how the Zygote magic works. Sorry.

Two more possible ideas:

Compute and store m1(x0) and exp.(mc) only once, depending on how long each takes.

I have never used it, but maybe BetterExp could improve speed?
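The caching idea can be sketched like this (toy stand-ins for the thread’s m1, x0, and mc; note this only helps for quantities that do not change between calls, e.g. fixed data, not trainable parameters mid-training):

```julia
# Toy stand-ins for the thread's names (the real ones aren't shown):
m1 = x -> 2 .* x          # pretend model
x0 = rand(100, 10)        # pretend input batch
mc = rand(10)             # pretend constant vector

# Recomputing everything inside each loss call:
loss_slow(x) = sum(m1(x)) + sum(exp.(mc))

# Hoisting the parts that stay constant across calls:
y_cached = m1(x0)         # computed once
expmc    = exp.(mc)       # computed once
loss_fast() = sum(y_cached) + sum(expmc)
```

Both versions return the same value; the second just avoids redoing work on every call.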

But in any case, evaluating the loss function even on the entire training set takes only 0.3 seconds. Why should one gradient step take 15 seconds?

BTW I would feel better about Flux loading properly. Any idea how I might achieve that? [Barring upgrading CUDA drivers which is too hard for me]
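One way to narrow this down is to time a single gradient step directly. A sketch with a stand-in model of the same shape and a placeholder mse loss (the real xent_loss isn’t shown in the thread):

```julia
using Flux

# Stand-in model with the topic's layer sizes, and a fake Float32 batch:
m1 = Chain(Dense(100, 40, relu), Dense(40, 40, relu), Dense(40, 1))
x0 = rand(Float32, 100, 6000)
y0 = rand(Float32, 1, 6000)

loss(x, y) = Flux.mse(m1(x), y)   # placeholder loss, not the thread's xent_loss

ps = Flux.params(m1)
@time gradient(() -> loss(x0, y0), ps)   # first call includes compilation
@time gradient(() -> loss(x0, y0), ps)   # second call shows the steady-state cost
```

If the second gradient call is much slower than the forward pass alone, the bottleneck is in the backward pass (e.g. allocations in the loss), not in the model itself.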

As the author of BetterExp, don’t use it here (at least not yet). The performance difference between base exp and the version in this package is at most 10x, so this isn’t the main problem. The main problem is almost certainly CPU vs. GPU. A secondary problem is the memory allocation, which should be fixed with @views. Furthermore, with a little bit of luck BetterExp will have its improvements merged into Base by Julia 1.6, at which point the library will be obsolete.


OK - thanks for your help. Maybe I will hire someone to install CUDA.

Have you tried using Float32 instead of Float64?

Have a look at this:
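The idea can be sketched like this (assuming Flux’s f32 helper; Dense layers already default to Float32 weights, so the main risk is feeding them Float64 data, which silently promotes everything to Float64):

```julia
using Flux

# f32 converts a model's parameters to Float32 (a no-op if they already are):
m1 = Flux.f32(Chain(Dense(100, 40, relu), Dense(40, 1)))

x64 = rand(100, 32)    # rand defaults to Float64
x32 = Float32.(x64)    # convert the data once, up front

eltype(m1(x32))        # Float32: the whole forward pass stays in single precision
```

Feeding x64 instead would promote the computation to Float64, roughly doubling memory traffic for no benefit in a typical neural-network training run.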


Hi. No, I haven’t, but I am only using 12% of the available memory on my machine. Would it help in any case?