My Flux Application painfully slow

compleat · October 20, 2020, 8:34pm

My Flux application runs painfully slowly.

My training set has about 250,000 samples with about 100 inputs each, and the model is set up as follows:

Chain(Dense(100,40,relu), Dense(40,40,relu), Dense(40,1,identity))

The mini-batches contain about 6,000 samples and take about 15 seconds each to train using ADAM or RMSProp, which seems painfully slow.

My colleague wrote a PyTorch version which runs nearly 100 times faster on the same machine.

There must be a bottleneck somewhere, but I can’t seem to find it.

I am using a cross-entropy loss function (slightly modified, but compatible with Zygote).
I timed the evaluation of the loss function on the entire dataset, and it took only about 0.3 seconds.

Does anyone have any idea why it should be running so slowly?

Thanks for any hints!

danielw2904 · October 20, 2020, 8:40pm

Could you post some more code so others could test it?
Are you using a GPU?

compleat · October 20, 2020, 8:42pm

No, I’m not using a GPU. I know that when I first loaded Flux, there was an error relating to the GPU, but I tried again and it seemed to work. Maybe it didn’t load properly.

If I restart and type using Flux, I get a complaint about CUDA and NVidia drivers:

┌ Warning: CUDA.jl only supports NVIDIA drivers for CUDA 9.0 or higher (yours is for CUDA 6.5.0)
└ @ CUDA /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:107

InitError: Could not find a suitable CUDA installation
during initialization of module Flux

Stacktrace:
[1] error(::String) at ./error.jl:33
[2] runtime_init() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:110
[3] (::CUDA.var"#609#610"{Bool})() at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:32
[4] lock(::CUDA.var"#609#610"{Bool}, ::ReentrantLock) at ./lock.jl:161
[5] _functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:26
[6] functional(::Bool) at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:19
[7] functional at /home/davide/.julia/packages/CUDA/dZvbp/src/initialization.jl:18 [inlined]
[8] init() at /home/davide/.julia/packages/Flux/05b38/src/Flux.jl:53
[9] _include_from_serialized(::String, ::Array{Any,1}) at ./loading.jl:697
[10] _require_search_from_serialized(::Base.PkgId, ::String) at ./loading.jl:782
[11] _require(::Base.PkgId) at ./loading.jl:1007
[12] require(::Base.PkgId) at ./loading.jl:928
[13] require(::Module

But if I type using Flux again, there is no complaint. I ignored this because I don’t use GPU, but could it have other ramifications? Any idea about how to do a clean install?

danielw2904 · October 20, 2020, 8:53pm

I think it should be fine, but this is a performance problem. If your colleague is using a GPU, that could explain the difference.
Edit: sorry, I didn’t see that it was on the same machine.

It’s kind of a fishing expedition without the code.

compleat · October 20, 2020, 8:56pm

I can clean up the code and include it here in a few minutes, but there is only the model (given above), the loss function (which I have benchmarked) and the call to train! with ADAM or RMSProp

function xent_loss(x0,y0)
    mc=m1(x0).-mean(m1(x0))
    pden=exp.(mc)*y0[2:end,:]
    p=exp.(mc)./pden
    lossc=-sum(y0[1,:]'.*log.(p))
    return lossc
end

for epoch_idx in 1:maxiters
    Flux.train!(xent_loss, Flux.params(m1),tr_data, opt;cb=cb)
    cb()
end

maleadt · October 20, 2020, 8:59pm

You’ll need to upgrade your NVIDIA driver to something more recent.

compleat · October 20, 2020, 9:02pm

OK - thanks. I didn’t think that would matter.

I didn’t think I was using a GPU, but am I?

Maybe PyTorch (in Python) was using a compatible system.

I don’t know how to upgrade on Ubuntu, but I can take some time and figure it out.

Thanks!

danielw2904 · October 20, 2020, 9:04pm

I think the slices in the loss allocate so maybe using views would be faster. Check out the section on allocation here

compleat · October 20, 2020, 9:09pm

Thanks, but I’m sorry, I didn’t follow what you said, and I couldn’t see any reference to ‘views’ in the link.

Also, I checked, and updating the CUDA driver on Ubuntu is very difficult (7 steps involving patches, setting environment variables, etc.). If I had a system administrator, he/she could do it, but I don’t.

Will that slow me down?

danielw2904 · October 20, 2020, 9:12pm

Sorry. I think this will allocate new memory, i.e. be slow:

y0[2:end,:]

Alternatively you can use

@view(y0[2:end,:])

which does not allocate but references the original memory.
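To make the difference concrete, here is a minimal sketch in plain Julia. The small matrix is only a hypothetical stand-in for y0, not the real training labels. It shows that a slice is an independent copy while a view shares memory with its parent:

```julia
# Hypothetical stand-in for the y0 matrix from the loss function (3 rows, 4 columns).
y0 = collect(reshape(1.0:12.0, 3, 4))

s = y0[2:end, :]        # slicing copies the selected rows into a new array
v = @view y0[2:end, :]  # a view wraps the parent array without copying it

y0[2, 1] = 100.0        # mutate the parent after both were created

println(s[1, 1])  # 2.0   (the copy still holds the old value)
println(v[1, 1])  # 100.0 (the view sees the update)
```

Whether this matters for the total run time depends on how large the slice is relative to the rest of the work, so it is worth measuring rather than assuming.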

compleat · October 20, 2020, 9:12pm

Awesome!

danielw2904 · October 20, 2020, 9:14pm

This is the example from the blog, so it seems the parentheses are unnecessary:

function sum_neighborhoods(A, n::Int)
    return [ sum(@view A[i:i+n-1, j:j+n-1]) for i = 1:n:size(A,1), j = 1:n:size(A,2) ]
end

compleat · October 20, 2020, 9:17pm

Unfortunately, Zygote didn’t like it.

I have found it to be very temperamental (it doesn’t even allow element-wise operations!)

danielw2904 · October 20, 2020, 9:18pm

Ah not good. I have no idea how the Zygote magic works. Sorry.

danielw2904 · October 20, 2020, 9:21pm

Two more possible ideas:

Compute and store m1(x0) and exp.(mc) only once depending on how long that takes.

I have never used it, but maybe

https://github.com/oscardssmith/BetterExp.jl

Could improve speed?
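The first idea could look roughly like the sketch below. It keeps the arithmetic of the posted xent_loss but calls the model once and reuses exp.(mc). The name xent_loss_once and the tiny linear m1 are hypothetical stand-ins so the snippet runs without Flux; the real m1 would be the Chain from the first post.

```julia
using Statistics: mean

# Hypothetical stand-in for the Flux model m1: maps a (features × samples)
# matrix to a (1 × samples) matrix, like the final Dense(40, 1) layer would.
W = [0.1 -0.2 0.3]
m1(x) = W * x

function xent_loss_once(x0, y0)
    out = m1(x0)            # single forward pass, reused below
    mc = out .- mean(out)
    e = exp.(mc)            # exponentials computed once, reused below
    pden = e * y0[2:end, :]
    p = e ./ pden
    return -sum(y0[1, :]' .* log.(p))
end

# Toy data with the shapes the loss expects: 3 samples, y0 is (3 + 1) × 3.
x0 = [1.0 2.0 3.0; 0.5 0.1 0.2; 0.3 0.4 0.9]
y0 = [1.0 0.0 1.0; 1.0 0.0 0.0; 1.0 1.0 0.0; 1.0 1.0 1.0]
println(xent_loss_once(x0, y0))
```

How much this saves depends on what share of the time is spent in the forward pass and in exp, so treat it as something to benchmark, not a guaranteed fix.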

compleat · October 20, 2020, 9:24pm

But in any case, the whole evaluation of the loss function even on the entire training set takes only 0.3 seconds. Why should one gradient step take 15 seconds?

BTW I would feel better about Flux loading properly. Any idea how I might achieve that? [Barring upgrading CUDA drivers which is too hard for me]

Oscar_Smith · October 20, 2020, 9:29pm

As the author of BetterExp, don’t use it here (at least not yet). The performance difference between base exp and the version in this package is only a maximum of 10x, so this isn’t the main problem. The main problem is almost certainly CPU vs GPU. A secondary problem is the memory allocation, which should be fixed with @views. Furthermore, with a little bit of luck BetterExp will have its improvements merged into Base by 1.6, at which point the library will be obsolete.

compleat · October 20, 2020, 9:32pm

OK - thanks for your help. Maybe I will hire someone to install CUDA.

cojua8 · October 20, 2020, 10:16pm

Have you tried using Float32 instead of Float64?

Have a look at this: Performance Tips · Flux
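In case it helps, converting the data is a one-liner. This is only a sketch with a made-up input matrix (x64 is hypothetical, not the real training set). As far as I know, Flux's Dense layers create Float32 weights by default, so Float64 inputs end up mixing the two types:

```julia
# Hypothetical training inputs: 100 features × 8 samples, Float64 by default.
x64 = rand(100, 8)

# Elementwise conversion to Float32; the shape is unchanged.
x32 = Float32.(x64)

println(eltype(x64))                 # Float64
println(eltype(x32))                 # Float32
println(sizeof(x32) / sizeof(x64))   # 0.5
```

The same conversion would apply to the labels, so that everything passed to the loss has the same element type as the model parameters.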

compleat · October 20, 2020, 10:17pm

Hi. No, I haven’t but I am only using 12% of available memory on my machine. Would it help in any case?
