Flux and CPU cores

When I run the following code

using Flux
n = 100_000
p = 50
x = rand(Float32, p, n)
y = rand(Float32, n)    
trdata = Flux.Data.DataLoader(x, y, batchsize=100)
m = Chain(Dense(p, 100), Dense(100,100), Dense(100,1))
loss(x, y) = Flux.mse(m(x), y)
@time Flux.@epochs 10 Flux.train!(loss, Flux.params(m), trdata, Flux.ADAM())

on my laptop, all cores/threads run at 100%, which is somewhat surprising to me. Is this expected? Threads.nthreads() returns 1. I use Julia 1.4.1 with Flux 0.10.4 on Ubuntu 18 with an Intel® Core™ i7-6600U CPU (2 cores, 2 threads per core).

Suppose I want to train my model multiple times with different random initial weights. What would be the recommended way to do this?

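That is because OpenBLAS is multi-threaded. The matrix multiplications in the Dense layers are dispatched to BLAS, and OpenBLAS runs its own thread pool, which is independent of Threads.nthreads().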
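To see the two thread counts side by side, something like the following minimal sketch could be used. It assumes Julia 1.6 or later, where BLAS.get_num_threads() is available; it is not available on the Julia 1.4.1 mentioned in the question.

```julia
using LinearAlgebra

# Number of Julia threads (controlled by JULIA_NUM_THREADS or `julia -t`).
julia_threads = Threads.nthreads()

# Number of threads used by the BLAS library (requires Julia 1.6 or later).
blas_threads = BLAS.get_num_threads()

println("Julia threads: ", julia_threads)
println("BLAS threads:  ", blas_threads)
```

The two numbers are set independently, so Threads.nthreads() returning 1 does not mean BLAS is restricted to one core.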


Ok, thanks. Does this imply that I should avoid Threads.@threads for ... end in order to train several nets simultaneously?

I think that if you use multi-threading, it would automatically set the number of threads for OpenBLAS to one. But in your case, it can still be a win.

You have to do this manually, by calling BLAS.set_num_threads(1) from LinearAlgebra. (When comparing the charts, note the different scales on the x-axis!)
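If the single-thread setting should apply only to one part of a script, one option is to wrap it in a small helper that restores the previous value afterwards. The sketch below is only an illustration: with_blas_threads is a hypothetical name, not part of LinearAlgebra or Flux, and it assumes Julia 1.6 or later for BLAS.get_num_threads().

```julia
using LinearAlgebra

# Hypothetical helper: run `f` with `n` BLAS threads, then restore the
# previous BLAS thread count, even if `f` throws.
function with_blas_threads(f, n::Integer)
    old = BLAS.get_num_threads()   # requires Julia 1.6 or later
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)
    end
end

a = rand(Float32, 200, 200)
before = BLAS.get_num_threads()

# The matrix product inside the do-block runs with a single BLAS thread.
result = with_blas_threads(1) do
    a * a
end

after = BLAS.get_num_threads()
```

Note that the BLAS thread count is a process-wide setting, so a helper like this is best called once around the whole threaded loop rather than from inside each thread.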


Thank you for the comments. I will make a basic comparison in a few days.

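I ran a few tests, which confirm that BLAS.set_num_threads(1) should be called when training with Threads.@threads. On my system, the threaded code ran ~10 times faster with this setting (1.076 s vs. 10.864 s) and ~6 times faster than the serial loop (6.286 s). Code and results are below.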

# start Julia with JULIA_NUM_THREADS=1 julia
using Flux
using BenchmarkTools
using LinearAlgebra
n = 100_000
p = 50
x = rand(Float32, p, n)
y = rand(Float32, n)    
trdata = Flux.Data.DataLoader(x, y, batchsize=100)
m = [Chain(Dense(p, 100), Dense(100,100), Dense(100,1)) for i in 1:4]
@btime for i in 1:4
    loss(x, y) = Flux.mse(m[i](x), y)
    Flux.@epochs 1 Flux.train!(loss, Flux.params(m[i]), trdata, Flux.ADAM())
end
#  6.286 s (1992500 allocations: 2.24 GiB)

# start Julia with JULIA_NUM_THREADS=4 julia
using Flux
using BenchmarkTools
using LinearAlgebra
n = 100_000
p = 50
x = rand(Float32, p, n)
y = rand(Float32, n)    
trdata = Flux.Data.DataLoader(x, y, batchsize=100)
m = [Chain(Dense(p, 100), Dense(100,100), Dense(100,1)) for i in 1:4]
@btime Threads.@threads for i in 1:4
    loss(x, y) = Flux.mse(m[i](x), y)
    Flux.@epochs 1 Flux.train!(loss, Flux.params(m[i]), trdata, Flux.ADAM())
end
#  10.864 s (1992523 allocations: 2.24 GiB)  

# start Julia with JULIA_NUM_THREADS=4 julia
using Flux
using BenchmarkTools
using LinearAlgebra
BLAS.set_num_threads(1)
n = 100_000
p = 50
x = rand(Float32, p, n)
y = rand(Float32, n)    
trdata = Flux.Data.DataLoader(x, y, batchsize=100)
m = [Chain(Dense(p, 100), Dense(100,100), Dense(100,1)) for i in 1:4]
@btime Threads.@threads for i in 1:4
    loss(x, y) = Flux.mse(m[i](x), y)
    Flux.@epochs 1 Flux.train!(loss, Flux.params(m[i]), trdata, Flux.ADAM())
end
#  1.076 s (1992515 allocations: 2.24 GiB)
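For the original question about several runs with different random initial weights, the same pattern can be sketched without Flux. In the example below, fit_from_seed is a hypothetical stand-in for training one model (plain gradient descent on a least-squares problem); the seeds, step count, and learning rate are arbitrary assumptions chosen for illustration.

```julia
using LinearAlgebra, Random

# One BLAS thread per Julia thread, to avoid oversubscribing the cores.
BLAS.set_num_threads(1)

# Hypothetical stand-in for "train one model": gradient descent on a
# least-squares problem, starting from seed-dependent random weights.
function fit_from_seed(x, y, seed; steps = 200, lr = 0.1f0)
    rng = MersenneTwister(seed)
    w = randn(rng, Float32, size(x, 1))   # random initial weights
    n = size(x, 2)
    for _ in 1:steps
        r = x' * w .- y                   # residuals, one per observation
        w .-= lr .* (x * r) ./ n          # gradient step
    end
    return sum(abs2, x' * w .- y) / n     # final mean squared error
end

x = rand(Float32, 50, 10_000)
y = rand(Float32, 10_000)
seeds = 1:4

# Each iteration writes only to its own slot, so no locking is needed.
final_loss = Vector{Float32}(undef, length(seeds))
Threads.@threads for i in eachindex(seeds)
    final_loss[i] = fit_from_seed(x, y, seeds[i])
end
```

Giving each run its own seeded random number generator keeps the runs reproducible regardless of how the iterations are scheduled across threads.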