Parallel computing using Optim

Hello,

I’m running the program below on a 32-core/64-thread system with essentially nothing else running on it. If I use more than 16 cores, the execution time of the second run is effectively flat. This is true both when I use a precompiled system image and when I don’t (though a bit more so when using a precompiled system image, for reasons I don’t understand). What am I missing?

using Distributed 
@everywhere using Optim, LinearAlgebra

@everywhere const R = 8000
@everywhere const d = 40

@everywhere function once(x::Int64)
    # negative x => a single iteration (used as a warm-up/compilation pass)
    for r in 1:((x < 0) ? 1 : R)
        # objective: Ω(θ) = ‖θ‖² / 2
        function Ω(θ::Vector{Float64})::Float64
            dot(θ, θ) * 0.5
        end
        # gradient: ∇Ω(θ) = θ
        function dΩ!(g::Vector{Float64}, θ::Vector{Float64})
            g[:] = θ
        end
        # Hessian: ∇²Ω(θ) = I
        function ddΩ!(H::Matrix{Float64}, θ::Vector{Float64})
            H[:, :] .= 0.0
            for i = 1:d
                H[i, i] = 1.0
            end
        end
        Optim.optimize(Ω, dΩ!, ddΩ!, ones(Float64, d), NewtonTrustRegion())
    end
end

function doit()
    @time pmap(once, -64:-1)   # warm-up: one iteration per call, forces compilation on all workers
    @time pmap(once, 1:192)    # timed run
end

doit()

Could this be because dot calls BLAS, and Julia is built with OpenBLAS limited to 16 threads? See https://github.com/JuliaLang/julia/blob/master/deps/blas.mk.

If this is the case, then you could try writing the dot call as an explicit loop, using MKL.jl, or building OpenBLAS with more threads.
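
For the explicit loop, something like this should work (untested, and the name sqnorm is just one I made up for illustration):

@everywhere function sqnorm(θ::Vector{Float64})::Float64
    # explicit loop replacing dot(θ, θ), so no BLAS call is involved
    s = 0.0
    @inbounds @simd for i in eachindex(θ)
        s += θ[i] * θ[i]
    end
    s
end

Then Ω would return sqnorm(θ) * 0.5 instead of dot(θ, θ) * 0.5.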

Thank you. This is indeed related, though the cause is different.

If I run things in parallel, and OpenBLAS also runs multiple threads inside each of my worker processes, then I’m effectively using many more threads than I had indicated. Perhaps I should build a separate OpenBLAS from source with a maximum of one thread, so it behaves the way I intended.

And curiously, my OpenBLAS seems to max out at 8 threads. Hmmm.
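
In case it’s useful, on recent Julia versions the current limit can be queried at run time, which is one way to confirm the max-of-8 observation:

julia> using LinearAlgebra

julia> BLAS.get_num_threads()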

This is a bit of speculation, but it could be that threaded BLAS is simply not enabled: see the default keyword argument enable_threaded_blas = false for addprocs, documented here: Distributed Computing · The Julia Language

But since you don’t explicitly call addprocs, I’m not sure what the default behavior is.
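
Being explicit would remove the guesswork. A sketch (the worker count of 16 is just an example):

using Distributed
# explicitly start workers; enable_threaded_blas = false (the documented default)
# keeps BLAS single-threaded on each worker and avoids oversubscription
addprocs(16; enable_threaded_blas = false)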

You can control the number of threads used by BLAS at run-time with the BLAS.set_num_threads function from LinearAlgebra:

julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.set_num_threads(1)
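
Note that this only affects the process it runs in; in your pmap setup you would presumably want to run it on every worker, e.g.:

julia> @everywhere using LinearAlgebra

julia> @everywhere LinearAlgebra.BLAS.set_num_threads(1)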

Thanks! I didn’t know that was possible.