I’m running the program below on a 32-CPU/64-thread system with essentially nothing else running on it. If I use anything beyond 16 cores, the execution time of the second run stays effectively flat, i.e. adding more workers buys no further speedup. This is true both when I use a precompiled system image and when I don’t (though somewhat more so when using a precompiled system image, for reasons I don’t understand). What am I missing?
using Distributed
@everywhere using Optim, LinearAlgebra
@everywhere const R = 8000
@everywhere const d = 40

@everywhere function once(x::Int64)
    for r = 1:((x < 0) ? 1 : R)
        function Ω(θ::Vector{Float64})::Float64
            dot(θ, θ) * 0.5
        end
        function dΩ!(g::Vector{Float64}, θ::Vector{Float64})
            g[:] = θ
        end
        function ddΩ!(H::Matrix{Float64}, θ::Vector{Float64})
            H[:, :] .= 0.0
            for i = 1:d
                H[i, i] = 1.0
            end
        end
        Optim.optimize(Ω::Function, dΩ!::Function, ddΩ!::Function, ones(Float64, d), NewtonTrustRegion())
    end
end

function doit()
    @time pmap(once, -64:-1)
    @time pmap(once, 1:192)
end

doit()
Thank you. This is indeed related, though the cause is different.
If I run things in parallel and OpenBLAS also runs multithreaded inside each of the worker processes, then I’m effectively using many more threads than I indicated. Perhaps I should compile a separate OpenBLAS from source with a maximum of one thread, so it behaves the way I had intended.
And curiously, my OpenBLAS seems to max out at 8 threads. Hmmm.
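As an alternative to recompiling, OpenBLAS can be capped at one thread per process at runtime. A minimal sketch, assuming the workers are already started and Julia’s bundled OpenBLAS is in use:

using Distributed
@everywhere using LinearAlgebra
# Pin OpenBLAS to a single thread on the master and on every worker,
# so the pmap processes don't oversubscribe the machine's cores.
@everywhere LinearAlgebra.BLAS.set_num_threads(1)

Setting the environment variable OPENBLAS_NUM_THREADS=1 before launching Julia should have the same effect.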
This is a bit of speculation, but it could be that threaded BLAS is simply not enabled on the workers. See the addprocs keyword argument enable_threaded_blas, which defaults to false; documentation here: Distributed Computing · The Julia Language.
But because you don’t explicitly call addprocs, I’m not sure what the default behavior is in your case.
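For reference, this is how the keyword is passed when workers are added explicitly; a sketch only, since the worker count here is arbitrary and whether you actually want threaded BLAS depends on your setup:

using Distributed
# enable_threaded_blas defaults to false, i.e. workers created by
# addprocs run OpenBLAS single-threaded; pass true to allow threading.
addprocs(16; enable_threaded_blas = false)

@everywhere using LinearAlgebra
# On Julia 1.6+ you can check what each process actually ended up with:
@everywhere @show LinearAlgebra.BLAS.get_num_threads()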