Hey all, here’s an update with some more conclusive information. In short, I think a lot of what I had attributed to inefficient multithreading was actually Julia’s fancy new partr scheduler not playing nicely with OpenBLAS’s own threading. At least, I think I’ve seen some GitHub discussion about how the composable multithreading’s interaction with BLAS is not totally worked out yet.
Setting BLAS.set_num_threads(1) per @elrod’s suggestion and using two threads for ThreadPools.jl, I get good speedups, and my ThreadPools.jl log looks like this:
If I don’t do that, it looks like this:
So it does seem like there is plenty of waiting going on. @tro3, thanks for the excellent logging functionality. I never would have expected it to be so easy to diagnose this issue.
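For reference, the only thing that changes between the two runs below is how OpenBLAS is configured. Here is a minimal sketch of that setup (the Julia thread count itself has to be set with JULIA_NUM_THREADS before starting Julia):

using LinearAlgebra
BLAS.set_num_threads(1)   # keep OpenBLAS single-threaded so its pool doesn't fight Julia's scheduler
@show Threads.nthreads()  # the Julia-side thread count, set via JULIA_NUM_THREADS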
In the process of exploring the most efficient way to build matrices with expensive functions, I ended up putting together kind of a shootout between many of the convenient thread-parallel options. I also added a couple of new expensive functions: one that just sleeps for a moment and then returns 1.0, and another that uses a const global object that has been made a subtype of Function, which is a very common code pattern for me. In this code, I make it a numerical integrator using FastGaussQuadrature. Here is the code for the full study:
using ThreadPools, BenchmarkTools, Random, Transducers, ThreadsX
using FastGaussQuadrature, StaticArrays, LinearAlgebra

# A callable (functor) type that performs Gauss-Legendre quadrature.
struct Integrator{N} <: Function
    nodes::SVector{N, Float64}
    weights::SVector{N, Float64}
    function Integrator{N}() where{N}
        no, wt = FastGaussQuadrature.gausslegendre(N)
        new(SVector{N, Float64}(no), SVector{N, Float64}(wt))
    end
end

function (K::Integrator{N})((j,k)) where{N}
    fun(x) = 1.0 + (j/k)*x - (0.5*j/k)*x^2 + (k/j)*x^3
    for l in 1:15 # waste time to make this function more expensive.
        sum(x->l*x[1]*fun(x[2]), zip(K.weights, K.nodes))
    end
    sum(x->x[1]*fun(x[2]), zip(K.weights, K.nodes))
end

#BLAS.set_num_threads(1)

# The three expensive kernels: dense linear algebra, sleeping, and quadrature.
function kernel_eval_logdet((j,k))
    A = randn(MersenneTwister(j*k), 100, 100)
    M = A'A
    logdet(M)
end

function kernel_eval_sleep((j,k))
    sleep(0.000001)
    1.0
end

const kernel_eval_quad = Integrator{1024}()

# The different ways of building the 256x256 matrix.
function serial_comp_build(fn)
    out = [fn((j,k)) for j in 1:256, k in 1:256]
    nothing
end

function tmap_build(fn)
    out = ThreadPools.tmap(fn, Iterators.product(1:256, 1:256))
    nothing
end

function bmap_build(fn)
    out = ThreadPools.bmap(fn, Iterators.product(1:256, 1:256))
    nothing
end

function transducers_build(fn)
    out = tcollect(Map(x->fn((x[1], x[2]))),
                   collect(Iterators.product(1:256, 1:256)))
    nothing
end

function threadsx_build(fn)
    ThreadsX.map(fn, collect(Iterators.product(1:256, 1:256)))
    nothing
end

function manual_build_loop(fn)
    out = zeros(256, 256)
    chunks = collect(Iterators.partition(CartesianIndices((256, 256)),
                                         Threads.nthreads()))
    Threads.@threads for chunk in chunks
        @simd for ix in chunk
            @inbounds out[CartesianIndex(ix)] = fn((ix[1], ix[2]))
        end
    end
    nothing
end

function manual_build_procs(fn)
    getcol(k) = [fn((j,k)) for j in 1:256]
    cols = [Threads.@spawn getcol(k) for k in 1:256]
    out = reduce(hcat, ThreadPools.tmap(fetch, cols))
    nothing
end

# Shootout:
const buildnames = (:serial_comp,
                    :threadpools_tmap, :threadpools_bmap,
                    :transducers_tcollect, :threadsx_map,
                    :manual_forloop, :manual_procvec)
const kernelnames = (:linalg, :sleep, :quadrature)
kernel_info = zip((kernel_eval_logdet, kernel_eval_sleep, kernel_eval_quad),
                  kernelnames)
build_info = zip((serial_comp_build,
                  tmap_build, bmap_build,
                  transducers_build,
                  threadsx_build,
                  manual_build_loop, manual_build_procs),
                 buildnames)

for ((kernelfn, kname), (buildfn, bname)) in Iterators.product(kernel_info, build_info)
    print("$bname/$kname:")
    @btime $buildfn($kernelfn)
end
Running on a work computer with a much better CPU that has 4 physical cores (or threads?) and asking Julia to use 6 (hyper?) threads (as in, JULIA_NUM_THREADS=6 julia -O3 ...), here are the results with BLAS.set_num_threads(1):
serial_comp/linalg: 10.142 s (983042 allocations: 15.94 GiB)
serial_comp/sleep: 83.117 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature: 10.777 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg: 3.243 s (983082 allocations: 15.94 GiB)
threadpools_tmap/sleep: 16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature: 2.909 s (46 allocations: 1.54 MiB)
threadpools_bmap/linalg: 3.734 s (983080 allocations: 15.94 GiB)
threadpools_bmap/sleep: 21.845 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature: 3.779 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg: 3.113 s (983303 allocations: 15.94 GiB)
transducers_tcollect/sleep: 21.587 s (281433 allocations: 11.39 MiB)
transducers_tcollect/quadrature: 2.902 s (252 allocations: 3.04 MiB)
threadsx_map/linalg: 3.535 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep: 2.246 s (274187 allocations: 13.04 MiB)
threadsx_map/quadrature: 2.899 s (2270 allocations: 4.68 MiB)
manual_forloop/linalg: 4.528 s (983065 allocations: 15.94 GiB)
manual_forloop/sleep: 16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature: 2.905 s (25 allocations: 1.77 MiB)
manual_procvec/linalg: 3.591 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep: 293.469 ms (265561 allocations: 72.80 MiB)
manual_procvec/quadrature: 2.919 s (2331 allocations: 69.02 MiB)
Without setting BLAS.set_num_threads(1), here is the output:
serial_comp/linalg: 11.161 s (983042 allocations: 15.94 GiB)
serial_comp/sleep: 84.323 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature: 10.775 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg: 7.241 s (983084 allocations: 15.94 GiB)
threadpools_tmap/sleep: 16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature: 2.909 s (45 allocations: 1.54 MiB)
threadpools_bmap/linalg: 7.842 s (983081 allocations: 15.94 GiB)
threadpools_bmap/sleep: 21.846 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature: 3.777 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg: 7.362 s (983305 allocations: 15.94 GiB)
transducers_tcollect/sleep: 17.340 s (314308 allocations: 12.39 MiB)
transducers_tcollect/quadrature: 2.910 s (252 allocations: 3.04 MiB)
threadsx_map/linalg: 7.375 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep: 2.211 s (274133 allocations: 13.03 MiB)
threadsx_map/quadrature: 2.910 s (2272 allocations: 4.68 MiB)
manual_forloop/linalg: 7.222 s (983067 allocations: 15.94 GiB)
manual_forloop/sleep: 16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature: 2.910 s (27 allocations: 1.77 MiB)
manual_procvec/linalg: 7.349 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep: 291.663 ms (265562 allocations: 72.78 MiB)
manual_procvec/quadrature: 2.888 s (2332 allocations: 69.02 MiB)
In general, I would say that the takeaway is that with the BLAS threading fixed, the speedups for all three expensive kernels are very satisfying: the linalg kernel, for example, drops from about 10.1 s serially to about 3.2 s with ThreadPools.tmap on this 4-core machine. And the manual_build_procs function being competitive is really a testament to the composable multithreading that Julia offers. So that’s pretty cool.
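Just to spell out what I mean by composable: tasks can spawn more tasks (or call into other threaded code), and Julia’s scheduler interleaves all of it on the same set of threads. Here is a toy sketch of that, with a made-up nested_sum function that is not part of the benchmark above:

function nested_sum(depth)
    depth == 0 && return 1.0
    t1 = Threads.@spawn nested_sum(depth - 1)  # child tasks can spawn their own children
    t2 = Threads.@spawn nested_sum(depth - 1)
    return fetch(t1) + fetch(t2)
end

nested_sum(10)  # 2^10 leaf tasks, all scheduled cooperatively on the existing threads

That is essentially what manual_build_procs does, just with @spawn-per-column mixed with ThreadPools.tmap(fetch, cols), and the scheduler keeps every thread busy anyway.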