Why doesn't multithreading help here?

Hey all, here's an update with some more conclusive information. In short, I think a lot of what I attributed to inefficient multithreading was actually Julia's fancy new composable scheduler (partr) not playing nicely with OpenBLAS's own internal threading. At least, I think I've seen some GitHub discussion about how the interaction between the composable multithreading and BLAS is not totally worked out yet.
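For reference, the whole workaround is tiny. This is the setup, with JULIA_NUM_THREADS set before launching Julia (the last two lines are just there to confirm what the session sees):

using LinearAlgebra
BLAS.set_num_threads(1)  # hand all of the parallelism to Julia's scheduler
Threads.nthreads()       # number of Julia threads this session was started with
Sys.CPU_THREADS          # logical CPUs visible on the machine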

Setting BLAS.set_num_threads(1) per @elrod’s suggestion and using two threads for ThreadPools.jl, I get good speedups, and my ThreadPools.jl log looks like this:
[image: poolplot_blas1]

If I don’t do that, it looks like this:
[image: poolplot_noblas1]
So it does seem like there was plenty of waiting going on. @tro3, thanks for the excellent logging functionality; I never would have expected it to be so easy to diagnose this issue.

In the process of exploring the most efficient way to build matrices from expensive functions, I ended up putting together a bit of a shootout between several convenient thread-parallel constructs. I also added a couple of new expensive kernels: one that just sleeps for a moment and then returns 1.0, and another that uses a const global object that has been made a subtype of Function, which is a very common code pattern for me. In this code, I make it a numerical integrator using FastGaussQuadrature. Here is the code for the full study:

using ThreadPools, BenchmarkTools, Random, Transducers, ThreadsX
using FastGaussQuadrature, StaticArrays
using LinearAlgebra  # for logdet and BLAS.set_num_threads

# A numerical integrator as a callable struct (functor) with statically
# sized Gauss-Legendre nodes and weights.
struct Integrator{N} <: Function
  nodes::SVector{N, Float64}
  weights::SVector{N, Float64}
  function Integrator{N}() where {N}
    no, wt = FastGaussQuadrature.gausslegendre(N)
    new(SVector{N, Float64}(no), SVector{N, Float64}(wt))
  end
end

function (K::Integrator{N})((j,k)) where {N}
  fun(x) = 1.0 + (j/k)*x - (0.5*j/k)*x^2 + (k/j)*x^3
  for l in 1:15  # waste time to make this function more expensive
    sum(x -> l*x[1]*fun(x[2]), zip(K.weights, K.nodes))
  end
  sum(x -> x[1]*fun(x[2]), zip(K.weights, K.nodes))
end
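# Quick sanity check (not part of the timings): Gauss-Legendre with N >= 2
# integrates this cubic exactly, and the analytic integral over [-1, 1]
# is 2 - j/(3k).
@assert Integrator{64}()((3, 7)) ≈ 2 - 3/(3*7)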

# Uncomment for the BLAS.set_num_threads(1) runs below:
#BLAS.set_num_threads(1)
function kernel_eval_logdet((j,k))
  # BLAS-heavy kernel: build a 100x100 Gram matrix, then take its logdet.
  A = randn(MersenneTwister(j*k), 100, 100)
  M = A'A
  logdet(M)
end

function kernel_eval_sleep((j,k))
  # yield-heavy kernel; note that sleep's minimum resolution is about 1 ms,
  # so this actually blocks far longer than the nominal 1 microsecond
  sleep(0.000001)
  1.0
end

const kernel_eval_quad = Integrator{1024}()  # const global functor instance

function serial_comp_build(fn)
  out = [fn((j,k)) for j in 1:256, k in 1:256]
  nothing
end

function tmap_build(fn)
  out = ThreadPools.tmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function bmap_build(fn)
  out = ThreadPools.bmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function transducers_build(fn)
  # Map(fn) alone would also work, since product already yields (j,k) tuples
  out = tcollect(Map(x -> fn((x[1], x[2]))),
                 collect(Iterators.product(1:256, 1:256)))
  nothing
end

function threadsx_build(fn)
  ThreadsX.map(fn, collect(Iterators.product(1:256,1:256)))
  nothing
end

function manual_build_loop(fn)
  out = zeros(256, 256)
  chunks = collect(Iterators.partition(CartesianIndices((256, 256)),
                                       Threads.nthreads()))
  Threads.@threads for chunk in chunks
    @simd for ix in chunk
      @inbounds out[ix] = fn((ix[1], ix[2]))  # ix is already a CartesianIndex
    end
  end
  nothing
end
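# (Aside: the partition above creates chunks of length nthreads(), i.e. many
# small chunks rather than one block per thread. A one-chunk-per-thread
# variant would be:
#   chunks = collect(Iterators.partition(CartesianIndices((256, 256)),
#                                        cld(256*256, Threads.nthreads())))
# The loop body stays the same either way.)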

function manual_build_procs(fn)
  # spawn one task per column, then fetch them all
  getcol(k) = [fn((j,k)) for j in 1:256]
  cols = [Threads.@spawn getcol(k) for k in 1:256]
  out  = reduce(hcat, ThreadPools.tmap(fetch, cols))
  nothing
end

# Shootout:
const buildnames  = (:serial_comp,
                     :threadpools_tmap, :threadpools_bmap,
                     :transducers_tcollect, :threadsx_map, 
                     :manual_forloop, :manual_procvec)
const kernelnames = (:linalg, :sleep, :quadrature)

kernel_info = zip((kernel_eval_logdet, kernel_eval_sleep, kernel_eval_quad), 
                  kernelnames)
build_info  = zip((serial_comp_build, 
                   tmap_build, bmap_build,
                   transducers_build, 
                   threadsx_build,
                   manual_build_loop, manual_build_procs),
                  buildnames)

for ((kernelfn, kname), (buildfn, bname)) in Iterators.product(kernel_info, build_info)
  print("$bname/$kname:")
  @btime $buildfn($kernelfn)  # $-interpolation keeps globals out of the timing
end
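To spot-check a single cell of the table without re-running the whole sweep, any build/kernel pair from above can be timed on its own:

@btime tmap_build($kernel_eval_quad)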

Running on a work computer with a much better CPU (4 physical cores, if I'm reading the spec right) and asking Julia for 6 threads (as in, JULIA_NUM_THREADS=6 julia -O3 ...), here is the result with BLAS.set_num_threads(1):

serial_comp/linalg:  10.142 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  83.117 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.777 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  3.243 s (983082 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (46 allocations: 1.54 MiB)
threadpools_bmap/linalg:  3.734 s (983080 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.845 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.779 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  3.113 s (983303 allocations: 15.94 GiB)
transducers_tcollect/sleep:  21.587 s (281433 allocations: 11.39 MiB)
transducers_tcollect/quadrature:  2.902 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  3.535 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.246 s (274187 allocations: 13.04 MiB)
threadsx_map/quadrature:  2.899 s (2270 allocations: 4.68 MiB)
manual_forloop/linalg:  4.528 s (983065 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.905 s (25 allocations: 1.77 MiB)
manual_procvec/linalg:  3.591 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  293.469 ms (265561 allocations: 72.80 MiB)
manual_procvec/quadrature:  2.919 s (2331 allocations: 69.02 MiB)

Without setting BLAS.set_num_threads(1), here is the output:

serial_comp/linalg:  11.161 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  84.323 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.775 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  7.241 s (983084 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (45 allocations: 1.54 MiB)
threadpools_bmap/linalg:  7.842 s (983081 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.846 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.777 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  7.362 s (983305 allocations: 15.94 GiB)
transducers_tcollect/sleep:  17.340 s (314308 allocations: 12.39 MiB)
transducers_tcollect/quadrature:  2.910 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  7.375 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.211 s (274133 allocations: 13.03 MiB)
threadsx_map/quadrature:  2.910 s (2272 allocations: 4.68 MiB)
manual_forloop/linalg:  7.222 s (983067 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.910 s (27 allocations: 1.77 MiB)
manual_procvec/linalg:  7.349 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  291.663 ms (265562 allocations: 72.78 MiB)
manual_procvec/quadrature:  2.888 s (2332 allocations: 69.02 MiB)

In general, I would say the takeaway is that with the BLAS threading fixed, the speedup for all three expensive kernels is very satisfying. And manual_build_procs being competitive is a real testament to the composable multithreading that Julia offers. So that's pretty cool.
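To make the composability point concrete, here is a toy sketch (the function names are made up, not from the study above): tasks spawned inside other tasks all land on the same depth-first scheduler, so nesting parallel code doesn't oversubscribe the machine.

function inner_sum(k)
  # each outer task spawns a few inner tasks of its own
  ts = [Threads.@spawn sqrt(j * k) for j in 1:4]
  sum(fetch, ts)
end

function outer_sum(n)
  ts = [Threads.@spawn inner_sum(k) for k in 1:n]
  sum(fetch, ts)
end

outer_sum(256)  # 256 outer tasks, each spawning 4 inner tasks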
