Why doesn't multithreading help here?

Hello—

A very common computational issue for me is that I want to assemble kernel matrices with pretty expensive kernel functions. I have tried various ways to do this efficiently in parallel, but I can’t seem to write the assembly code in a way that actually gives me a speedup at all, let alone what one might expect from a job that at least intuitively is very well suited to parallelization. Consider this MWE that creates a very silly kernel function and tries to assemble a matrix with it, which I ran with
JULIA_NUM_THREADS=3 julia -q -O3 test.jl:

using ThreadPools, BenchmarkTools, Random, LinearAlgebra # LinearAlgebra provides logdet

function kernel_eval((j,k))
  A = randn(MersenneTwister(j*k), 100,100)
  M = A'A
  logdet(M)
end

function serial_build()
  out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
  nothing
end

function tmap_build()
  out = ThreadPools.tmap(x->kernel_eval(x), Iterators.product(1:256, 1:256))
  nothing
end

function bmap_build()
  out = ThreadPools.bmap(x->kernel_eval(x), Iterators.product(1:256, 1:256))
  nothing
end

function tfor_build()
  out = zeros(256,256)
  ThreadPools.tforeach(x->setindex!(out, kernel_eval(x), CartesianIndex(x)),
          Iterators.product(1:256, 1:256))
  nothing
end

function bfor_build()
  out = zeros(256,256)
  ThreadPools.bforeach(x->setindex!(out, kernel_eval(x), CartesianIndex(x)),
          Iterators.product(1:256, 1:256))
  nothing
end

println("Serial:")
@btime serial_build() # 16.460 s

println("tmap:")
@btime tmap_build() # 38.482 s

println("bmap:")
@btime bmap_build() # 17.216 s

println("tfor:")
@btime tfor_build() # 40.416 s

println("bfor:")
@btime bfor_build() # 17.167 s

Here are the specs for my machine, with Julia installed from the Fedora copr repo:

Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
      "Fedora release 32 (Thirty Two)"
  uname: Linux 5.7.15-200.fc32.x86_64 #1 SMP Tue Aug 11 16:36:14 UTC 2020 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  1618 MHz     404410 s       3579 s     106187 s    2415131 s      11155 s
       #2  1716 MHz     400420 s       4600 s      93772 s    2430624 s      13429 s
       
  Memory: 7.653873443603516 GB (5698.8828125 MB free)
  Uptime: 29632.0 sec
  Load Avg:  0.46875  0.48779296875  0.85400390625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

As you can see, when I run this, the bmap and bforeach versions are a tiny bit slower than the serial, and the tmap and tforeach versions take more than twice as long as the serial version (!). I wouldn’t think that there needs to be much communication, and the computation takes long enough that I would think other slight overhead costs for parallelism really shouldn’t factor in here. I figure there must be some kind of communication that is preventing the benefit of having several threads. But I’m not sure what it would be. Can somebody help me understand why I’m not getting the speedup that I would think I should be getting?

How fast is your computer? 256 by 256 is not 16s for me.


Hey @xiaodai, thanks for the note. My computer is quite slow, and I have hyperthreading disabled, and I use tlp in power-saving settings. So this machine is very much not configured for performance. But should that affect the interpretation of the issue? If you are willing, would you run that script and share all the times? Are the thread-parallel ones at all faster for you?

64 by 64 is more like 16s for me. I am on Windows though and you may have a better matrix lib.

I am not really sure what's happening, but I am not sure if MersenneTwister is thread-friendly. I really don't know.

I also don't get any speedup after turning off MersenneTwister, so I am not sure why there isn't one.
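If shared RNG state were the worry, one way to rule it out entirely is to give each thread its own generator. This is just an untested sketch of the pattern (the names `RNGS` and `fill_random!` are made up for illustration):

```julia
using Random, Base.Threads

# One independent generator per thread, so no RNG state is ever shared.
# The per-thread seed offset is arbitrary; purely illustrative.
const RNGS = [MersenneTwister(1000 + i) for i in 1:nthreads()]

function fill_random!(out)
    @threads for i in eachindex(out)
        out[i] = rand(RNGS[threadid()])  # each thread only touches its own RNG
    end
    return out
end

out = fill_random!(zeros(100))
```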

It is also possible that the underlying matrix algorithms are multithreaded to start with, so you may not see much of a speedup. I did get about a 0.5 s speedup on my 6-core machine, though:

using ThreadPools, BenchmarkTools, Random
using LinearAlgebra

function kernel_eval((j,k))
  #A = randn(MersenneTwister(j*k), 100,100)
  A = randn(100, 100)
  M = A'A
  logdet(M)
end


function serial_build()
  out = [kernel_eval((j,k)) for j in 1:64, k in 1:64]
  nothing
end
@time serial_build() # 16.460

function threads_build()
    out = Array{Float64, 2}(undef, 64, 64)
    Threads.@threads for jk in 1:64*64
        j, k = divrem(jk-1, 64)
        j += 1
        k += 1
        @inbounds out[j, k] = kernel_eval((j,k))
    end
    nothing
end

@time threads_build()

Thanks for looking into it. The linear algebra stuff is all tangential, though—I just wanted an expensive function to fill in values of a matrix with. In any case, it’s definitely just one thread doing all the things inside kernel_eval, if I’m not mistaken, so I don’t think that should be confounding anything (although I could be wrong).

Intuitively, at least, I think this is about the most parallelizable job there could be: you don't need to pass around any data, you can fill in the matrix in any order you want, and none of the elements require information from any other computation. So I really would like to understand where the extra thread overhead comes from.

I think what’s happening is that in some way everything is still waiting on the primary thread, or somehow needs to communicate with it. The functions using ThreadPools.bforeach or bmap are the only ones that are even competitive with the purely serial version. @tro3, could you provide some insight on what might be happening?

EDIT: @xiaodai, I see your point now about multithreaded linear algebra potentially being confounding. Sorry to respond before thinking long enough. But setting OPENBLAS_NUM_THREADS and OMP_NUM_THREADS to 1 did not change my results in the above benchmarks, interestingly.

@cgeoga - sorry for the slow reply. For some reason, Julia Discourse has stopped sending me emails.

Have you tried the logging versions of the functions, logbforeach and logbmap? They would give you insight into the order in which things are progressing. Just from your execution times, it does look like everything is happening serially, even though you've requested parallelism in a fashion that looks right to me. If that is the case, a plot of the log may show this. The evidence may also only be indirect: if a job starts and sits waiting for a lock of some sort from the Twister, it will still show up as parallel, but it would take much longer than in the serial case, which would be a clue.

It also might be worth setting up a more manual loop with @threads. This would be a double-check to see if perhaps the ThreadPools implementation suffered in one of the Julia upgrades.
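Something like this would do as the double-check (untested sketch; `manual_build` is just a placeholder name):

```julia
using Base.Threads

# Manual double-check: plain @threads over the columns, no ThreadPools.
function manual_build(fn, n)
    out = Matrix{Float64}(undef, n, n)
    @threads for k in 1:n
        for j in 1:n
            out[j, k] = fn((j, k))
        end
    end
    return out
end

out = manual_build(jk -> Float64(jk[1] * jk[2]), 4)
```

If this shows the same lack of speedup, the problem is below ThreadPools.
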


MKL is much faster than OpenBLAS here, so I expect that explains much of the difference in runtimes.

julia> using LinearAlgebra, Random; BLAS.vendor()
:mkl

julia> function kernel_eval((j,k))
         A = randn(MersenneTwister(j*k), 100,100)
         M = A'A
         logdet(M)
       end
kernel_eval (generic function with 1 method)

julia> function serial_build()
         out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
         nothing
       end
serial_build (generic function with 1 method)

julia> @time kernel_eval((1,1))
  0.021247 seconds (17 allocations: 255.000 KiB)
369.39002383999633

julia> @time serial_build()
  8.975811 seconds (1.11 M allocations: 15.938 GiB, 2.10% gc time)

versus:

julia> using LinearAlgebra, Random; BLAS.vendor()
:openblas64

julia> function kernel_eval((j,k))
         A = randn(MersenneTwister(j*k), 100,100)
         M = A'A
         logdet(M)
       end
kernel_eval (generic function with 1 method)

julia> function serial_build()
         out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
         nothing
       end
serial_build (generic function with 1 method)

julia> @time kernel_eval((1,1))
  0.002413 seconds (15 allocations: 254.953 KiB)
359.1101919188106

julia> @time serial_build()
116.593972 seconds (983.04 k allocations: 15.935 GiB, 0.18% gc time)

That’s about a 13x speedup from MKL instead of OpenBLAS.

Anyway, to set the number of BLAS threads, use BLAS.set_num_threads(1). Doing that, I get:

julia> @time kernel_eval((1,1))
  0.015403 seconds (17 allocations: 255.000 KiB)
369.39002383999633

julia> @time serial_build()
  9.432848 seconds (1.11 M allocations: 15.938 GiB, 4.60% gc time)

julia> @btime tmap_build()
  596.187 ms (1114270 allocations: 15.94 GiB)

julia> @btime bmap_build()
  609.372 ms (1114271 allocations: 15.94 GiB)

julia> @btime tfor_build()
  596.185 ms (1114274 allocations: 15.94 GiB)

julia> @btime bfor_build()
  609.843 ms (1114276 allocations: 15.94 GiB)

Hey @tro3 and @elrod—Thanks so much for the responses! Also, @tro3, your reply time was amazing. Sorry to cold @ you.

I definitely will look into the logging versions of those functions. Considering the discrepancy between tmap and bmap, your hunch sounds very sensible. I'll update tomorrow with the results. I did actually try three or four different manual Threads.@threads for ... loop constructions, though, and all of them were in the ballpark of 40 seconds for me, so I think ThreadPools.jl is still kicking butt.

I also gather that doing some small linear algebra was the wrong way to write a contrived slow kernel function (thanks for the note on correctly setting BLAS threads, @elrod!). I should choose something that doesn’t call another complex library. Hopefully that will also demystify things.
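For instance, a contrived kernel that stays in pure Julia (no BLAS, no RNG library) would make any scaling easier to interpret. Just a sketch, with an arbitrary loop count chosen to make it slow:

```julia
# Pure-Julia "expensive" kernel: no BLAS, no RNG, just arithmetic,
# so any observed scaling comes from the threading alone.
function kernel_eval_pure((j, k))
    acc = 0.0
    for i in 1:100_000
        acc += sin(j * i / 1000) * cos(k * i / 1000)
    end
    return acc
end

v = kernel_eval_pure((3, 7))
```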

Thanks again to everybody for taking the time to look at this. I’ll follow up tomorrow with some more thoughtful testing and logged applications of ThreadPools.jl.

That’s nuts

Hey all—here's an update with some conclusive information. In short, I think a lot of the problems that I attributed to inefficient multithreading were actually about Julia's fancy new partr scheduler not playing nicely with OpenBLAS's own threading (?). At least, I think I've seen some GitHub discussion about how composable multithreading's interaction with BLAS is not totally worked out yet.

Setting BLAS.set_num_threads(1) per @elrod’s suggestion and using two threads for ThreadPools.jl, I get good speedups, and my ThreadPools.jl log looks like this:
[ThreadPools log plot: poolplot_blas1]

If I don’t do that, it looks like this:
[ThreadPools log plot: poolplot_noblas1]

So it does seem like there is plenty of waiting that happens. @tro3, thanks for the excellent logging functionality. I never would have expected it would be so easy to diagnose this issue.

In the process of exploring the most efficient way to build matrices with expensive functions, I ended up making kind of a shootout between many convenient thread-parallel things. I also added a couple new expensive functions: one that just sleeps for a moment then returns 1.0 and another that uses a const global object that has been made a subtype of Function, which is a very common code pattern for me. In this code, I make it a numerical integrator using FastGaussQuadrature. Here is the code for the full study:

using ThreadPools, BenchmarkTools, Random, Transducers, ThreadsX
using FastGaussQuadrature, StaticArrays
using LinearAlgebra # logdet and BLAS.set_num_threads live here

struct Integrator{N} <: Function
  nodes::SVector{N, Float64}
  weights::SVector{N, Float64}
  function Integrator{N}() where{N}
    no, wt = FastGaussQuadrature.gausslegendre(N)
    new(SVector{N, Float64}(no), SVector{N, Float64}(wt))
  end
end

function (K::Integrator{N})((j,k)) where{N}
  fun(x) = 1.0 + (j/k)*x - (0.5*j/k)*x^2 + (k/j)*x^3
  for l in 1:15 # waste time to make this function more expensive.
    sum(x->l*x[1]*fun(x[2]), zip(K.weights, K.nodes))
  end
  sum(x->x[1]*fun(x[2]), zip(K.weights, K.nodes))
end

#BLAS.set_num_threads(1)
function kernel_eval_logdet((j,k))
  A = randn(MersenneTwister(j*k), 100,100)
  M = A'A
  logdet(M)
end

function kernel_eval_sleep((j,k))
  sleep(0.000001)
  1.0
end

const kernel_eval_quad = Integrator{1024}()

function serial_comp_build(fn)
  out = [fn((j,k)) for j in 1:256, k in 1:256]
  nothing
end

function tmap_build(fn)
  out = ThreadPools.tmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function bmap_build(fn)
  out = ThreadPools.bmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function transducers_build(fn)
  out = tcollect(Map(x->fn((x[1], x[2]))), 
                 collect(Iterators.product(1:256,1:256)))
  nothing
end

function threadsx_build(fn)
  ThreadsX.map(fn, collect(Iterators.product(1:256,1:256)))
  nothing
end

function manual_build_loop(fn)
  out = zeros(256,256)
  chunks = collect(Iterators.partition(CartesianIndices((256,256)), 
                                       Threads.nthreads()))
  Threads.@threads for chunk in chunks
    @simd for ix in chunk
      @inbounds out[CartesianIndex(ix)] = fn((ix[1],ix[2]))
    end
  end
  nothing
end

function manual_build_procs(fn)
  getcol(k) = [fn((j,k))  for j in 1:256]
  cols = [Threads.@spawn getcol(k) for k in 1:256]
  out  = reduce(hcat, ThreadPools.tmap(fetch, cols))
  nothing
end

# Shootout:
const buildnames  = (:serial_comp,
                     :threadpools_tmap, :threadpools_bmap,
                     :transducers_tcollect, :threadsx_map, 
                     :manual_forloop, :manual_procvec)
const kernelnames = (:linalg, :sleep, :quadrature)

kernel_info = zip((kernel_eval_logdet, kernel_eval_sleep, kernel_eval_quad), 
                  kernelnames)
build_info  = zip((serial_comp_build, 
                   tmap_build, bmap_build,
                   transducers_build, 
                   threadsx_build,
                   manual_build_loop, manual_build_procs),
                  buildnames)

for ((kernelfn, kname), (buildfn, bname)) in Iterators.product(kernel_info, build_info)
  print("$bname/$kname:")
  @btime $buildfn($kernelfn)
end

Running on a work computer with a much better CPU that has 4 physical cores (or threads?) and asking Julia to use 6 (hyper?) threads (as in, JULIA_NUM_THREADS=6 julia -O3 ....), here is the result with BLAS.set_num_threads(1):

serial_comp/linalg:  10.142 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  83.117 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.777 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  3.243 s (983082 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (46 allocations: 1.54 MiB)
threadpools_bmap/linalg:  3.734 s (983080 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.845 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.779 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  3.113 s (983303 allocations: 15.94 GiB)
transducers_tcollect/sleep:  21.587 s (281433 allocations: 11.39 MiB)
transducers_tcollect/quadrature:  2.902 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  3.535 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.246 s (274187 allocations: 13.04 MiB)
threadsx_map/quadrature:  2.899 s (2270 allocations: 4.68 MiB)
manual_forloop/linalg:  4.528 s (983065 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.905 s (25 allocations: 1.77 MiB)
manual_procvec/linalg:  3.591 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  293.469 ms (265561 allocations: 72.80 MiB)
manual_procvec/quadrature:  2.919 s (2331 allocations: 69.02 MiB)

Without setting BLAS.set_num_threads(1), here is the output:

serial_comp/linalg:  11.161 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  84.323 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.775 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  7.241 s (983084 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (45 allocations: 1.54 MiB)
threadpools_bmap/linalg:  7.842 s (983081 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.846 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.777 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  7.362 s (983305 allocations: 15.94 GiB)
transducers_tcollect/sleep:  17.340 s (314308 allocations: 12.39 MiB)
transducers_tcollect/quadrature:  2.910 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  7.375 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.211 s (274133 allocations: 13.03 MiB)
threadsx_map/quadrature:  2.910 s (2272 allocations: 4.68 MiB)
manual_forloop/linalg:  7.222 s (983067 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.910 s (27 allocations: 1.77 MiB)
manual_procvec/linalg:  7.349 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  291.663 ms (265562 allocations: 72.78 MiB)
manual_procvec/quadrature:  2.888 s (2332 allocations: 69.02 MiB)

In general, I would say that the takeaway is that with the BLAS threading fixed, the speedup for all three different expensive kernels is very satisfying. And the manual_build_procs function being competitive is really a testament to the composable multithreading that Julia offers. So that’s pretty cool.
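To be concrete, the core of that manual_build_procs pattern is just one @spawn per column plus fetch. Reduced to a toy sketch (tiny sizes just for illustration):

```julia
using Base.Threads

# @spawn one task per column, then fetch and hcat the results.
getcol(fn, k, n) = [fn((j, k)) for j in 1:n]
tasks = [Threads.@spawn getcol(jk -> Float64(jk[1] * jk[2]), k, 4) for k in 1:4]
out = reduce(hcat, fetch.(tasks))   # out[j, k] == fn((j, k))
```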


Glad to hear those visualizers were useful for someone besides me! :slight_smile:


Am I right in understanding that you don't get a speedup unless you set the BLAS threads to 1? But in absolute terms, BLAS threads = 1 is slower than leaving the BLAS thread count where it was?

Hey @xiaodai—I would say that's right. If you're not going to use multiple threads in any of your Julia code, it seems beneficial to let BLAS have them. But at least in these examples, making sure that BLAS is not working in parallel while Julia's threads do the outer parallelism led to some serious benefits.
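In code form, the setup that worked boils down to this (a sketch on a smaller 16x16 grid with 50x50 kernels, just to keep it quick):

```julia
using LinearAlgebra, Random, Base.Threads

BLAS.set_num_threads(1)  # single-threaded BLAS; Julia threads do the outer loop

# Same kernel as the MWE, just smaller.
kernel(jk) = (A = randn(MersenneTwister(jk[1] * jk[2]), 50, 50); logdet(A'A))

out = Matrix{Float64}(undef, 16, 16)
@threads for k in 1:16
    for j in 1:16
        out[j, k] = kernel((j, k))
    end
end
```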