Why doesn't multithreading help here?

Hello—

A very common computational issue for me is that I want to assemble kernel matrices with pretty expensive kernel functions. I have tried various ways to do this efficiently in parallel, but I can’t seem to write the assembly code in a way that actually gives me a speedup at all, let alone what one might expect from a job that at least intuitively is very well suited to parallelization. Consider this MWE that creates a very silly kernel function and tries to assemble a matrix with it, which I ran with
JULIA_NUM_THREADS=3 julia -q -O3 test.jl:

using ThreadPools, BenchmarkTools, Random, LinearAlgebra # LinearAlgebra provides logdet

function kernel_eval((j,k))
  A = randn(MersenneTwister(j*k), 100,100)
  M = A'A
  logdet(M)
end

function serial_build()
  out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
  nothing
end

function tmap_build()
  out = ThreadPools.tmap(x->kernel_eval(x), Iterators.product(1:256, 1:256))
  nothing
end

function bmap_build()
  out = ThreadPools.bmap(x->kernel_eval(x), Iterators.product(1:256, 1:256))
  nothing
end

function tfor_build()
  out = zeros(256,256)
  ThreadPools.tforeach(x->setindex!(out, kernel_eval(x), CartesianIndex(x)),
          Iterators.product(1:256, 1:256))
  nothing
end

function bfor_build()
  out = zeros(256,256)
  ThreadPools.bforeach(x->setindex!(out, kernel_eval(x), CartesianIndex(x)),
          Iterators.product(1:256, 1:256))
  nothing
end

println("Serial:")
@btime serial_build() # 16.460 s

println("tmap:")
@btime tmap_build() # 38.482 s

println("bmap:")
@btime bmap_build() # 17.216 s

println("tfor:")
@btime tfor_build() # 40.416 s

println("bfor:")
@btime bfor_build() # 17.167 s

Here are the specs for my machine, with Julia installed from the Fedora copr repo:

Julia Version 1.5.0
Commit 96786e22cc* (2020-08-01 23:44 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
      "Fedora release 32 (Thirty Two)"
  uname: Linux 5.7.15-200.fc32.x86_64 #1 SMP Tue Aug 11 16:36:14 UTC 2020 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz: 
              speed         user         nice          sys         idle          irq
       #1  1618 MHz     404410 s       3579 s     106187 s    2415131 s      11155 s
       #2  1716 MHz     400420 s       4600 s      93772 s    2430624 s      13429 s
       
  Memory: 7.653873443603516 GB (5698.8828125 MB free)
  Uptime: 29632.0 sec
  Load Avg:  0.46875  0.48779296875  0.85400390625
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

As you can see, when I run this, the bmap and bforeach versions are a tiny bit slower than the serial, and the tmap and tforeach versions take more than twice as long as the serial version (!). I wouldn’t think that there needs to be much communication, and the computation takes long enough that I would think other slight overhead costs for parallelism really shouldn’t factor in here. I figure there must be some kind of communication that is preventing the benefit of having several threads. But I’m not sure what it would be. Can somebody help me understand why I’m not getting the speedup that I would think I should be getting?

How fast is your computer? 256 by 256 is not 16s for me.


Hey @xiaodai, thanks for the note. My computer is quite slow, and I have hyperthreading disabled, and I use tlp in power-saving settings. So this machine is very much not configured for performance. But should that affect the interpretation of the issue? If you are willing, would you run that script and share all the times? Are the thread-parallel ones at all faster for you?

64 by 64 is more like 16s for me. I am on Windows though and you may have a better matrix lib.

I am not really sure what's happening, but I am not sure if MersenneTwister is thread-friendly. I really don't know.

I also don't get any speedup after turning off MersenneTwister, so I am not sure why there isn't one.
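If shared RNG state were the worry, one way to rule it out entirely is to give each thread its own generator. This is just an untested sketch of the pattern (the names `RNGS` and `fill_random!` are made up for illustration):

```julia
using Random, Base.Threads

# One independent generator per thread, so no RNG state is ever shared.
# The per-thread seed offset is arbitrary; purely illustrative.
const RNGS = [MersenneTwister(1000 + i) for i in 1:nthreads()]

function fill_random!(out)
    @threads for i in eachindex(out)
        out[i] = rand(RNGS[threadid()])  # each thread only touches its own RNG
    end
    return out
end

out = fill_random!(zeros(100))
```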

It is also possible that the underlying matrix algorithms are multithreaded to start with, so you may not see much of a speedup. I did get about a 0.5 s speedup on my 6-core machine, though:

using ThreadPools, BenchmarkTools, Random
using LinearAlgebra

function kernel_eval((j,k))
  #A = randn(MersenneTwister(j*k), 100,100)
  A = randn(100, 100)
  M = A'A
  logdet(M)
end


function serial_build()
  out = [kernel_eval((j,k)) for j in 1:64, k in 1:64]
  nothing
end
@time serial_build() # 16.460

function threads_build()
    out = Array{Float64, 2}(undef, 64, 64)
    Threads.@threads for jk in 1:64*64
        j, k = divrem(jk-1, 64)
        j += 1
        k += 1
        @inbounds out[j, k] = kernel_eval((j,k))
    end
    nothing
end

@time threads_build()

Thanks for looking into it. The linear algebra stuff is all tangential, though—I just wanted an expensive function to fill in values of a matrix with. In any case, it’s definitely just one thread doing all the things inside kernel_eval, if I’m not mistaken, so I don’t think that should be confounding anything (although I could be wrong).

Intuitively, at least, I think this is about the most parallelizable job there could be: you don't need to pass around any data, you can fill in the matrix in any order you want, and none of the elements require information from any other computation. So I really would like to understand where the extra thread overhead comes from.

I think what’s happening is that in some way everything is still waiting on the primary thread, or somehow needs to communicate with it. The functions using ThreadPools.bforeach or bmap are the only ones that are even competitive with the purely serial version. @tro3, could you provide some insight on what might be happening?

EDIT: @xiaodai, I see your point now about multithreaded linear algebra potentially being confounding. Sorry to respond before thinking long enough. But setting OPENBLAS_NUM_THREADS and OMP_NUM_THREADS to 1 did not change my results in the above benchmarks, interestingly.

@cgeoga - sorry for the slow reply. For some reason, Julia Discourse has stopped sending me emails.

Have you tried the logging versions of the functions, logbforeach and logbmap? They would give you insight into the order in which things are progressing. Just from your execution times, it does look like everything is happening serially, even though you've requested parallelism in a fashion that looks right to me. If that is the case, a plot of the log may show this. The evidence may also only be indirect: if a job starts and sits waiting for a lock of some sort from the Twister, it will still show up as parallel, but it would take much longer than in the serial case, which would be a clue.

It also might be worth setting up a more manual loop with @threads. This would be a double-check to see if perhaps the ThreadPools implementation suffered in one of the Julia upgrades.
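Something like this would do as the double-check (untested sketch; `manual_build` is just a placeholder name):

```julia
using Base.Threads

# Manual double-check: plain @threads over the columns, no ThreadPools.
function manual_build(fn, n)
    out = Matrix{Float64}(undef, n, n)
    @threads for k in 1:n
        for j in 1:n
            out[j, k] = fn((j, k))
        end
    end
    return out
end

out = manual_build(jk -> Float64(jk[1] * jk[2]), 4)
```

If this shows the same lack of speedup, the problem is below ThreadPools.
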


MKL is much faster than OpenBLAS here, so I expect that explains much of the difference in runtimes.

julia> using LinearAlgebra, Random; BLAS.vendor()
:mkl

julia> function kernel_eval((j,k))
         A = randn(MersenneTwister(j*k), 100,100)
         M = A'A
         logdet(M)
       end
kernel_eval (generic function with 1 method)

julia> function serial_build()
         out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
         nothing
       end
serial_build (generic function with 1 method)

julia> @time kernel_eval((1,1))
  0.021247 seconds (17 allocations: 255.000 KiB)
369.39002383999633

julia> @time serial_build()
  8.975811 seconds (1.11 M allocations: 15.938 GiB, 2.10% gc time)

versus:

julia> using LinearAlgebra, Random; BLAS.vendor()
:openblas64

julia> function kernel_eval((j,k))
         A = randn(MersenneTwister(j*k), 100,100)
         M = A'A
         logdet(M)
       end
kernel_eval (generic function with 1 method)

julia> function serial_build()
         out = [kernel_eval((j,k)) for j in 1:256, k in 1:256]
         nothing
       end
serial_build (generic function with 1 method)

julia> @time kernel_eval((1,1))
  0.002413 seconds (15 allocations: 254.953 KiB)
359.1101919188106

julia> @time serial_build()
116.593972 seconds (983.04 k allocations: 15.935 GiB, 0.18% gc time)

That’s about a 13x speedup from MKL instead of OpenBLAS.

Anyway, to set the number of BLAS threads, use BLAS.set_num_threads(1). Doing that, I get:

julia> @time kernel_eval((1,1))
  0.015403 seconds (17 allocations: 255.000 KiB)
369.39002383999633

julia> @time serial_build()
  9.432848 seconds (1.11 M allocations: 15.938 GiB, 4.60% gc time)

julia> @btime tmap_build()
  596.187 ms (1114270 allocations: 15.94 GiB)

julia> @btime bmap_build()
  609.372 ms (1114271 allocations: 15.94 GiB)

julia> @btime tfor_build()
  596.185 ms (1114274 allocations: 15.94 GiB)

julia> @btime bfor_build()
  609.843 ms (1114276 allocations: 15.94 GiB)

Hey @tro3 and @elrod—Thanks so much for the responses! Also, @tro3, your reply time was amazing. Sorry to cold @ you.

I definitely will look into the logging versions of those functions. Considering the discrepancy between tmap and bmap, your hunch sounds very sensible. I'll update tomorrow with the results. I did actually try three or four different manual Threads.@threads for ... loop constructions, though, and all of them were in the ballpark of 40 seconds for me, so I think ThreadPools.jl is still kicking butt.

I also gather that doing some small linear algebra was the wrong way to write a contrived slow kernel function (thanks for the note on correctly setting BLAS threads, @elrod!). I should choose something that doesn’t call another complex library. Hopefully that will also demystify things.
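For instance, a contrived kernel that stays in pure Julia (no BLAS, no RNG library) would make any scaling easier to interpret. Just a sketch, with an arbitrary loop count chosen to make it slow:

```julia
# Pure-Julia "expensive" kernel: no BLAS, no RNG, just arithmetic,
# so any observed scaling comes from the threading alone.
function kernel_eval_pure((j, k))
    acc = 0.0
    for i in 1:100_000
        acc += sin(j * i / 1000) * cos(k * i / 1000)
    end
    return acc
end

v = kernel_eval_pure((3, 7))
```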

Thanks again to everybody for taking the time to look at this. I’ll follow up tomorrow with some more thoughtful testing and logged applications of ThreadPools.jl.

That’s nuts

Hey all—here's an update with some conclusive information. In short, I think a lot of the problems that I attributed to inefficient multithreading were actually about Julia's fancy new partr scheduler not playing nicely with OpenBLAS's own threading (?). At least, I think I've seen some GitHub discussion about how composable multithreading's interaction with BLAS is not totally worked out yet.

Setting BLAS.set_num_threads(1) per @elrod’s suggestion and using two threads for ThreadPools.jl, I get good speedups, and my ThreadPools.jl log looks like this:
[ThreadPools log plot: poolplot_blas1]

If I don’t do that, it looks like this:
[ThreadPools log plot: poolplot_noblas1]

So it does seem like there is plenty of waiting that happens. @tro3, thanks for the excellent logging functionality. I never would have expected it would be so easy to diagnose this issue.

In the process of exploring the most efficient way to build matrices with expensive functions, I ended up making kind of a shootout between many convenient thread-parallel things. I also added a couple new expensive functions: one that just sleeps for a moment then returns 1.0 and another that uses a const global object that has been made a subtype of Function, which is a very common code pattern for me. In this code, I make it a numerical integrator using FastGaussQuadrature. Here is the code for the full study:

using ThreadPools, BenchmarkTools, Random, Transducers, ThreadsX
using FastGaussQuadrature, StaticArrays
using LinearAlgebra # logdet and BLAS.set_num_threads live here

struct Integrator{N} <: Function
  nodes::SVector{N, Float64}
  weights::SVector{N, Float64}
  function Integrator{N}() where{N}
    no, wt = FastGaussQuadrature.gausslegendre(N)
    new(SVector{N, Float64}(no), SVector{N, Float64}(wt))
  end
end

function (K::Integrator{N})((j,k)) where{N}
  fun(x) = 1.0 + (j/k)*x - (0.5*j/k)*x^2 + (k/j)*x^3
  for l in 1:15 # waste time to make this function more expensive.
    sum(x->l*x[1]*fun(x[2]), zip(K.weights, K.nodes))
  end
  sum(x->x[1]*fun(x[2]), zip(K.weights, K.nodes))
end

#BLAS.set_num_threads(1)
function kernel_eval_logdet((j,k))
  A = randn(MersenneTwister(j*k), 100,100)
  M = A'A
  logdet(M)
end

function kernel_eval_sleep((j,k))
  sleep(0.000001)
  1.0
end

const kernel_eval_quad = Integrator{1024}()

function serial_comp_build(fn)
  out = [fn((j,k)) for j in 1:256, k in 1:256]
  nothing
end

function tmap_build(fn)
  out = ThreadPools.tmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function bmap_build(fn)
  out = ThreadPools.bmap(fn, Iterators.product(1:256, 1:256))
  nothing
end

function transducers_build(fn)
  out = tcollect(Map(x->fn((x[1], x[2]))), 
                 collect(Iterators.product(1:256,1:256)))
  nothing
end

function threadsx_build(fn)
  ThreadsX.map(fn, collect(Iterators.product(1:256,1:256)))
  nothing
end

function manual_build_loop(fn)
  out = zeros(256,256)
  chunks = collect(Iterators.partition(CartesianIndices((256,256)), 
                                       Threads.nthreads()))
  Threads.@threads for chunk in chunks
    @simd for ix in chunk
      @inbounds out[CartesianIndex(ix)] = fn((ix[1],ix[2]))
    end
  end
  nothing
end

function manual_build_procs(fn)
  getcol(k) = [fn((j,k))  for j in 1:256]
  cols = [Threads.@spawn getcol(k) for k in 1:256]
  out  = reduce(hcat, ThreadPools.tmap(fetch, cols))
  nothing
end

# Shootout:
const buildnames  = (:serial_comp,
                     :threadpools_tmap, :threadpools_bmap,
                     :transducers_tcollect, :threadsx_map, 
                     :manual_forloop, :manual_procvec)
const kernelnames = (:linalg, :sleep, :quadrature)

kernel_info = zip((kernel_eval_logdet, kernel_eval_sleep, kernel_eval_quad), 
                  kernelnames)
build_info  = zip((serial_comp_build, 
                   tmap_build, bmap_build,
                   transducers_build, 
                   threadsx_build,
                   manual_build_loop, manual_build_procs),
                  buildnames)

for ((kernelfn, kname), (buildfn, bname)) in Iterators.product(kernel_info, build_info)
  print("$bname/$kname:")
  @btime $buildfn($kernelfn)
end

Running on a work computer with a much better CPU that has 4 physical cores (or threads?) and asking Julia to use 6 (hyper?) threads (as in, JULIA_NUM_THREADS=6 julia -O3 ....), here is the result with BLAS.set_num_threads(1):

serial_comp/linalg:  10.142 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  83.117 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.777 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  3.243 s (983082 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (46 allocations: 1.54 MiB)
threadpools_bmap/linalg:  3.734 s (983080 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.845 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.779 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  3.113 s (983303 allocations: 15.94 GiB)
transducers_tcollect/sleep:  21.587 s (281433 allocations: 11.39 MiB)
transducers_tcollect/quadrature:  2.902 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  3.535 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.246 s (274187 allocations: 13.04 MiB)
threadsx_map/quadrature:  2.899 s (2270 allocations: 4.68 MiB)
manual_forloop/linalg:  4.528 s (983065 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.905 s (25 allocations: 1.77 MiB)
manual_procvec/linalg:  3.591 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  293.469 ms (265561 allocations: 72.80 MiB)
manual_procvec/quadrature:  2.919 s (2331 allocations: 69.02 MiB)

Without setting BLAS.set_num_threads(1), here is the output:

serial_comp/linalg:  11.161 s (983042 allocations: 15.94 GiB)
serial_comp/sleep:  84.323 s (330413 allocations: 10.38 MiB)
serial_comp/quadrature:  10.775 s (2 allocations: 512.08 KiB)
threadpools_tmap/linalg:  7.241 s (983084 allocations: 15.94 GiB)
threadpools_tmap/sleep:  16.383 s (330452 allocations: 11.38 MiB)
threadpools_tmap/quadrature:  2.909 s (45 allocations: 1.54 MiB)
threadpools_bmap/linalg:  7.842 s (983081 allocations: 15.94 GiB)
threadpools_bmap/sleep:  21.846 s (330451 allocations: 11.38 MiB)
threadpools_bmap/quadrature:  3.777 s (43 allocations: 1.54 MiB)
transducers_tcollect/linalg:  7.362 s (983305 allocations: 15.94 GiB)
transducers_tcollect/sleep:  17.340 s (314308 allocations: 12.39 MiB)
transducers_tcollect/quadrature:  2.910 s (252 allocations: 3.04 MiB)
threadsx_map/linalg:  7.375 s (985406 allocations: 15.94 GiB)
threadsx_map/sleep:  2.211 s (274133 allocations: 13.03 MiB)
threadsx_map/quadrature:  2.910 s (2272 allocations: 4.68 MiB)
manual_forloop/linalg:  7.222 s (983067 allocations: 15.94 GiB)
manual_forloop/sleep:  16.384 s (330435 allocations: 11.63 MiB)
manual_forloop/quadrature:  2.910 s (27 allocations: 1.77 MiB)
manual_procvec/linalg:  7.349 s (985372 allocations: 16.00 GiB)
manual_procvec/sleep:  291.663 ms (265562 allocations: 72.78 MiB)
manual_procvec/quadrature:  2.888 s (2332 allocations: 69.02 MiB)

In general, I would say that the takeaway is that with the BLAS threading fixed, the speedup for all three different expensive kernels is very satisfying. And the manual_build_procs function being competitive is really a testament to the composable multithreading that Julia offers. So that’s pretty cool.
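To be concrete, the core of that manual_build_procs pattern is just one @spawn per column plus fetch. Reduced to a toy sketch (tiny sizes just for illustration):

```julia
using Base.Threads

# @spawn one task per column, then fetch and hcat the results.
getcol(fn, k, n) = [fn((j, k)) for j in 1:n]
tasks = [Threads.@spawn getcol(jk -> Float64(jk[1] * jk[2]), k, 4) for k in 1:4]
out = reduce(hcat, fetch.(tasks))   # out[j, k] == fn((j, k))
```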


Glad to hear those visualizers were useful for someone besides me! :slight_smile:


Am I right in understanding that you don't get a speedup unless you set the BLAS threads to 1? But in absolute terms, BLAS threads = 1 is slower than leaving the BLAS thread count where it was?

Hey @xiaodai—I would say that's right. If you're not going to use multiple threads in any of your Julia code, it seems beneficial to let BLAS have them. But at least in these examples, making sure that BLAS is not working in parallel while Julia's threads do the outer parallelism led to some serious benefits.
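In code form, the setup that worked boils down to this (a sketch on a smaller 16x16 grid with 50x50 kernels, just to keep it quick):

```julia
using LinearAlgebra, Random, Base.Threads

BLAS.set_num_threads(1)  # single-threaded BLAS; Julia threads do the outer loop

# Same kernel as the MWE, just smaller.
kernel(jk) = (A = randn(MersenneTwister(jk[1] * jk[2]), 50, 50); logdet(A'A))

out = Matrix{Float64}(undef, 16, 16)
@threads for k in 1:16
    for j in 1:16
        out[j, k] = kernel((j, k))
    end
end
```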