Why is the threaded version of this so much slower than the serial and distributed versions?

Minimal example of my problem:

using ThreadsX
using Statistics

data = [randn(1000) for i in 1:100, j in 1:200]

f(x) = mean(x)

# warmup
f(data[1, 2])

t_serial = @elapsed map(f, data)
t_thread = @elapsed ThreadsX.map(f, data)
t_serial / t_thread # 0.2139645344930473

The exact ratio varies a bit depending on the sizes of the data arrays, but the serial version is always faster. My real function f(x) is much more computationally intensive (it fits a Turing model, which takes ~ 1 sec to run). Looking at both the MWE and my real code, there’s a ton more allocation and garbage collection in the threaded version, which seems to get hung up when collecting the results at the end. Any ideas what’s going on?
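One way to make the allocation/GC difference concrete is Base's `@timed`, which reports allocated bytes and time spent in GC alongside the result (a sketch on the first MWE; the `ThreadsX` line is commented out so the snippet runs with Base and Statistics alone):

```julia
using Statistics

data = [randn(1000) for i in 1:100, j in 1:200]
f(x) = mean(x)

# @timed returns a NamedTuple with fields value, time, bytes, and gctime,
# which makes the allocation gap between the two versions visible.
serial = @timed map(f, data)
println("serial:   ", serial.bytes, " bytes, ", serial.gctime, " s in GC")

# threaded = @timed ThreadsX.map(f, data)   # requires ThreadsX to be loaded
# println("threaded: ", threaded.bytes, " bytes, ", threaded.gctime, " s in GC")
```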

Here’s a more realistic MWE. I also compared it with Distributed’s pmap, which is faster, as expected:

using Distributed

@everywhere using Turing
@everywhere begin

    @model function ExampleModel(x)
        μ ~ Normal(0, 1)
        x .~ Normal(μ, 1)
    end

    function f(x)
        sample(ExampleModel(x), NUTS(), 100)
    end

    f(randn(5)) # warmup on all cores
end

data = [randn(10000) for i in 1:10, j in 1:20]
sample(ExampleModel(data[1, 1]), NUTS(), 100)

t_serial = @elapsed map(f, data) # 15.4 s
t_distributed = @elapsed pmap(f, data) # 4.0 s
t_thread = @elapsed ThreadsX.map(f, data) # hangs after running all the models
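One note on the Distributed snippet above: `pmap` only parallelizes across worker processes, so they need to exist before the `@everywhere` calls run. A minimal sketch of the missing setup step (worker count here is my own choice; adjust to your core count):

```julia
using Distributed

# Add worker processes first; @everywhere only broadcasts code to
# workers that already exist when it runs.
addprocs(4)
nworkers() # 4
```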

A few things:

  1. For benchmarking, you should generally use BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl).
  2. There’s a lot of console output and the redraw speed of your terminal could be limiting the performance.
  3. Your first example is likely saturating your memory bandwidth, so it’s never going to scale well with cores.
  4. ThreadPools.tmap performs a lot better for me on your tests than ThreadsX.map and completes your 2nd example in 3.1 seconds.
  5. There are trade-offs to consider when using distributed processes or threads. Distributed processes will perform better for longer computations with GC overhead (like your 2nd example). Threads will perform better for tight loops with no allocation and shared data.
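For reference, the kind of task-per-element threaded map that `ThreadPools.tmap` provides can be sketched with nothing but Base's `Threads.@spawn` (the function name here is my own; this is an illustration of the pattern, not `ThreadPools`'s actual implementation):

```julia
using Base.Threads

# One task per element; each task owns its own result slot, so no
# locking is needed and the output keeps the input array's shape.
function tmap_sketch(f, xs::AbstractArray)
    tasks = map(x -> Threads.@spawn(f(x)), xs)
    return map(fetch, tasks)
end

data = [randn(1000) for i in 1:100, j in 1:200]
out = tmap_sketch(x -> sum(x) / length(x), data)
size(out) # (100, 200)
```

Spawning one task per element is fine when each `f(x)` is expensive (like a NUTS fit); for cheap per-element work you would chunk the input instead so scheduling overhead doesn't dominate.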