Why is the threaded version of this so much slower than the serial and distributed versions?

Minimal example of my problem:

using ThreadsX
using Statistics

data = [randn(1000) for i in 1:100, j in 1:200]

f(x) = mean(x)

# warmup
f(data[1, 2])

t_serial = @elapsed map(f, data)
t_thread = @elapsed ThreadsX.map(f, data)
t_serial / t_thread # 0.2139645344930473

The exact ratio varies a bit depending on the sizes of the data arrays, but the serial version is always faster. My real function f(x) is much more computationally intensive (it fits a Turing model, which takes ~ 1 sec to run). Looking at both the MWE and my real code, there’s a ton more allocation and garbage collection in the threaded version, which seems to get hung up when collecting the results at the end. Any ideas what’s going on?
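
One way to quantify this is @timed, which returns bytes allocated and time spent in GC along with the result. A minimal sketch, reusing f and data from the example above (exact numbers will vary by machine):

stats_serial = @timed map(f, data)
stats_thread = @timed ThreadsX.map(f, data)
(stats_serial.bytes, stats_serial.gctime) # allocations and GC time, serial
(stats_thread.bytes, stats_thread.gctime) # typically far larger for the threaded run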

Here’s a more realistic MWE. I also compared it with Distributed’s pmap, which is faster, as expected:

using Distributed
using ThreadsX # needed for ThreadsX.map below
addprocs()

@everywhere using Turing
@everywhere begin

    @model function ExampleModel(x)
        μ ~ Normal(0, 1)
        x .~ Normal(μ, 1)
    end

    function f(x)
        sample(ExampleModel(x), NUTS(), 100)
    end
    f(randn(5)) # warmup on all processes
end

data = [randn(10000) for i in 1:10, j in 1:20]
sample(ExampleModel(data[1, 1]), NUTS(), 100) # warmup with full-size data

t_serial = @elapsed map(f, data) # 15.4 s
t_distributed = @elapsed pmap(f, data) # 4.0 s
t_thread = @elapsed ThreadsX.map(f, data) # hangs after running all the models

A few things:

  1. For benchmarking, you should generally use BenchmarkTools.jl (https://github.com/JuliaCI/BenchmarkTools.jl); see the sketch after this list.
  2. There’s a lot of console output, and the redraw speed of your terminal could be limiting performance.
  3. Your first example is likely saturating your memory bandwidth (mean just streams once over ~160 MB of data), so it’s never going to scale well with cores.
  4. ThreadPools.tmap performs a lot better for me on your tests than ThreadsX.map and completes your 2nd example in 3.1 seconds (usage sketch below).
  5. There are trade-offs to consider when using distributed processes or threads. Distributed processes will perform better for longer computations with GC overhead (like your 2nd example). Threads will perform better for tight loops with no allocation and shared data.
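
A minimal sketch of point 1, assuming the f and data from your first example are still in scope (the $ interpolation keeps BenchmarkTools from measuring global-variable overhead):

using BenchmarkTools
@btime map($f, $data);
@btime ThreadsX.map($f, $data);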
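And a usage sketch for point 4, assuming the f and data from your second example are already defined (ThreadPools.tmap is a drop-in for map here):

using ThreadPools
t_threadpools = @elapsed ThreadPools.tmap(f, data) # finished your 2nd example in 3.1 s for me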