A minimal working example of my problem:
using ThreadsX
using Statistics
data = [randn(1000) for i in 1:100, j in 1:200]
f(x) = mean(x)
# warm up both paths so compilation time isn't included in the timings
map(f, data); ThreadsX.map(f, data)
t_serial = @elapsed map(f, data)
t_thread = @elapsed ThreadsX.map(f, data)
t_serial / t_thread # 0.2139645344930473
The exact ratio varies a bit with the sizes of the data arrays, but the serial version is always faster. My real function f(x) is much more computationally intensive: it fits a Turing model that takes roughly 1 second per call. In both the MWE and my real code, the threaded version does far more allocation and garbage collection, and it seems to get hung up when collecting the results at the end. Any ideas what's going on?
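For what it's worth, here is the sanity check I'm running before trusting the timings: confirming that Julia actually has more than one thread available, and that both code paths are compiled before measuring. The function g below is just a hypothetical CPU-bound stand-in I made up for the real model fit; it is not my actual f:

```julia
using ThreadsX

# ThreadsX can only parallelize if Julia was started with more than one
# thread (e.g. `julia --threads=auto`); with the default of 1 thread,
# ThreadsX.map only adds task-spawning and allocation overhead on top
# of plain map.
@show Threads.nthreads()

data = [randn(1000) for _ in 1:100]

# Hypothetical CPU-bound stand-in for the expensive per-element fit
# (my real f fits a Turing model).
function g(x)
    s = 0.0
    for _ in 1:1_000
        s += sum(abs2, x)
    end
    return s
end

# Sanity check that both paths agree on the result; this also serves
# as a warmup so compilation time isn't included in the timings below.
@assert map(g, data) ≈ ThreadsX.map(g, data)

t_serial = @elapsed map(g, data)
t_thread = @elapsed ThreadsX.map(g, data)
@show t_serial t_thread
```

With this setup I can at least rule out "only one thread" and "measuring compilation" as explanations.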