I’ve been struggling with performance issues in my code for weeks, and I’ve now narrowed the primary problem down to what appears to be a threading bottleneck in the garbage collector. Apologies if this has already been discussed.

The following MWE follows the structure of the real code: it subsets a matrix and performs a calculation on the subset. Note that in the MWE I take the sum, whereas in reality I compute a histogram that is then processed further.

```julia
const THRESHOLD = 0.1

function ensemble_and_threshold!(thresholded::Vector{Bool}, levels::Matrix{Float64}, weights::Vector{Float64})
    thresholded .= transpose(sum(levels .* weights, dims=1) .> THRESHOLD)
end

function inner(weights::Vector{Float64}, levels::Matrix{Float64})::Float64
    thresholded = Vector{Bool}(undef, size(levels, 2))
    ensemble_and_threshold!(thresholded, levels, weights)
    s1 = 0.0
    s2 = 0.0
    np = length(weights)
    for p in 1:np
        m200p = levels[p, :]
        p200p = levels[p, thresholded]
        # In reality, the next step in the calculation is significantly more involved
        # than suggested here.
        s1 += sum(m200p)
        s2 += sum(p200p)
    end
    return s2 / s1
end

function outer(count::Int, levels::Matrix{Float64})::Float64
    best = Inf
    for _ in 1:count
        weights = rand(4)
        r = inner(weights, levels)
        best = min(best, r)
    end
    return best
end

function main(n_runs)
    N = 100000
    np = 4
    levels = rand(np, N)
    results = zeros(n_runs)
    Threads.@threads for i in 1:n_runs
        results[i] = outer(10000, levels)
    end
    # Use results so that it is not optimised out
    println(devnull, results)
end

main(1)  # Run once for compilation
for n_runs in 1:4
    @time main(n_runs)
end
```

I need to do this ~1000 times, and each iteration is independent of the others, so it *should* be embarrassingly parallel. However, some bottleneck prevents it from scaling beyond about 4–8 processors. This is the output from the MWE:

```
➜ julia --threads=4 micro.jl
20.640172 seconds (260.03 k allocations: 97.917 GiB, 14.99% gc time)
27.225064 seconds (520.03 k allocations: 195.826 GiB, 25.38% gc time)
31.511773 seconds (780.04 k allocations: 293.733 GiB, 30.25% gc time)
36.147297 seconds (1.04 M allocations: 391.653 GiB, 35.14% gc time)
```

This was run on an M1 Mac with 4 performance cores; I see the same behaviour with Julia on Linux. There is a very slight improvement with 1.10.0-rc2.

I would expect each run to take the same time, but the runs are getting progressively longer, nearly twice as long with 4 cores, and the share of time spent in the garbage collector more than doubles. This suggests that the GC is, at least partially, single-threaded.
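One thing I plan to try, on 1.10 where the GC gained parallel marking, is raising the number of GC threads via the `--gcthreads` flag, though as I understand it this parallelises marking rather than making collection fully concurrent:

```
julia --threads=4 --gcthreads=4 micro.jl
```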

This problem should be *far* more scalable than it is. What can I do to improve it?
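For context on where the garbage comes from: I believe it is mostly the per-iteration slices in `inner` (`levels[p, :]` and `levels[p, thresholded]` each allocate a copy) plus the temporaries in `ensemble_and_threshold!`. A view-based variant of `inner`, roughly like the sketch below (untested against the real code, and assuming the same `THRESHOLD` constant and matrix layout as the MWE), should remove most of those allocations, but I would still like to understand the GC scaling:

```julia
# Sketch only: a lower-allocation inner(), not the real calculation.
function inner_views(weights::Vector{Float64}, levels::Matrix{Float64})::Float64
    # transpose(levels) * weights computes the ensemble sums directly,
    # without materialising the intermediate levels .* weights matrix.
    thresholded = transpose(levels) * weights .> THRESHOLD  # one N-element allocation
    s1 = 0.0
    s2 = 0.0
    for p in eachindex(weights)
        row = @view levels[p, :]  # no copy, unlike levels[p, :]
        s1 += sum(row)
        # Sum over the selected columns without allocating levels[p, thresholded].
        s2 += sum((row[i] for i in eachindex(row) if thresholded[i]); init=0.0)
    end
    return s2 / s1
end
```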