# Multithreading an embarrassingly parallel algorithm increases garbage collection

Hi,
Hope this MWE is clear enough.
I’m trying to implement an algorithm in which I fill a matrix of known dimensions (about 5000x6000) column by column. The calculation of each column is completely independent. In multi-threaded mode I run 24 threads on 12 physical cores.
When I switch from a plain loop to a multi-threaded loop I go from about 11 minutes to 2.5 minutes. A simple calculation suggests I should get somewhere around 1 minute (dividing the single-threaded time by the number of threads gives about 30 s, but I’m allowing some margin).

Indeed, looking at the memory consumption, I can observe a sawtooth pattern in the signal USED_RAM(time), which corresponds to the garbage collector. A look at some benchmarks confirmed that I spend around 60% of the time in gc().

The MWE below is an attempt at some tests which show this behaviour. One could replace the sin() calculation with rand().

```julia
n=5000
m=6000

function generate!(a, α)
    for i in 1:n
        a[i] = sin(α*i)
    end
end

function generate(α)
    a = fill(0f0, n)
    for i in 1:n
        a[i] = sin(α*i)
    end
    return a
end

function test_1(m=40)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:,i]
        generate!(x, i)
    end
    return A
end

function test_2(m=40)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:,i]
        x .= generate(i)
    end
    return A
end

function test_3(m=40)
    A = Array{Float32,2}(undef, n, 0)
    for i in 1:m
        A = hcat(A, generate(i))
    end
    return A
end

function test_4(m=40)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:,i]
        generate!(x, i)
    end
    return A
end

function test_5(m=40)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:,i]
        x .= generate(i)
    end
    return A
end

function test_6(m=40)
    A = Array{Float32,2}(undef, n, 0)
    Threads.@threads for i in 1:m
        A = hcat(A, generate(i))
    end
    return A
end

# Warm-up calls so compilation times don't spoil the measurements
test_1(2); test_2(2); test_3(2); test_4(2); test_5(2); test_6(2);

print("test_1")
@time(test_1(m));
print("test_2")
@time(test_2(m));
#print("test_3") # Commented out test_3 because it's so slow; I ran it with three m values to get the trend.
print("test_4")
@time(test_4(m));
print("test_5")
@time(test_5(m));
print("test_6")
@time(test_6(m)); # About 6.34 s, 73% gc and wrong number of columns
```

The tests are paired modulo 3: test_4 is the multi-threaded version of test_1, test_5 of test_2, and test_6 of test_3.

The case I’m most interested in is test_4 since this is what I’m currently using.

I’m not particularly well versed in multi-threading, but I think `A` is in shared mutable state, which is a problem if it locks writing while another thread writes. I don’t know how to make a matrix writable by multiple threads in guaranteed different parts of it, so if someone knows, that is part 2 of the question.
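What I’m currently doing amounts to the sketch below (a condensed stand-in for my real code; `fill_columns!` is just a name I made up here). What I don’t know is whether writing through views into different columns like this is guaranteed safe:

```julia
# Sketch: threads write to disjoint columns of one shared matrix.
# Each iteration has a unique column index j, so no two threads
# ever write to the same element.
function fill_columns!(A::Matrix{Float32})
    n, m = size(A)
    Threads.@threads for j in 1:m
        col = @view A[:, j]        # no copy; writes land directly in A
        for i in 1:n
            col[i] = sin(Float32(j) * i)
        end
    end
    return A
end
```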

To summarize:

1. Why do I get so much garbage collection, and can I avoid it in any way? For instance, what if instead of launching a threaded for loop I split the columns between threads and, in each thread, ran a for loop over its associated columns, reusing the memory?
2. Is it possible to write to a matrix simultaneously in different columns? (This is mostly useless, I won’t get much performance out of it, but it’s something I can’t figure out.)
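To illustrate what I mean in point 1, here is a sketch (my own naming, not my real algorithm): split the columns into one chunk per thread, and let each task reuse a single buffer for all of its columns instead of allocating per column.

```julia
# Sketch of question 1: one task per chunk of columns, each reusing
# its own buffer instead of allocating fresh memory per column.
function fill_chunked!(A::Matrix{Float32})
    n, m = size(A)
    nt = Threads.nthreads()
    chunks = collect(Iterators.partition(1:m, cld(m, nt)))
    tasks = map(chunks) do cols
        Threads.@spawn begin
            buf = Vector{Float32}(undef, n)   # one buffer per task
            for j in cols
                for i in 1:n
                    buf[i] = sin(Float32(j) * i)
                end
                A[:, j] .= buf                # copy into the shared matrix
            end
        end
    end
    foreach(wait, tasks)
    return A
end
```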

Best regards,
Andre

I did not look too closely, but right off the bat you need to get rid of the global variables, like `n`. A key to getting good parallel performance is getting good serial performance first, in particular getting rid of allocations.

Check out the performance tips, where getting rid of globals is actually the first tip(!)

The global variables are only here for the MWE. This is very much not my real algorithm, this is just an MWE. I could package the variables in functions etc… but this would just obfuscate the problem and not change anything.

No, getting rid of the globals is completely necessary for the MWE to have any value. Otherwise it will obscure the other issues. It is important that your MWE actually illustrates the problem.

If there are no globals in the real algorithm, there shouldn’t be in the MWE either.

You’re not the first person to say “It’s just an MWE”, but I think that’s a misunderstanding. Making a good MWE is super important, and it can be really hard, and a lot of work. It means understanding your problem thoroughly, peeling away the unnecessary parts, and keeping all the important ones.

Making a good MWE often solves the problem itself, before you even get to the point of asking the question.


It would be OK for the MWE if the global was `const`, but otherwise, that will be the performance problem.
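For the MWE that would just mean changing the two definitions at the top (the rest of the code stays the same):

```julia
# A const global has a fixed type, so functions that read it can still
# be compiled to type-stable code:
const n = 5000
const m = 6000
```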

Very well, here is a version which does not use any global variables at all.

```julia
@assert Threads.nthreads() > 1

function generate!(a, α, n)
    for i in 1:n
        a[i] = sin(α*i)
    end
end

function generate(α, n)
    a = fill(0f0, n)
    for i in 1:n
        a[i] = sin(α*i)
    end
    return a
end

function test_1(m, n)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:,i]
        generate!(x, i, n)
    end
    return A
end

function test_2(m, n)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:,i]
        x .= generate(i, n)
    end
    return A
end

function test_3(m, n)
    A = Array{Float32,2}(undef, n, 0)
    for i in 1:m
        A = hcat(A, generate(i, n))
    end
    return A
end

function test_4(m, n)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:,i]
        generate!(x, i, n)
    end
    return A
end

function test_5(m, n)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:,i]
        x .= generate(i, n)
    end
    return A
end

function test_6(m, n)
    A = Array{Float32,2}(undef, n, 0)
    Threads.@threads for i in 1:m
        A = hcat(A, generate(i, n))
    end
    return A
end

function test(m, n)
    # Warm-up calls to avoid compilation times spoiling the measurements
    test_1(2,2); test_2(2,2); test_3(2,2); test_4(2,2); test_5(2,2); test_6(2,2);

    print("test_1")
    @time(test_1(m, n));
    print("test_2")
    @time(test_2(m, n));
    #print("test_3") # Commented out test_3 because it's so slow; I ran it with three m values to get the trend.
    #@time(test_3(1000));
    #@time(test_3(2000));
    #@time(test_3(3000));
    @time(test_4(m, n));
    @time(test_5(m, n));
    @time(test_6(m, n));
end

test(20000, 5000)
```

Please note that I’m using a different computer right now, so the times are different but the 5x jump in gc() time remains.

I reckon the GC is due to calling these functions back to back. I’m not sure if you are familiar with the package BenchmarkTools, but it provides the macro `@benchmark`, which calls your function multiple times in order to reduce the effects of the environment around it.
This is the output of calling `@benchmark test_4(20000,5000)`:

```
memory estimate:  381.47 MiB
allocs estimate:  24
--------------
minimum time:     1.038 s (0.07% GC)
median time:      1.087 s (0.00% GC)
mean time:        1.093 s (0.37% GC)
maximum time:     1.146 s (0.00% GC)
--------------
samples:          5
evals/sample:     1
```

I am running this on julia 1.6.0-rc1 on a quad core laptop.

Are you using multi-threading? As in, starting Julia with the correct command-line argument or environment variable.
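For example, wanting 24 threads, that would be either of these invocations:

```shell
# Set the environment variable before launching Julia...
JULIA_NUM_THREADS=24 julia
# ...or (Julia 1.5+) pass it on the command line:
julia --threads 24
```

and then `Threads.nthreads()` inside the session should report 24.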
Because when I run the @benchmark macro, I get

```
julia> @benchmark test_1(20000,5000)
BenchmarkTools.Trial:
memory estimate:  381.47 MiB
allocs estimate:  2
--------------
minimum time:     2.181 s (0.24% GC)
median time:      2.185 s (0.24% GC)
mean time:        2.191 s (0.23% GC)
maximum time:     2.206 s (0.24% GC)
--------------
samples:          3
evals/sample:     1
```

and

```
julia> @benchmark test_4(20000,5000)
BenchmarkTools.Trial:
memory estimate:  381.48 MiB
allocs estimate:  66
--------------
minimum time:     371.378 ms (1.18% GC)
median time:      436.371 ms (4.67% GC)
mean time:        497.245 ms (8.87% GC)
maximum time:     957.140 ms (0.54% GC)
--------------
samples:          11
evals/sample:     1
```

Yeah, I forgot to include the result of calling a single-threaded version:

```
BenchmarkTools.Trial:
memory estimate:  381.47 MiB
allocs estimate:  2
--------------
minimum time:     3.537 s (0.01% GC)
median time:      3.559 s (0.01% GC)
mean time:        3.559 s (0.01% GC)
maximum time:     3.581 s (0.02% GC)
--------------
samples:          2
evals/sample:     1
```

The GC time does increase a bit, but it feels like threading overhead rather than a GC problem.

This is very confusing. I can’t quite construct an MWE where the effect is the same on all PCs.
To be clear, with my real algorithm I get around 60% GC time multi-threaded and 10% single-threaded on my work PC (Intel Xeon). The numbers are a bit lower on my home PC with a Ryzen 3600X, but the ratio remains the same.

For the MWE the ratio also remains high, but the base numbers are clearly very different on different PCs.

I had the hypothesis that in multi-threaded mode each thread makes an entire copy of the matrix A to write to, and the matrices are then merged. But that makes no sense, since it would just move the merge conflict further down the line, and it doesn’t account for the difference between PCs.

I am a bit lost in all this. I’ll try to make an MWE with fewer tests and try to understand where the difference comes from.

Currently I cannot reproduce the problem here (4 threads):

```
julia> test(20000,5000);
test_1  2.970390 seconds (2 allocations: 381.470 MiB)
test_2  3.104447 seconds (40.00 k allocations: 765.076 MiB, 0.86% gc time)
test_4 (threaded)  0.897082 seconds (26 allocations: 381.473 MiB)
test_5 (threaded)  0.976054 seconds (40.03 k allocations: 765.079 MiB, 3.12% gc time)

julia> test(20000,5000);
test_1  2.971575 seconds (2 allocations: 381.470 MiB)
test_2  3.195127 seconds (40.00 k allocations: 765.076 MiB, 4.01% gc time)
test_4 (threaded)  0.901151 seconds (25 allocations: 381.473 MiB)
test_5 (threaded)  1.006308 seconds (40.03 k allocations: 765.079 MiB, 6.08% gc time)

julia> test(20000,5000);
test_1  3.015130 seconds (2 allocations: 381.470 MiB, 1.67% gc time)
test_2  3.071584 seconds (40.00 k allocations: 765.076 MiB, 0.49% gc time)
test_4 (threaded)  0.918677 seconds (26 allocations: 381.473 MiB, 1.86% gc time)
test_5 (threaded)  0.959630 seconds (40.03 k allocations: 765.079 MiB, 1.45% gc time)

```

I didn’t run tests 3 and 6 because they were taking too long. Note, though, that those allocate within the loop, so if the matrix is to remain contiguous in memory I have no idea what the threads will do.

One test I would do is to allocate the matrix outside the functions, so the functions are allocation-free; I would then expect no garbage collection at all.
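A sketch of that test (my naming, `test_4!` following the bang convention; the threaded test_4 rewritten to take a preallocated matrix):

```julia
function generate!(a, α, n)
    for i in 1:n
        a[i] = sin(α * i)
    end
end

# The caller owns A and creates it once, so the hot function
# allocates essentially nothing and there is nothing for the GC to do.
function test_4!(A, m, n)
    Threads.@threads for i in 1:m
        generate!(@view(A[:, i]), i, n)
    end
    return A
end
```

called as `test_4!(fill(0f0, n, m), m, n)`, with `A` reused across timing runs.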

Edit: preallocating the output matrix I get:

```
julia> test(20000,5000);
test_1  2.866260 seconds
test_2  3.001166 seconds (40.00 k allocations: 383.606 MiB, 0.74% gc time)
test_4 (threaded)  0.815466 seconds (24 allocations: 3.094 KiB)
test_5 (threaded)  0.837658 seconds (40.02 k allocations: 383.609 MiB)

```

Tests 2 and 5 are of course still allocating inside `generate`, so some GC is expected there. But at least here the rest seems fine.


If I may suggest something: 10% GC is too much anyway. I had some cases where that kind of GC, combined with multi-threading, filled the memory completely and ended in crashes. To solve that I had to set multi-threading aside for a while and figure out where the allocations in my program were occurring. I finally found some non-trivial (for me) type instabilities, solved them, and after that everything went smoothly. I have some simple notes on how to track allocations (I prefer the “manual” method in general). And one excellent tip I recently got is to check for type instabilities of inner functions using

```julia
@show Main.@code_warntype f(x)
```

inside the outer functions of the code. Something like this:

```julia
function test_5(m,n,A)