Hi,
Hope this MWE is clear enough.
I’m trying to implement an algorithm that fills a matrix of known dimensions (about 5000×6000) column by column. The calculation of each column is completely independent. In multi-threaded mode I run 24 threads on 12 physical cores.
Each column takes about 100 ms.
When I switch from a plain loop to a multi-threaded loop, the runtime drops from about 11 minutes to 2.5 minutes. A back-of-the-envelope calculation says I should get somewhere around 1 minute: multiplying the time per column by the number of columns gives roughly 10 minutes serial, so split across the cores I should land near 30 s, and I’m giving it some margin.
Indeed, looking at the memory consumption I can observe triangular behavior in the USED_RAM(time) signal, which corresponds to the garbage collector. Benchmarking confirms that I spend around 60% of the time in gc().
The MWE below is an attempt to provide some tests that show this behaviour. One could replace the sin() calculation with rand().
n = 5000
m = 6000

function generate!(a, α)
    for i = 1:n
        a[i] = sin(α * i)
    end
end

function generate(α)
    a = fill(0f0, n)
    for i = 1:n
        a[i] = sin(α * i)
    end
    return a
end

function test_1(m=40)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:, i]
        generate!(x, i)
    end
    return A
end

function test_2(m=40)
    A = fill(0f0, n, m)
    for i in 1:m
        x = @view A[:, i]
        x .= generate(i)
    end
    return A
end

function test_3(m=40)
    A = Array{Float32,2}(undef, n, 0)
    for i in 1:m
        A = hcat(A, generate(i))
    end
    return A
end

function test_4(m=40)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:, i]
        generate!(x, i)
    end
    return A
end

function test_5(m=40)
    A = fill(0f0, n, m)
    Threads.@threads for i in 1:m
        x = @view A[:, i]
        x .= generate(i)
    end
    return A
end

function test_6(m=40)
    A = Array{Float32,2}(undef, n, 0)
    Threads.@threads for i in 1:m
        A = hcat(A, generate(i))
    end
    return A
end
test_1(2);
test_2(2);
test_3(2);
test_4(2);
test_5(2);
test_6(2);
print("test_1")
@time(test_1(m)); #About 11sec, 7% gc
print("test_2")
@time(test_2(m)); #About 12sec, 5% gc
#print("test_3") # test_3 is commented out because it's so slow; I ran three m values instead to show the trend.
#@time(test_3(1000)); #About 4.7sec, 8% gc
#@time(test_3(2000)); #About 26sec, 8% gc
#@time(test_3(3000)); #About 78sec, 8% gc
print("test_4 (threaded)")
@time(test_4(m)); #About 3.4sec, 65% gc
print("test_5 (threaded)")
@time(test_5(m)); #About 3.6sec, 72% gc
print("test_6 (threaded)")
@time(test_6(m)); #About 6.34sec, 73% gc and wrong number of columns
The tests pair up modulo 3: test_4 is the threaded version of test_1, test_5 of test_2, and test_6 of test_3.
The case I’m most interested in is test_4, since that is what I’m currently using.
I’m not particularly well versed in multi-threading, but I think A is in a shared mutable state, which is annoying since it probably locks writing while another thread writes. I don’t know how to make a matrix writable by multiple threads in guaranteed different parts of it, so if someone knows, this is part 2 of the question.
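As an aside, I believe test_6's wrong column count comes from every task reassigning the shared binding A at once. A race-free variant (a sketch, with a hypothetical name test_6_fixed and a small n) would have each task write only its own slot and concatenate once at the end:

```julia
n = 100  # small n for the sketch; the post uses n = 5000

function generate(α)          # same as in the MWE above
    a = fill(0f0, n)
    for i = 1:n
        a[i] = sin(α * i)
    end
    return a
end

function test_6_fixed(m=40)
    cols = Vector{Vector{Float32}}(undef, m)
    Threads.@threads for i in 1:m
        cols[i] = generate(i)       # each task writes only its own slot
    end
    return reduce(hcat, cols)       # single serial concatenation
end
```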
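(For what it's worth, my understanding is that a Julia Array does no locking at all, so concurrent writes to disjoint elements, e.g. different columns through views, should need no synchronization. A minimal sketch of that idea, with made-up fill values:)

```julia
n, m = 4, 6
A = zeros(Float32, n, m)

# Each spawned task owns one column view; no two tasks touch the
# same elements, so no lock is needed.
@sync for i in 1:m
    Threads.@spawn begin
        col = @view A[:, i]
        for j in 1:n
            col[j] = Float32(10i + j)   # arbitrary per-element value
        end
    end
end
```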
To summarize:
- Why do I get so much garbage collection and can I avoid it in any way? For instance what if instead of launching a threaded for loop I separated the columns between threads and in each thread ran a for loop over the associated columns to reuse the memory?
- Is it possible to write to a matrix simultaneously in different columns (this is mostly useless, I won’t get much performance out of it but it’s something I can’t figure out).
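The chunking idea from the first bullet could look something like the sketch below (fill_chunked! and the round-robin split are my own names and choices, not from any library): split the columns into one range per thread and run a plain serial loop inside each task.

```julia
# Sketch: thread t owns columns t, t+nt, t+2nt, ..., and fills each
# one in place through a view, so the hot loop allocates nothing.
function fill_chunked!(A::Matrix{Float32}, f!)
    m = size(A, 2)
    nt = Threads.nthreads()
    @sync for t in 1:nt
        Threads.@spawn for i in t:nt:m
            f!(@view(A[:, i]), i)   # write column i in place
        end
    end
    return A
end
```

For example, with a column filler analogous to generate! above: fill_chunked!(zeros(Float32, 5, 8), (a, α) -> a .= sin.(α .* eachindex(a))).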
Best regards,
Andre