Hi there,
I am trying to understand why the number of memory allocations increases with multi threading.
I have the following code computing the difference of a vector (it is basically the diff function)
```julia
function gr_loop(x)
    for i in 1:length(x)-1
        x[i] = x[i+1] - x[i]
    end
    return x
end

function gr_loop_multi(x)
    Threads.@threads for i in 1:length(x)-1
        x[i] = x[i+1] - x[i]
    end
    return x
end

x = rand(1000000);

using BenchmarkTools
println("Testing simple loop version")
@btime gr_loop(x);
println("Testing multithread loop version")
@btime gr_loop_multi(x);
```
The output is
```
Testing simple loop version
178.019 μs (0 allocations: 0 bytes)
Testing multithread loop version
52.634 μs (46 allocations: 3.97 KiB)
```
My first question is: why does the number of allocations increase in the multithreaded version?
What is more puzzling, though, is that if I increase the size of x from 1e6 to 1e9, the multithreaded version becomes as slow as the simple loop.
This is the output with 1e+9 elements in x
```
Testing simple loop version
636.210 ms (0 allocations: 0 bytes)
Testing multithread loop version
650.529 ms (50 allocations: 4.09 KiB)
```
This is the opposite of what one would expect from multithreading (i.e., more gain with larger vectors).
Any idea why this is happening? Also, I am running the code on a MacBook Pro with 16 cores, but in these cases I am using only 8 threads.
Thanks!
Side note: this code suffers from a race condition that can lead to correctness issues. You're spawning threads to independently update the values in the array, but execution order matters: has x[i+1] been updated yet? Parallelizing the loop gives you no guarantee that iterations execute in order.
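One way to remove that race is to write the differences into a separate output vector, so no iteration reads a slot that another iteration writes; a sketch (the name `gr_diff_multi` is made up here, not from the original code):

```julia
# Race-free parallel diff: iteration i only reads x[i] and x[i+1]
# and only writes y[i], so iterations are fully independent.
function gr_diff_multi(x)
    y = similar(x, length(x) - 1)
    Threads.@threads for i in 1:length(x)-1
        y[i] = x[i+1] - x[i]
    end
    return y
end
```

This costs one extra allocation for `y`, but it matches what the serial `diff` computes regardless of the order in which threads run.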