Multithreading increases memory allocations

Hi there,
I am trying to understand why the number of memory allocations increases with multithreading.
I have the following code computing successive differences of a vector (it is basically the `diff` function):
```julia
function gr_loop(x)
    for i in 1:length(x) - 1
        x[i] = x[i + 1] - x[i]
    end
    return x
end

function gr_loop_multi(x)
    Threads.@threads for i in 1:length(x) - 1
        x[i] = x[i + 1] - x[i]
    end
    return x
end

x = rand(1_000_000);
using BenchmarkTools
println("Testing simple loop version")
@btime gr_loop(x);
println("Testing multithread loop version")
@btime gr_loop_multi(x);
```
The output is:
```
Testing simple loop version
178.019 μs (0 allocations: 0 bytes)
Testing multithread loop version
52.634 μs (46 allocations: 3.97 KiB)
```
The first question, then, is: why does the number of allocations increase in the multithreaded version?
What is more puzzling, however, is that if I increase the size of x from 1e6 to 1e9 elements, the time of the multithreaded version becomes equal to that of the simple loop.
This is the output with 1e9 elements in x:
```
Testing simple loop version
636.210 ms (0 allocations: 0 bytes)
Testing multithread loop version
650.529 ms (50 allocations: 4.09 KiB)
```
This is the opposite of what one would expect from multithreading (i.e., more gain with larger vectors).
Any idea why this is happening? Also, I am running the code on a MacBook Pro with 16 cores, but using only 8 threads in these cases.
Thanks!

Multithreading causes allocations.

julia> @btime fetch(Threads.@spawn 1 + 1)
  976.800 ns (5 allocations: 534 bytes)
2
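Those allocations are the fixed cost of creating and scheduling the tasks themselves, not per-element work, which is why the count stays around 40–50 regardless of the array size. A minimal sketch that isolates this overhead (the function name `empty_threaded_loop` is just illustrative):

```julia
using Base.Threads

# An empty threaded loop still allocates: `@threads` spawns roughly one
# Task per thread plus scheduling bookkeeping, independent of `n`.
function empty_threaded_loop(n)
    @threads for i in 1:n
        # no per-element work, no per-element allocations
    end
end

empty_threaded_loop(10)  # warm up / compile first
overhead = @allocated empty_threaded_loop(100_000_000)
```

The byte count reported by `@allocated` should stay essentially constant as `n` grows, matching the fixed ~4 KiB you see in the benchmarks above.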

I cannot replicate your observation (Julia 1.9.2 on an M1 MacBook Pro; 8 threads):

julia> println("Testing simple loop version")
       @btime gr_loop($x);
       println("Testing multithread loop version")
       @btime gr_loop_multi($x);
Testing simple loop version
  210.584 μs (0 allocations: 0 bytes)
Testing multithread loop version
  83.875 μs (42 allocations: 4.34 KiB)

julia> x = rand(1000000 * 10);

julia> println("Testing simple loop version")
       @btime gr_loop($x);
       println("Testing multithread loop version")
       @btime gr_loop_multi($x);
Testing simple loop version
  2.160 ms (0 allocations: 0 bytes)
Testing multithread loop version
  903.041 μs (44 allocations: 4.41 KiB)

julia> x = rand(1_000_000 * 100);

julia> println("Testing simple loop version")
       @btime gr_loop($x);
       println("Testing multithread loop version")
       @btime gr_loop_multi($x);
Testing simple loop version
  21.734 ms (0 allocations: 0 bytes)
Testing multithread loop version
  9.626 ms (46 allocations: 4.47 KiB)

julia> x = rand(1_000_000 * 1000);

julia> println("Testing simple loop version")
       @btime gr_loop($x);
       println("Testing multithread loop version")
       @btime gr_loop_multi($x);
Testing simple loop version
  216.451 ms (0 allocations: 0 bytes)
Testing multithread loop version
  93.721 ms (49 allocations: 4.56 KiB)

Side note: this code has a race condition that can lead to incorrect results. You’re spawning threads to independently update the values in the array, but execution order matters: iteration `i` must read `x[i + 1]` before iteration `i + 1` overwrites it. By parallelizing the loop you get no guarantee of that ordering at the boundaries between the chunks each thread processes.
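One way to sidestep the race, assuming an extra output buffer is acceptable, is to write the differences into a fresh array so that every iteration only reads the never-mutated input. Note that this returns `length(x) - 1` elements (like `diff`) instead of mutating `x` in place; `gr_loop_multi_safe` is just an illustrative name:

```julia
using Base.Threads

# Race-free variant: each iteration reads only the untouched input `x`
# and writes to a distinct slot of `out`, so iteration order is irrelevant.
function gr_loop_multi_safe(x)
    out = similar(x, length(x) - 1)
    @threads for i in 1:length(x) - 1
        out[i] = x[i + 1] - x[i]
    end
    return out
end

gr_loop_multi_safe([1.0, 3.0, 6.0, 10.0])  # == [2.0, 3.0, 4.0]
```

The extra allocation of `out` is a one-time O(n) cost, and the result now agrees with `diff(x)` regardless of how the iterations are scheduled.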
