I’d like to understand how the performance of large for loops in Julia works. Running the same function on one core and on multiple cores does not give me similar timings.
The computer has 48 physical cores and 400 GB of memory, so resource limits should not be a problem.
Here is the example I have. It defines the arrays and the multiplication functions on 8 cores.
using Distributed
@everywhere M = randn(1000, 1000)
@everywhere V = randn(1000, 500)
@everywhere mult3!(D, A, B, C) = D = A * B * C
@everywhere function test1(M, V)
    N1, N2 = size(V)
    Mtime = zeros(N2, N2, 1000)
    for i in 1:1000
        Mtime[:, :, i] = V' * M * V
    end
    return Mtime
end
I also have a function that gives the same result but uses a multiply function instead of *. This should need fewer allocations when I use it in a larger setup.
@everywhere function test2(M, V)
    N1, N2 = size(V)
    Mtime = zeros(N2, N2, 1000)
    D = similar(Mtime[:, :, 1])
    for i in 1:1000
        Mtime[:, :, i] = mult3!(D, V', M, V)
    end
    return Mtime
end
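One caveat about the expected allocation savings: as written, mult3! assigns the product to its local name D, so the preallocated buffer is never written to and A * B * C still allocates on every call. A possible in-place variant, as a minimal sketch only: the names mult3_inplace!, test3, the scratch buffer T and the nsteps keyword are hypothetical, the sizes are reduced so it runs quickly, and it relies on mul! from LinearAlgebra.

```julia
using LinearAlgebra

# Hypothetical in-place helper: writes A * B * C into D.
# T is a caller-provided scratch buffer that holds A * B.
function mult3_inplace!(D, T, A, B, C)
    mul!(T, A, B)   # T = A * B, written into the existing buffer
    mul!(D, T, C)   # D = T * C, written into the existing buffer
    return D
end

# Hypothetical variant of test2 that reuses D and T across iterations.
function test3(M, V; nsteps = 10)
    N1, N2 = size(V)
    Mtime = zeros(N2, N2, nsteps)
    D = zeros(N2, N2)   # result buffer
    T = zeros(N2, N1)   # scratch buffer for V' * M
    for i in 1:nsteps
        mult3_inplace!(D, T, V', M, V)
        Mtime[:, :, i] .= D   # copy D into the i-th slice
    end
    return Mtime
end

# Smaller sizes than in the question, just to keep the sketch fast.
M = randn(100, 100)
V = randn(100, 50)
R = test3(M, V)
```

Comparing @time test3(M, V) with @time test2(M, V) at the same sizes should show whether the buffers actually reduce allocations in this setup.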
When I run the functions on one core:
@time test1(M, V);
8.305153 seconds (4.01 k allocations: 7.451 GiB, 2.96% gc time)
@time test2(M, V); # faster, but more allocations
7.343611 seconds (5.01 k allocations: 7.454 GiB, 1.93% gc time)
I hope these single-core results are correct. However, when I parallelize and run the same functions on 8 cores with pmap, the execution takes much longer.
@time pmap(it -> test1(M, V), 1:8);
43.275477 seconds (210.83 k allocations: 14.910 GiB, 0.73% gc time)
@time pmap(it -> test2(M, V), 1:8);
43.466093 seconds (212.28 k allocations: 14.918 GiB, 0.69% gc time)
What I see is that for the single-core run CPU use is at 800%, but when I parallelize I see 8 processes at 100% each. My expectation is that the same task on different cores should take a similar time as on one core. I understand this is not going to scale linearly, but it should not be about 5 to 6 times slower on eight cores than on one, should it?
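One thing that may explain the 800% versus 8 times 100% observation is the BLAS thread count. The main process normally runs matrix products on several BLAS threads, while worker processes added through Distributed run BLAS single-threaded by default (the enable_threaded_blas option of addprocs defaults to false). A small check, as a sketch that assumes two workers purely for illustration:

```julia
using Distributed, LinearAlgebra

addprocs(2)                      # hypothetical worker count, kept small for the check
@everywhere using LinearAlgebra  # make BLAS available on every process

# Number of BLAS threads used by the main process
master_threads = BLAS.get_num_threads()

# Number of BLAS threads used by each worker process
worker_threads = [remotecall_fetch(() -> BLAS.get_num_threads(), w) for w in workers()]

println("main process: ", master_threads, " BLAS thread(s)")
println("workers:      ", worker_threads)

rmprocs(workers())               # clean up the workers started above
```

If the workers report 1, each pmap task runs its matrix products on a single thread, whereas the single-process timing used all BLAS threads, so the two timings would not be measuring the same per-core work.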
Also, I don’t see much benefit from using mul!
Is there anything I am missing? Thank you for helping.