For loop in function and multiplication of larger matrices, slow speed in parallel


I’d like to understand how the performance of large for loops works in Julia. Running the same function on one core and on multiple cores does not give me similar timings.

The computer has 48 physical cores and 400 GB of memory, so resource limits should not be a problem.

Here is my example, which defines the arrays and the multiplication function on 8 cores.

using Distributed
@everywhere M = randn(1000, 1000)
@everywhere V = randn(1000, 500)
@everywhere mult3!(D, A, B, C) = D = A * B * C

@everywhere function test1(M, V)
	N1, N2 = size(V)
	Mtime = zeros(N2, N2, 1000)

	for i in 1:1000
	    Mtime[:, :, i] = V' * M * V
	end
	return Mtime
end

I also have a function that gives the same result but uses a multiply function instead of *. This should need fewer allocations when I use it in a larger setup.

@everywhere function test2(M, V)
	N1, N2 = size(V)
	Mtime = zeros(N2, N2, 1000)
	D = similar(Mtime[:, :, 1]);

	for i in 1:1000
	    Mtime[:, :, i] = mult3!(D, V', M, V)
	end
	return Mtime
end

When I run the functions on one core:

@time test1(M, V);
8.305153 seconds (4.01 k allocations: 7.451 GiB, 2.96% gc time)
@time test2(M, V); # faster, but more allocations
7.343611 seconds (5.01 k allocations: 7.454 GiB, 1.93% gc time)

I hope these single-core results are correct. However, when I parallelize and run the same functions on 8 cores with pmap, execution takes longer.

@time pmap(it -> test1(M, V), 1:8);
43.275477 seconds (210.83 k allocations: 14.910 GiB, 0.73% gc time)
@time pmap(it -> test2(M, V), 1:8);
43.466093 seconds (212.28 k allocations: 14.918 GiB, 0.69% gc time)

What I see is that for the single-core run CPU use is at 800%, but when I parallelize I see 8 processes at 100% each. My expectation is that the same task on different cores should take about as long as on one core. I understand that this will not scale linearly, but it should not be more than 5x slower on eight cores than on one, should it?

Also, I don’t see much benefit from using the mul! function.

Is there anything I am missing? Thank you for helping.
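
Regarding the 800% versus 8 × 100% CPU observation: one thing that may be worth checking is how many BLAS threads the main process and the workers use, since matrix products like V' * M * V are handled by BLAS. A minimal sketch, assuming the workers are started with addprocs (the post does not show this step); as far as I know, addprocs has an enable_threaded_blas keyword that defaults to false.

```julia
using Distributed, LinearAlgebra

# Assumption: workers are local processes added with addprocs
# (the original post does not show how its workers were created).
addprocs(2)
@everywhere using LinearAlgebra

# Number of BLAS threads used for matrix products on the main process
main_threads = BLAS.get_num_threads()

# Number of BLAS threads on each worker process
worker_threads = [remotecall_fetch(() -> BLAS.get_num_threads(), w) for w in workers()]

println("main process: ", main_threads, " BLAS thread(s)")
println("workers:      ", worker_threads)
```

If the workers report a single thread each while the main process reports several, each pmap task is doing the same matrix products with fewer threads, which could account for much of the gap.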

You can try rewriting it in terms of mul!, or use TensorOperations.jl, which is probably efficient for this kind of task.
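
A minimal sketch of what a mul!-based version might look like, assuming the goal is to fill Mtime without allocating temporaries inside the loop. The function name test3, the buffer tmp, and the nsteps keyword are made up for illustration and are not from the original post.

```julia
using LinearAlgebra

# Hypothetical in-place variant of test1/test2 (the name `test3` is illustrative).
function test3(M, V; nsteps = 1000)
    N1, N2 = size(V)
    Mtime = zeros(N2, N2, nsteps)
    tmp = similar(V)    # N1 × N2 buffer for M * V, allocated once outside the loop

    for i in 1:nsteps
        mul!(tmp, M, V)                       # tmp = M * V, written into the buffer
        mul!(view(Mtime, :, :, i), V', tmp)   # V' * tmp, written directly into slice i
    end
    return Mtime
end
```

As in the original functions, every slice holds the same product because M and V do not change inside the loop. A call such as `test3(randn(6, 6), randn(6, 3); nsteps = 2)` should return a 3 × 3 × 2 array.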

Ok, thank you for the reply! I will try that.

Additionally, let me ask about the pmap timings, which I don’t understand. Why are the times this long and the allocations this many?

pmap needs to send the data between processes. Further, everything you do allocates memory:

this is not in-place

this is not

each invocation of the mapped function returns a very large array
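
To put a number on the last point, here is a rough back-of-the-envelope sketch of how much data each call returns, assuming the 500 × 500 × 1000 Float64 result from the post. The helper name result_gib is made up for illustration.

```julia
# Hypothetical helper: size in GiB of an n2 × n2 × nsteps array of Float64.
result_gib(n2, nsteps) = n2 * n2 * nsteps * sizeof(Float64) / 2^30

per_call = result_gib(500, 1000)   # one Mtime array returned by test1/test2
all_calls = 8 * per_call           # eight results collected by pmap on the main process

println("per call: ", round(per_call, digits = 2), " GiB")
println("8 calls:  ", round(all_calls, digits = 2), " GiB")
```

This comes to roughly 1.86 GiB per call and about 14.9 GiB for eight calls, which is close to the 14.910 GiB that @time reports for the pmap runs, so most of those allocations are plausibly the returned arrays arriving on the main process. If only a summary of each array is needed, returning that from the mapped function instead of the full array would avoid most of the transfer.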