For loop in function and multiplication of larger matrices, slow speed in parallel

luboshanus · November 19, 2019, 7:50am

Hello,

I’d like to understand how large for loops perform in Julia. Running the same function on one core and on multiple cores does not give me similar timings.

The computer has 48 physical cores, 400GB memory, there should not be any problem with limits.

Here is the example I got, defining arrays and function for multiplication on 8 cores.

using Distributed
addprocs(8)
@everywhere M = randn(1000, 1000)
@everywhere V = randn(1000, 500)
@everywhere mult3!(D, A, B, C) = D = A * B * C

@everywhere function test1(M, V)
	N1, N2 = size(V)
	Mtime = zeros(N2, N2, 1000)

	for i in 1:1000
	    Mtime[:, :, i] =  V' * M * V
	end
	return Mtime
end

I also have a function that gives the same result but uses a multiply function instead of *. It should make fewer allocations, which matters because I plan to use this in a large setup.

@everywhere function test2(M, V)
	N1, N2 = size(V)
	Mtime = zeros(N2, N2, 1000)
	D = similar(Mtime[:, :, 1]);

	for i in 1:1000
	    Mtime[:, :, i] =  mult3!(D, V', M, V)
	end
	return Mtime
end

When I run the functions on one core:

@time test1(M, V);
8.305153 seconds (4.01 k allocations: 7.451 GiB, 2.96% gc time)
@time test2(M, V); # faster, but more allocations
7.343611 seconds (5.01 k allocations: 7.454 GiB, 1.93% gc time)

These single-core results are hopefully correct. However, when I parallelize and run the same functions on 8 cores using pmap, the execution time is much longer.

@time pmap(it -> test1(M, V), 1:8);
43.275477 seconds (210.83 k allocations: 14.910 GiB, 0.73% gc time)
@time pmap(it -> test2(M, V), 1:8);
43.466093 seconds (212.28 k allocations: 14.918 GiB, 0.69% gc time)

What I see is that for the single-core run CPU use is at 800%, but when I parallelize I see 8 processes at 100% each. My expectation is that the same task on different cores should take a similar time as on one core. I understand this is not going to scale linearly, but surely it should not be roughly 5 to 6 times slower on eight cores than on one?

Also, I don’t see much benefit from using the mul! function.

Is there anything I am missing? Thank you for helping.
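One way to investigate the CPU-usage observation above (800% in one process versus 8 processes at 100%) is to compare the BLAS thread count on the main process and on the workers. This is a sketch under assumptions, not something confirmed in the thread: it assumes Julia 1.6 or newer (for `BLAS.get_num_threads`) and relies on the documented default `enable_threaded_blas = false` of `addprocs`, which makes BLAS single-threaded on added workers. The variable names are hypothetical.

```julia
using Distributed, LinearAlgebra

# Add two workers with the default setting (enable_threaded_blas = false).
addprocs(2)
@everywhere using LinearAlgebra

# BLAS threads used by the main process (typically more than one).
main_threads = BLAS.get_num_threads()

# BLAS threads used by each worker process.
worker_threads = Dict(
    w => remotecall_fetch(() -> BLAS.get_num_threads(), w) for w in workers()
)

println("main process: ", main_threads, " BLAS threads")
for (w, n) in worker_threads
    println("worker ", w, ": ", n, " BLAS threads")
end
```

If the workers report fewer BLAS threads than the main process, each pmap task runs its matrix products with less parallelism than the single-process run, which would be consistent with the timings reported above. Passing `enable_threaded_blas = true` to `addprocs` changes that default, at the risk of oversubscribing cores.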

baggepinnen · November 22, 2019, 11:24am

You can try rewriting it in terms of mul!, or use TensorOperations.jl, which should be efficient for this kind of task.
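A minimal sketch of what a rewrite in terms of mul! could look like. The names `mult3_inplace!` and `test3`, the scratch buffer `tmp`, and the `reps` keyword are hypothetical and not from the thread; the sketch assumes dense Float64 matrices as in the original example.

```julia
using LinearAlgebra

# Hypothetical helper: computes D = A * B * C without allocating.
# `tmp` is a caller-supplied scratch buffer with the shape of B * C.
function mult3_inplace!(D, tmp, A, B, C)
    mul!(tmp, B, C)   # tmp = B * C
    mul!(D, A, tmp)   # D   = A * tmp
    return D
end

# Hypothetical variant of test1/test2 that writes each product
# directly into a slice of the output array through a view.
function test3(M, V; reps = 1000)
    N1, N2 = size(V)
    Mtime = zeros(N2, N2, reps)
    tmp = similar(V)                 # N1 x N2 scratch for M * V
    for i in 1:reps
        D = @view Mtime[:, :, i]     # no copy, writes go into Mtime
        mult3_inplace!(D, tmp, V', M, V)
    end
    return Mtime
end
```

With this shape, the only large allocations are the output array and one scratch buffer, instead of fresh temporaries on every iteration.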

luboshanus · November 22, 2019, 1:08pm

Ok, thank you for the reply! I will try that.

Additionally, let me ask about pmap and the speed, which I don’t understand. Why are the times this long and the allocations so large?

baggepinnen · November 22, 2019, 1:42pm

pmap needs to send the data between processes. Furthermore, everything you do allocates memory:

this is not in place

this is not

each invocation of the mapped function returns a very large array
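To put a number on the last point, here is a back-of-the-envelope calculation of how much data each call returns to the main process. The sizes come from the example in the first post (500 columns in V, 1000 iterations, 8 tasks); the variable names are hypothetical.

```julia
# Sizes taken from the original example.
N2     = 500     # number of columns of V, so each product is N2 x N2
reps   = 1000    # loop iterations, i.e. slices in Mtime
ntasks = 8       # elements in the range passed to pmap

# Size of one returned Mtime array, in bytes and in GiB.
bytes_per_result = N2 * N2 * reps * sizeof(Float64)
gib_per_result   = bytes_per_result / 2^30

# Total data that ends up on the main process after pmap.
total_gib = ntasks * gib_per_result

println("one result:  ", round(gib_per_result, digits = 2), " GiB")
println("all results: ", round(total_gib, digits = 2), " GiB")
```

Each result is about 1.86 GiB, so the eight results together come to about 14.9 GiB, which is close to the allocation total that @time reports for the pmap calls. Returning a smaller summary from each task, where the application allows it, would reduce both the transfer and the allocations.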
