Vectorise summation over index in nested for loop

eveningsilverfox · October 10, 2024, 4:06pm

Given a function

using LinearAlgebra

function test(arr1, arr2, arr3, g)
     return 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*g)
end

N1 = 100000; arr1 = range(0,10,N1);
N2 = 5000; arr2 = range(0,10,N2);
N3 = 5000; arr3 = range(0,10,N3);

opar = zeros(ComplexF64, N1);
##---Wrong---
# for ij = 1:N2
#      Threads.@threads for jk = 1:N3
#           opar .= opar .+ test(arr1, arr2[ij], arr3[jk], 1);
#      end
# end
##---
for ij = 1:N2
     for jk = 1:N3
          opar .+= 1 ./ (arr1 .+ arr2[ij] .+ arr3[jk] .+ im*g);
     end
end

Is there a way to vectorize the innermost sum, over jk? In my application, N1 is much larger than N2 and N3, so I was hoping to treat that as a broadcast dimension.
I have explicitly multithreaded the for loop as, without it, my linux system monitor app was showing that only 1 cpu was active. I have an i7 13700K, 8 performance + 8 efficient cores. I am starting Julia with julia -t16.

stevengj · October 10, 2024, 4:47pm

Note that this is not fully in-place: it is equivalent to:

tmp = test(arr1, arr2[ij], arr3[jk], 1)
opar .= opar .+ tmp

so it still allocates tmp.

I also don’t understand your use of @threads here. You have a race condition where different threads are writing to opar simultaneously.

eveningsilverfox · October 10, 2024, 5:00pm

Corrected. You are right about the race condition, an oversight on my part.
So @views a .+= b, where a,b are vectors, should perform an in place addition?

stevengj · October 10, 2024, 5:06pm

a .+= b doesn’t allocate, with or without @views. (@views makes no difference unless you are slicing, e.g. a[1:n] .+= b, equivalent to a[1:n] .= a[1:n] .+ b, will allocate without @views.)

The problem here is not the .+=, it is that your function call test(...) allocates a new vector for its return value.

If you did

opar .+= 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*1)

i.e. if you inlined the test code, then it wouldn’t allocate. But dot-call loop fusion does not cross over function calls — the thing to understand is that the loop fusion is not a compiler optimization, it is a syntactic guarantee, which means it has to do the same thing regardless of what test(...) does internally.

bertschi · October 10, 2024, 7:47pm

In this case, instead of inlining test it should also work to just define it on scalars, i.e.,

function test(x1, x2, x3, g)
     return 1 / (x1 + x2 + x3 + im*g)
end

and broadcast the call

opar .= opar .+ test.(arr1, arr2[ij], arr3[jk], 1)  # No semicolon necessary here

If in doubt, one can also spell out the loop in Julia – which is fast and basically the same as what dot-fusion obtains

for k in eachindex(arr1, opar)
    opar[k] += test(arr1[k], arr2[ij], arr3[jk], 1)
end

This has the advantage of being more explicit, i.e., making it harder to hide unexpected allocations etc. On the other hand, it’s longer and broadcasting is often a better compromise between performance and conciseness imho.

You may also want to check the performance tips and start by putting your code into a function.

DNF · October 10, 2024, 10:33pm

I don’t think you need the dots here, only at the callsite.

DNF · October 10, 2024, 10:51pm

BTW, without checking, there’s a chance that inv(z) is faster than 1/z.

bertschi · October 10, 2024, 11:33pm

Of course, forgot to remove them, thanks a lot.
Have edited my post accordingly.

Topic		Replies	Views
For loop optimization Performance performance , memory-allocation , loops	3	1657	October 31, 2021
Memory usage in nested loops New to Julia multithreading , memory-allocation	2	763	December 10, 2019
What's the problem with this simple multi-thread code? General Usage question	17	1101	March 11, 2022
Question on multithreading/vectorizing loops New to Julia simd , multithreading , threads	9	672	March 22, 2023
Multithreading for nested loops Performance multithreading	42	12368	January 20, 2022

Vectorise summation over index in nested for loop

Related topics