Vectorise summation over index in nested for loop

Given a function

using LinearAlgebra

function test(arr1, arr2, arr3, g)
     return 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*g)
end

N1 = 100000; arr1 = range(0,10,N1);
N2 = 5000; arr2 = range(0,10,N2);
N3 = 5000; arr3 = range(0,10,N3);

opar = zeros(ComplexF64, N1);
##---Wrong---
# for ij = 1:N2
#      Threads.@threads for jk = 1:N3
#           opar .= opar .+ test(arr1, arr2[ij], arr3[jk], 1);
#      end
# end
##---
for ij = 1:N2
     for jk = 1:N3
          opar .+= 1 ./ (arr1 .+ arr2[ij] .+ arr3[jk] .+ im*1);
     end
end
 

Is there a way to vectorize the innermost sum, over jk? In my application, N1 is much larger than N2 and N3, so I was hoping to treat that as a broadcast dimension.
I explicitly multithreaded the inner loop because, without it, my Linux system monitor showed only one active CPU. I have an i7-13700K (8 performance + 8 efficiency cores) and start Julia with julia -t16.

Note that this is not fully in-place: it is equivalent to:

tmp = test(arr1, arr2[ij], arr3[jk], 1)
opar .= opar .+ tmp

so it still allocates tmp.

I also don’t understand your use of @threads here. You have a race condition where different threads are writing to opar simultaneously.
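One race-free pattern (a sketch, not code from this thread) is to parallelize over the output index k instead: each task then writes disjoint elements of opar, so no two threads touch the same location. Small sizes here are just for illustration:

```julia
using Base.Threads

N1, N2, N3 = 100, 20, 20
arr1 = range(0, 10, N1); arr2 = range(0, 10, N2); arr3 = range(0, 10, N3)
opar = zeros(ComplexF64, N1)

@threads for k in eachindex(opar)
    # accumulate the full double sum for this single output element
    s = zero(ComplexF64)
    for ij in 1:N2, jk in 1:N3
        s += 1 / (arr1[k] + arr2[ij] + arr3[jk] + im*1)
    end
    opar[k] += s   # only this task writes opar[k], so there is no race
end
```

This also does the N2*N3 inner work per thread with scalars, so it allocates nothing inside the loop.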


Corrected. You are right about the race condition, an oversight on my part.
So @views a .+= b, where a and b are vectors, should perform an in-place addition?

a .+= b doesn’t allocate, with or without @views. (@views makes no difference unless you are slicing, e.g. a[1:n] .+= b, equivalent to a[1:n] .= a[1:n] .+ b, will allocate without @views.)
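You can check this yourself with the built-in @allocated macro (a sketch; exact byte counts vary by Julia version, so only the comparison matters):

```julia
add_whole!(a, b)  = (a .+= b; nothing)                    # fused in-place loop
add_slice!(a, b)  = (a[1:50] .+= b[1:50]; nothing)        # RHS slices copy first
add_viewed!(a, b) = (@views a[1:50] .+= b[1:50]; nothing) # views avoid the copies

a = zeros(100); b = ones(100)
add_whole!(a, b); add_slice!(a, b); add_viewed!(a, b)     # run once to compile
alloc_whole  = @allocated add_whole!(a, b)                # typically 0 bytes
alloc_slice  = @allocated add_slice!(a, b)                # pays for the slice copies
alloc_viewed = @allocated add_viewed!(a, b)               # typically 0 bytes
```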

The problem here is not the .+=, it is that your function call test(...) allocates a new vector for its return value.

If you did

opar .+= 1 ./ (arr1 .+ arr2[ij] .+ arr3[jk] .+ im*1)

i.e. if you inlined the test code, then it wouldn’t allocate. But dot-call loop fusion does not cross over function calls — the thing to understand is that the loop fusion is not a compiler optimization, it is a syntactic guarantee, which means it has to do the same thing regardless of what test(...) does internally.
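To make this concrete, here is a sketch contrasting the two (vec_test is a stand-in for the vector-returning test above; scalars stand in for arr2[ij] and arr3[jk]):

```julia
vec_test(arr1, x2, x3, g) = 1 ./ (arr1 .+ x2 .+ x3 .+ im*g)  # returns a new vector

fused!(opar, arr1)   = (opar .+= 1 ./ (arr1 .+ 0.5 .+ 0.5 .+ im*1); nothing)
viacall!(opar, arr1) = (opar .+= vec_test(arr1, 0.5, 0.5, 1); nothing)

arr1 = range(0, 10, 1000)
opar = zeros(ComplexF64, 1000)
fused!(opar, arr1); viacall!(opar, arr1)   # run once to compile
a_fused = @allocated fused!(opar, arr1)    # one fused in-place loop
a_call  = @allocated viacall!(opar, arr1)  # pays for vec_test's result vector
```

The function-call version must allocate its full result before the .+= can run, exactly because the fusion guarantee stops at the call boundary.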


In this case, instead of inlining test it should also work to just define it on scalars, i.e.,

function test(x1, x2, x3, g)
     return 1 / (x1 + x2 + x3 + im*g)
end

and broadcast the call

opar .= opar .+ test.(arr1, arr2[ij], arr3[jk], 1)  # No semicolon necessary here

If in doubt, one can also spell out the loop in Julia, which is fast and basically the same as what dot-fusion produces:

for k in eachindex(arr1, opar)
    opar[k] += test(arr1[k], arr2[ij], arr3[jk], 1)
end

This has the advantage of being more explicit, i.e., making it harder to hide unexpected allocations etc. On the other hand, it’s longer and broadcasting is often a better compromise between performance and conciseness imho.
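As for vectorizing the jk sum itself: with the scalar test one can broadcast arr3 along a second dimension and sum it out. Just a sketch (x2 stands in for arr2[ij]); note it materializes an N1×N3 temporary, which may defeat the purpose at the sizes in the question:

```julia
test(x1, x2, x3, g) = 1 / (x1 + x2 + x3 + im*g)

N1, N3 = 100, 20
arr1 = range(0, 10, N1); arr3 = range(0, 10, N3)
x2 = 0.5
opar = zeros(ComplexF64, N1)

# arr3' is a 1×N3 row, so the broadcast yields an N1×N3 matrix;
# summing over dims=2 collapses the jk index back to a length-N1 vector.
opar .+= vec(sum(test.(arr1, x2, arr3', 1), dims=2))
```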

You may also want to check the performance tips and start by putting your code into a function.

I don’t think you need the dots here, only at the call site.

BTW, without checking, there’s a chance that inv(z) is faster than 1/z.
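They do agree numerically, at least (a quick sanity check; actual timing comparison left to BenchmarkTools, which I haven't run here):

```julia
# inv has a specialized method for Complex; both compute the reciprocal.
z = 1.5 + 2.5im
@assert inv(z) ≈ 1/z
```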

Of course, forgot to remove them, thanks a lot.
Have edited my post accordingly.