Vectorise summation over index in nested for loop

Given a function

using LinearAlgebra

function test(arr1, arr2, arr3, g)
     return 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*g)
end

N1 = 100000; arr1 = range(0,10,N1);
N2 = 5000; arr2 = range(0,10,N2);
N3 = 5000; arr3 = range(0,10,N3);

opar = zeros(ComplexF64, N1);
##---Wrong---
# for ij = 1:N2
#      Threads.@threads for jk = 1:N3
#           opar .= opar .+ test(arr1, arr2[ij], arr3[jk], 1);
#      end
# end
##---
for ij = 1:N2
     for jk = 1:N3
          opar .+= 1 ./ (arr1 .+ arr2[ij] .+ arr3[jk] .+ im*g);
     end
end
 

Is there a way to vectorize the innermost sum, over jk? In my application, N1 is much larger than N2 and N3, so I was hoping to treat that as a broadcast dimension.
I have explicitly multithreaded the for loop as, without it, my linux system monitor app was showing that only 1 cpu was active. I have an i7 13700K, 8 performance + 8 efficient cores. I am starting Julia with julia -t16.

Note that this is not fully in-place: it is equivalent to:

tmp = test(arr1, arr2[ij], arr3[jk], 1)
opar .= opar .+ tmp

so it still allocates tmp.

I also don’t understand your use of @threads here. You have a race condition where different threads are writing to opar simultaneously.

Corrected. You are right about the race condition, an oversight on my part.
So @views a .+= b, where a,b are vectors, should perform an in place addition?

a .+= b doesn’t allocate, with or without @views. (@views makes no difference unless you are slicing, e.g. a[1:n] .+= b, equivalent to a[1:n] .= a[1:n] .+ b, will allocate without @views.)

The problem here is not the .+=, it is that your function call test(...) allocates a new vector for its return value.

If you did

opar .+= 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*1)

i.e. if you inlined the test code, then it wouldn’t allocate. But dot-call loop fusion does not cross over function calls — the thing to understand is that the loop fusion is not a compiler optimization, it is a syntactic guarantee, which means it has to do the same thing regardless of what test(...) does internally.

In this case, instead of inlining test it should also work to just define it on scalars, i.e.,

function test(x1, x2, x3, g)
     return 1 / (x1 + x2 + x3 + im*g)
end

and broadcast the call

opar .= opar .+ test.(arr1, arr2[ij], arr3[jk], 1)  # No semicolon necessary here

If in doubt, one can also spell out the loop in Julia – which is fast and basically the same as what dot-fusion obtains

for k in eachindex(arr1, opar)
    opar[k] += test(arr1[k], arr2[ij], arr3[jk], 1)
end

This has the advantage of being more explicit, i.e., making it harder to hide unexpected allocations etc. On the other hand, it’s longer and broadcasting is often a better compromise between performance and conciseness imho.

You may also want to check the performance tips and start by putting your code into a function.

I don’t think you need the dots here, only at the callsite.

BTW, without checking, there’s a chance that inv(z) is faster than 1/z.

Of course, forgot to remove them, thanks a lot.
Have edited my post accordingly.