using LinearAlgebra
function test(arr1, arr2, arr3, g)
return 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*g)
end
N1 = 100000; arr1 = range(0,10,N1);
N2 = 5000; arr2 = range(0,10,N2);
N3 = 5000; arr3 = range(0,10,N3);
opar = zeros(ComplexF64, N1);
##---Wrong---
# for ij = 1:N2
# Threads.@threads for jk = 1:N3
# opar .= opar .+ test(arr1, arr2[ij], arr3[jk], 1);
# end
# end
##---
for ij = 1:N2
for jk = 1:N3
opar .+= 1 ./ (arr1 .+ arr2[ij] .+ arr3[jk] .+ im*g);
end
end
Is there a way to vectorize the innermost sum, over jk? In my application, N1 is much larger than N2 and N3, so I was hoping to treat that as a broadcast dimension.
I have explicitly multithreaded the for loop as, without it, my linux system monitor app was showing that only 1 cpu was active. I have an i7 13700K, 8 performance + 8 efficient cores. I am starting Julia with julia -t16.
Corrected. You are right about the race condition, an oversight on my part.
So @views a .+= b, where a,b are vectors, should perform an in place addition?
a .+= b doesn’t allocate, with or without @views. (@views makes no difference unless you are slicing, e.g. a[1:n] .+= b, equivalent to a[1:n] .= a[1:n] .+ b, will allocate without @views.)
The problem here is not the .+=, it is that your function call test(...) allocates a new vector for its return value.
If you did
opar .+= 1 ./ (arr1 .+ arr2 .+ arr3 .+ im*1)
i.e. if you inlined the test code, then it wouldn’t allocate. But dot-call loop fusion does not cross over function calls — the thing to understand is that the loop fusion is not a compiler optimization, it is a syntactic guarantee, which means it has to do the same thing regardless of what test(...) does internally.
In this case, instead of inlining test it should also work to just define it on scalars, i.e.,
function test(x1, x2, x3, g)
return 1 / (x1 + x2 + x3 + im*g)
end
and broadcast the call
opar .= opar .+ test.(arr1, arr2[ij], arr3[jk], 1) # No semicolon necessary here
If in doubt, one can also spell out the loop in Julia – which is fast and basically the same as what dot-fusion obtains
for k in eachindex(arr1, opar)
opar[k] += test(arr1[k], arr2[ij], arr3[jk], 1)
end
This has the advantage of being more explicit, i.e., making it harder to hide unexpected allocations etc. On the other hand, it’s longer and broadcasting is often a better compromise between performance and conciseness imho.
You may also want to check the performance tips and start by putting your code into a function.