Thank yoou @Oscar_Smith and @Elrod , your explanations clarify a lot of things. Still, I am not entirely sure what is happening hereā¦
Letās start with SIMD, which as you write is not used by regular broadcasting. Why is that? Is this something related to the number of arguments in the broadcast, or a general āfeatureā? I am asking because the drop in performance does depend on the number of vectors added.
Regarding your other point, I understand there are many different ways to perform a broadcasting, but if one does not use the associativity of the sum (which Julia does not by default) I can think only of a single sensible way of doing a broadcasting: just use a single loop and perform the full sum of each element inside the loop. Other options would require multiple passes (i.e., multiple loops), and are likely to be slower. To test the performance, I did a few more tests:
n = 10000; x = rand(n); y = rand(n); z = rand(n); t = rand(n); r = zeros(n);
function test4a!(r, x, y, z, t)
@. r = x + y + z + t
end
function test4b!(r, x, y, z, t)
@turbo @. r = x + y + z + t
end
function test4c!(r, x, y, z, t)
@inbounds for i ā 1:10000
r[i] = x[i] + y[i] + z[i] + t[i]
end
end
function test4d!(r, x, y, z, t)
@inbounds @simd for i ā 1:10000
r[i] = x[i] + y[i] + z[i] + t[i]
end
end
I then measured the performance as usual:
julia> @btime test4a!($r, $x, $y, $z, $t);
8.319 Ī¼s (0 allocations: 0 bytes)
julia> @btime test4b!($r, $x, $y, $z, $t);
4.787 Ī¼s (0 allocations: 0 bytes)
julia> @btime test4c!($r, $x, $y, $z, $t);
4.897 Ī¼s (0 allocations: 0 bytes)
julia> @btime test4d!($r, $x, $y, $z, $t);
4.913 Ī¼s (0 allocations: 0 bytes)
If I have to conclude something, I would say that

LoopVectorization
is just doing the obvious thing, loop over each element.
 SIMD does not help at all: either the macro is ignored, or for some reason I do not understand SIMD cannot be used here. Note that the same happens if I define functions
test3d!
or test2d!
that only add three or two vectors: @simd
is not improving the performance.
 The standard broadcasting does something I still do not understand. Is there a way to understand the way the loop is performed in Julia? That clearly would help me (and hopefully other people) in writing efficient code, without being forced to either guess or to test each single line as I am doing here.