In the dummy code below, I show that summing over the 4-vectors with broadcasting is an order of magnitude slower than iterating over the 4-vector components manually. However, I don’t understand why, or whether I can keep using some form of broadcasting that is just as fast as the manual version.

Any ideas?

```
using BenchmarkTools

# Broadcasting version: add every column of arr2 to every column of arr1.
function func1(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            @views @. arr1[:, row] = arr1[:, row] + arr2[:, i]
        end
    end
    return arr1
end

# Manual version: same sum, with an explicit loop over the 4 components.
function func2(arr1, arr2)
    for row in axes(arr1, 2)
        for i in axes(arr2, 2)
            for j in 1:4
                arr1[j, row] = arr1[j, row] + arr2[j, i]
            end
        end
    end
    return arr1
end

arr1 = zeros(4, 100_000)
arr2 = rand(4, 256)
@btime func1($arr1, $arr2) evals=1 samples=10 seconds=100
# 653.253 ms (76800000 allocations: 3.43 GiB)
@btime func2($arr1, $arr2) evals=1 samples=10 seconds=100
# 73.863 ms (0 allocations: 0 bytes)
```
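
For what it's worth, here is a small sanity check that the two versions really compute the same thing, so the timing comparison is apples to apples. It is only a sketch: the names `sum_broadcast!` and `sum_loop!` are placeholders I made up for this check (they mirror `func1` and `func2` above), and the array sizes are deliberately tiny.

```julia
# Broadcasting version (same logic as func1 above; the name is a placeholder).
function sum_broadcast!(acc, vecs)
    for col in axes(acc, 2), i in axes(vecs, 2)
        @views @. acc[:, col] = acc[:, col] + vecs[:, i]
    end
    return acc
end

# Manual-loop version (same logic as func2 above; the name is a placeholder).
function sum_loop!(acc, vecs)
    for col in axes(acc, 2), i in axes(vecs, 2)
        for j in 1:4
            acc[j, col] = acc[j, col] + vecs[j, i]
        end
    end
    return acc
end

vecs = rand(4, 8)

# Both functions mutate their first argument, so each gets a fresh accumulator.
out_broadcast = sum_broadcast!(zeros(4, 5), vecs)
out_loop = sum_loop!(zeros(4, 5), vecs)

# The two versions should agree with each other ...
@assert out_broadcast ≈ out_loop
# ... and every accumulator column should equal the sum of all columns of vecs.
@assert out_loop[:, 1] ≈ vec(sum(vecs, dims=2))
```

This says nothing about where the allocations in the broadcasting version come from; it only rules out the two functions doing different work.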