I have a function which operates on tuples of arrays using broadcasting, but I’m rewriting it because it doesn’t play well with ForwardDiff.jl. However, I’ve noticed that the version that loops through the array entries is slower and allocates more.

I’ve included what I hope is a minimal enough working example. I take tuples of arrays `x` and `y` and sum them elementwise in `bcast` and `loop`.

```
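using BenchmarkTools  # provides @btime for the timings below

# Two tuples, each holding two 2×4 arrays
x = ntuple(x->randn(2,4),2)
y = ntuple(x->randn(2,4),2)

# Broadcast version: adds the arrays pairwise across the two tuples
function bcast(x,y)
    fsum(x,y) = x + y
    out = fsum.(x,y)
    return out
end

# Loop version: visits each linear index i and fills the outputs entry by entry
function loop(x,y)
    out = ntuple(a->zeros(size(x[1])),length(x))
    for i = 1:length(x[1])
        xi = (x->x[i]).(x)   # i-th entry of every array in x
        yi = (x->x[i]).(y)   # i-th entry of every array in y
        fsum!(x,y,out) = out[i] = x + y
        fsum!.(xi,yi,out)
    end
    return out
end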
```

Timing each one gives me

```
julia> @btime bcast($x,$y)
  116.892 ns (3 allocations: 320 bytes)

julia> @btime loop($x,$y)
  2.159 μs (52 allocations: 1.83 KiB)
```

The extra allocations in the looped function come from `fsum!.(xi,yi,out)`, but I’m having trouble figuring out why. I tried `@code_warntype`, but I haven’t been able to interpret its output.
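For comparison, here is a minimal sketch of the same sum written as plain nested loops, with no inner closure and no per-index broadcast. The name `loop_plain` is hypothetical and not part of my original code; I am assuming that every array in `x` has the same size as the matching array in `y`, and that the result should equal what `bcast` returns. It is only meant as a baseline for narrowing down where the allocations come from, not as a claim about the cause.

```julia
# Hypothetical baseline: same result as bcast, written with explicit loops.
function loop_plain(x, y)
    # One output array per tuple entry, matching element type and size
    out = ntuple(k -> zeros(eltype(x[k]), size(x[k])), length(x))
    for k in 1:length(x)
        # Linear indices shared by the k-th input and output arrays
        for i in eachindex(x[k], y[k], out[k])
            out[k][i] = x[k][i] + y[k][i]
        end
    end
    return out
end

x = ntuple(_ -> randn(2, 4), 2)
y = ntuple(_ -> randn(2, 4), 2)
result = loop_plain(x, y)
```

Running `@btime loop_plain($x, $y)` next to the two timings above should show whether the closure-and-broadcast pattern inside `loop` is what drives the extra allocations.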