I think the key is that you're reducing `+` over matrices — thus each intermediate result requires an allocation. Broadcast fusion cannot save you here because it works at a *syntax* level (that is, the dots you immediately see on a single line are all the fusion you get) and thus doesn't fuse across function boundaries. So when you move from a 32-bit reduction to a 64-bit reduction, every single one of those intermediate allocations is now twice as big.
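To illustrate the syntax-level limit, here's a minimal sketch (the function names `fused`, `partial`, and `split` are illustrative, not from the original code):

```julia
# All dots appear on one line, so the whole expression fuses
# into a single loop with one output allocation.
fused(A, B, C) = A .+ B .* C

# Hiding part of the expression behind a function call breaks fusion:
partial(B, C) = B .* C              # allocates a temporary for B .* C
split(A, B, C) = A .+ partial(B, C) # then allocates again for the sum
```

Both compute the same result, but `split` pays for an extra intermediate array because the compiler cannot see through the call to `partial` when deciding what to fuse.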

Even though `@distributed` doesn't specify the direction of the reduction, I think it'd be safe to use a semi-mutating `plus!` function since you don't use the intermediate values after the reduction. It's about the same speed as plain old 32-bit `+` since it avoids those intermediate allocations for the accumulator.

```julia
# Fallback: widen the first argument to Float64, then dispatch to the mutating method
julia> @everywhere plus!(A, B) = plus!(Float64.(A), B)

# Once A is a Float64 array, accumulate into it in place
julia> @everywhere plus!(A::AbstractArray{Float64}, B) = (A .+= B; A)

# Edit: times updated to display the second call (and not include compilation time)
julia> @time @distributed plus! for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.291792 seconds (69.00 k allocations: 3.505 MiB)
5×5 Array{Float64,2}:
 5.0012e6   5.00046e6  4.99949e6  4.99955e6  4.99949e6
 4.9988e6   4.99835e6  4.9997e6   5.0004e6   4.99948e6
 5.0003e6   5.00076e6  5.00075e6  5.0004e6   5.00004e6
 4.99857e6  5.00158e6  5.00029e6  5.00151e6  5.00077e6
 4.99916e6  5.0017e6   5.00112e6  4.99941e6  5.00066e6

julia> @time @distributed (+) for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.307239 seconds (68.40 k allocations: 3.467 MiB)
5×5 Array{Float32,2}:
 5.00007e6  4.99909e6  5.00012e6  4.99967e6  5.00126e6
 5.00167e6  5.001e6    5.00013e6  4.99901e6  5.00019e6
 5.00006e6  4.9996e6   5.00097e6  4.99874e6  5.00103e6
 5.0004e6   5.0e6      5.00016e6  5.00088e6  4.99989e6
 5.00012e6  4.99757e6  5.00069e6  4.99965e6  5.00034e6
```