I have a computation where hundreds of millions of Float32 values (for speed) need to be reduction summed in Float64 accumulators.

Assuming the reduction is being driven by something like:

```
accum = Array{Float64}(undef,(m,n))
accum .= @distributed (+) for i in 1:100000000
foo = Array{Float32}(undef,(m,n))
foo .= <some Float32 computation>
foo
end
```

The question is, what is the Julia (v1.1.1) idiom for doing the intermediate summation reductions in Float64?

TIA for any help you might be able to provide!

While I have no info on the `@distributed`

part of your question, this may help.

```
julia> float32s = rand(Float32, 1_000_000);
julia> sum(float32s)
499742.25f0
julia> reduce(+, float32s, init = zero(Float64))
499742.24400389194
```

2 Likes

Thanks for the hint @JeffreySarnoff!

I am fairly new to Julia. However, if I am reading the code for the @distributed macro correctly, the argument `[reducer]`

â once passed to the `preduce`

function â can not be a `reduce`

call, but only the âbinary opâ part of the call to `reduce`

. (Please correct me if I am wrong â preferably with a working toy example! )

I guess my original question still stands then. What is the idiom *within the context of the @distributed macro* for performing a Float64 reduction of a Float32 computation?

Do you have a minimal working example of such a calculation? Note that you can use `@macroexpand`

to see what the macro is doing, and then you can just adapt the resulting code as you need instead of using the macro itself.

You should be able to pass arbitrary binary functions as the reducer, as long as they are defined on the workers. So if Iâm understand what you want correctly, you could just do

```
@everywhere myplus(x,y) = Float64.(x) + Float64.(y)
accum = Array{Float64}(undef, 3, 2)
accum .= @distributed (myplus) for i = 1:10
foo = rand(Float32, 3,2)
end
```

1 Like

@ericphanson Your suggestion indeed looked like a winner. The only unfortunate thing was that (on my SMP/hyperthreaded machine) it took ~50% longer to perform the calculation than to simply assign the result of the Float32 calculation to an @everywhere pre-allocated Float64 temporary array that participated in the reduction.

Iâm now going to follow up on @dpsanders suggestion of using @macroexpand and play around some more.

Thanks everyone for the very helpful suggestions!

Ah, okay. One other thing to try would be

```
@everywhere myplus(x,y) = Float64.(x) .+ Float64.(y)
```

i.e. with a broadcasted `+`

. I think that could be slightly more performant since it would fuse the `Float64`

conversion with the elementwise summation, which could possibly help a bit.

I think the key is that youâre reducing `+`

over matrices â thus each intermediate result requires an allocation. Broadcast fusion cannot save you here because it works at a *syntax* level (that is, the dots you immediately see on a single line is all the fusion you get) and thus doesnât fuse across function boundaries. And so when you move from a 32-bit reduction to a 64-bit reduction, each and every single one of those intermediate allocations is now twice as big.

Even though `@distributed`

doesnât specify the direction of the reduction, I think itâd be safe to use a semi-mutating `plus!`

function since you donât use the intermediate values after the reduction. Itâs about the same speed as plain old 32-bit `+`

since it avoids those intermediate allocations for the accumulator.

```
julia> @everywhere plus!(A, B) = plus!(Float64.(A), B)
julia> @everywhere plus!(A::AbstractArray{Float64}, B) = (A .+= B; A)
# Edit: times updated to display the second call (and not include compilation time)
julia> @time @distributed plus! for i in 1:10^7
rand(Float32, 5, 5)
end
0.291792 seconds (69.00 k allocations: 3.505 MiB)
5Ă5 Array{Float64,2}:
5.0012e6 5.00046e6 4.99949e6 4.99955e6 4.99949e6
4.9988e6 4.99835e6 4.9997e6 5.0004e6 4.99948e6
5.0003e6 5.00076e6 5.00075e6 5.0004e6 5.00004e6
4.99857e6 5.00158e6 5.00029e6 5.00151e6 5.00077e6
4.99916e6 5.0017e6 5.00112e6 4.99941e6 5.00066e6
julia> @time @distributed (+) for i in 1:10^7
rand(Float32, 5, 5)
end
0.307239 seconds (68.40 k allocations: 3.467 MiB)
5Ă5 Array{Float32,2}:
5.00007e6 4.99909e6 5.00012e6 4.99967e6 5.00126e6
5.00167e6 5.001e6 5.00013e6 4.99901e6 5.00019e6
5.00006e6 4.9996e6 5.00097e6 4.99874e6 5.00103e6
5.0004e6 5.0e6 5.00016e6 5.00088e6 4.99989e6
5.00012e6 4.99757e6 5.00069e6 4.99965e6 5.00034e6
```

4 Likes

This appears to be working quite well. Thanks for the help!

(Edited to add: Sorry. I can hear my junior high school English teacher yelling at me about âindefinite antecedants!â )

âThisâ means @mbaumanâs suggestion/example code of `plus!`

I must say that such a solution would not have occurred to me at this stage of my Julia learning curve!

1 Like