Float64 reduction of Float32 computation idiom using @distributed?

I have a computation where hundreds of millions of Float32 values (Float32 for speed) need to be summed into Float64 accumulators.

Assuming the reduction is being driven by something like:

accum = Array{Float64}(undef, (m, n))
accum .= @distributed (+) for i in 1:100_000_000
    foo = Array{Float32}(undef, (m, n))
    foo .= <some Float32 computation>
    foo
end

The question is, what is the Julia (v1.1.1) idiom for doing the intermediate summation reductions in Float64?

TIA for any help you might be able to provide!

While I have no info on the @distributed part of your question, this may help.

julia> float32s = rand(Float32, 1_000_000);

julia> sum(float32s)
499742.25f0

julia> reduce(+, float32s, init = zero(Float64))
499742.24400389194

Thanks for the hint @JeffreySarnoff!

I am fairly new to Julia. However, if I am reading the code for the @distributed macro correctly, the [reducer] argument – once passed to the preduce function – cannot be a full reduce call, only the 'binary op' part of such a call. (Please correct me if I am wrong – preferably with a working toy example! :wink: )

I guess my original question still stands then. What is the idiom within the context of the @distributed macro for performing a Float64 reduction of a Float32 computation?
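To illustrate the constraint as I understand it, here is a toy sketch (the anonymous reducer is just for demonstration):

using Distributed

# The reducer slot takes a plain two-argument function; there is no
# `init` keyword the way there is with `reduce`, so the accumulator
# type is whatever the loop body (and the reducer) produce.
total = @distributed ((x, y) -> x + y) for i in 1:100
    Float32(i)
end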

Do you have a minimal working example of such a calculation? Note that you can use @macroexpand to see what the macro is doing, and then you can just adapt the resulting code as you need instead of using the macro itself.
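For instance (a sketch; the exact expansion depends on your Julia version):

using Distributed

# Show the code the reducing form of @distributed generates:
@macroexpand @distributed (+) for i in 1:10
    rand(Float32, 3, 2)
end

# The printed expression is (roughly) a call to Distributed.preduce
# with the reducer, a generated per-chunk function, and the range.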

You should be able to pass arbitrary binary functions as the reducer, as long as they are defined on the workers. So if I'm understanding what you want correctly, you could just do

@everywhere myplus(x,y) = Float64.(x) + Float64.(y)

accum = Array{Float64}(undef, 3, 2)

accum .= @distributed (myplus) for i = 1:10
    rand(Float32, 3, 2)
end

@ericphanson Your suggestion indeed looked like a winner. The only unfortunate thing was that (on my SMP/hyperthreaded machine) it took ~50% longer to perform the calculation than to simply assign the result of the Float32 calculation to an @everywhere pre-allocated Float64 temporary array that participated in the reduction. :frowning:
I’m now going to follow up on @dpsanders’s suggestion of using @macroexpand and play around some more.
Thanks everyone for the very helpful suggestions!

Ah, okay. One other thing to try would be

@everywhere myplus(x,y) = Float64.(x) .+ Float64.(y)

i.e. with a broadcasted +. That could be slightly more performant, since it fuses the Float64 conversion with the elementwise summation into a single pass.
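To make the fusion concrete, a toy sketch of a single reduction step (the sizes are made up):

x = rand(Float32, 5, 5); y = rand(Float32, 5, 5)

a = Float64.(x) + Float64.(y)   # unfused: two converted copies, plus a third array for the sum
b = Float64.(x) .+ Float64.(y)  # fused: one pass over x and y, one allocation for the result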

I think the key is that you're reducing + over matrices, so each intermediate result requires an allocation. Broadcast fusion cannot save you here because it works at the syntax level (that is, the dots you see on a single line are all the fusion you get) and thus doesn't fuse across function boundaries. And when you move from a 32-bit reduction to a 64-bit reduction, every one of those intermediate allocations becomes twice as big.

Even though @distributed doesn’t specify the direction of the reduction, I think it’d be safe to use a semi-mutating plus! function since you don’t use the intermediate values after the reduction. It’s about the same speed as plain old 32-bit + since it avoids those intermediate allocations for the accumulator.

julia> @everywhere plus!(A, B) = plus!(Float64.(A), B)  # first step on each worker: allocate a fresh Float64 accumulator

julia> @everywhere plus!(A::AbstractArray{Float64}, B) = (A .+= B; A)  # later steps: accumulate in place

# Edit: times updated to display the second call (and not include compilation time)

julia> @time @distributed plus! for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.291792 seconds (69.00 k allocations: 3.505 MiB)
5×5 Array{Float64,2}:
 5.0012e6   5.00046e6  4.99949e6  4.99955e6  4.99949e6
 4.9988e6   4.99835e6  4.9997e6   5.0004e6   4.99948e6
 5.0003e6   5.00076e6  5.00075e6  5.0004e6   5.00004e6
 4.99857e6  5.00158e6  5.00029e6  5.00151e6  5.00077e6
 4.99916e6  5.0017e6   5.00112e6  4.99941e6  5.00066e6

julia> @time @distributed (+) for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.307239 seconds (68.40 k allocations: 3.467 MiB)
5×5 Array{Float32,2}:
 5.00007e6  4.99909e6  5.00012e6  4.99967e6  5.00126e6
 5.00167e6  5.001e6    5.00013e6  4.99901e6  5.00019e6
 5.00006e6  4.9996e6   5.00097e6  4.99874e6  5.00103e6
 5.0004e6   5.0e6      5.00016e6  5.00088e6  4.99989e6
 5.00012e6  4.99757e6  5.00069e6  4.99965e6  5.00034e6

This appears to be working quite well. Thanks for the help!

(Edited to add: Sorry. I can hear my junior high school English teacher yelling at me about “indefinite antecedents!” :wink: )

“This” means @mbauman’s suggestion/example code of plus! I must say that such a solution would not have occurred to me at this stage of my Julia learning curve!
