Float64 reduction of Float32 computation idiom using @distributed?

I have a computation where hundreds of millions of Float32 values (Float32 for speed) need to be summed into Float64 accumulators.

Assuming the reduction is being driven by something like:

accum = Array{Float64}(undef, (m, n))
accum .= @distributed (+) for i in 1:100_000_000
    foo = Array{Float32}(undef, (m, n))
    foo .= <some Float32 computation>
    foo
end

The question is, what is the Julia (v1.1.1) idiom for doing the intermediate summation reductions in Float64?

TIA for any help you might be able to provide!

While I have no info on the @distributed part of your question, this may help.

julia> float32s = rand(Float32, 1_000_000);

julia> sum(float32s)
499742.25f0

julia> reduce(+, float32s, init = zero(Float64))
499742.24400389194

Thanks for the hint @JeffreySarnoff!

I am fairly new to Julia. However, if I am reading the code for the @distributed macro correctly, the [reducer] argument – once passed to the preduce function – cannot be a full reduce call, only the 'binary op' part of such a call. (Please correct me if I am wrong – preferably with a working toy example! :wink: )

I guess my original question still stands then. What is the idiom within the context of the @distributed macro for performing a Float64 reduction of a Float32 computation?
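To illustrate the constraint as I understand it, here is a toy sketch (the anonymous reducer is just for demonstration):

using Distributed

# The reducer slot takes a plain two-argument function; there is no
# `init` keyword the way there is with `reduce`, so the accumulator
# type is whatever the loop body (and the reducer) produce.
total = @distributed ((x, y) -> x + y) for i in 1:100
    Float32(i)
end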

Do you have a minimal working example of such a calculation? Note that you can use @macroexpand to see what the macro is doing, and then you can just adapt the resulting code as you need instead of using the macro itself.
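For instance (a sketch; the exact expansion depends on your Julia version):

using Distributed

# Show the code the reducing form of @distributed generates:
@macroexpand @distributed (+) for i in 1:10
    rand(Float32, 3, 2)
end

# The printed expression is (roughly) a call to Distributed.preduce
# with the reducer, a generated per-chunk function, and the range.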

You should be able to pass arbitrary binary functions as the reducer, as long as they are defined on the workers. So if I'm understanding what you want correctly, you could just do

@everywhere myplus(x,y) = Float64.(x) + Float64.(y)

accum = Array{Float64}(undef, 3, 2)

accum .= @distributed (myplus) for i = 1:10
    rand(Float32, 3, 2)
end

@ericphanson Your suggestion indeed looked like a winner. The only unfortunate thing was that (on my SMP/hyperthreaded machine) it took ~50% longer to perform the calculation than to simply assign the result of the Float32 calculation to an @everywhere pre-allocated Float64 temporary array that participated in the reduction. :frowning:
I’m now going to follow up on @dpsanders’s suggestion of using @macroexpand and play around some more.
Thanks everyone for the very helpful suggestions!

Ah, okay. One other thing to try would be

@everywhere myplus(x,y) = Float64.(x) .+ Float64.(y)

i.e. with a broadcasted +. That could be slightly more performant, since it fuses the Float64 conversion with the elementwise summation into a single pass.
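To make the fusion concrete, a toy sketch of a single reduction step (the sizes are made up):

x = rand(Float32, 5, 5); y = rand(Float32, 5, 5)

a = Float64.(x) + Float64.(y)   # unfused: two converted copies, plus a third array for the sum
b = Float64.(x) .+ Float64.(y)  # fused: one pass over x and y, one allocation for the result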

I think the key is that you're reducing + over matrices, so each intermediate result requires an allocation. Broadcast fusion cannot save you here because it works at the syntax level (that is, the dots you see on a single line are all the fusion you get) and thus doesn't fuse across function boundaries. And when you move from a 32-bit reduction to a 64-bit reduction, every one of those intermediate allocations becomes twice as big.

Even though @distributed doesn’t specify the direction of the reduction, I think it’d be safe to use a semi-mutating plus! function since you don’t use the intermediate values after the reduction. It’s about the same speed as plain old 32-bit + since it avoids those intermediate allocations for the accumulator.

julia> @everywhere plus!(A, B) = plus!(Float64.(A), B)  # first step on each worker: allocate a fresh Float64 accumulator

julia> @everywhere plus!(A::AbstractArray{Float64}, B) = (A .+= B; A)  # later steps: accumulate in place

# Edit: times updated to display the second call (and not include compilation time)

julia> @time @distributed plus! for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.291792 seconds (69.00 k allocations: 3.505 MiB)
5×5 Array{Float64,2}:
 5.0012e6   5.00046e6  4.99949e6  4.99955e6  4.99949e6
 4.9988e6   4.99835e6  4.9997e6   5.0004e6   4.99948e6
 5.0003e6   5.00076e6  5.00075e6  5.0004e6   5.00004e6
 4.99857e6  5.00158e6  5.00029e6  5.00151e6  5.00077e6
 4.99916e6  5.0017e6   5.00112e6  4.99941e6  5.00066e6

julia> @time @distributed (+) for i in 1:10^7
           rand(Float32, 5, 5)
       end
  0.307239 seconds (68.40 k allocations: 3.467 MiB)
5×5 Array{Float32,2}:
 5.00007e6  4.99909e6  5.00012e6  4.99967e6  5.00126e6
 5.00167e6  5.001e6    5.00013e6  4.99901e6  5.00019e6
 5.00006e6  4.9996e6   5.00097e6  4.99874e6  5.00103e6
 5.0004e6   5.0e6      5.00016e6  5.00088e6  4.99989e6
 5.00012e6  4.99757e6  5.00069e6  4.99965e6  5.00034e6

This appears to be working quite well. Thanks for the help!

(Edited to add: Sorry. I can hear my junior high school English teacher yelling at me about “indefinite antecedents!” :wink: )

“This” means @mbauman’s suggestion/example code of plus! I must say that such a solution would not have occurred to me at this stage of my Julia learning curve!
