@turbo macro giving slightly different results

Shashank · January 25, 2023, 7:00pm

I am trying to vectorize a loop, and adding the turbo gives slightly different results. The code does not rely on specific execution order as is the case with the example in the post below:

I have pasted a MWE below. Are such small differences expected due to @turbo, or am I doing something wrong?

using LoopVectorization
global ans1 = 0.0e0
@turbo for i = 1:1000000
    ans1=ans1+sqrt(i)
end
println(ans1)
    
global ans2=0.0e0

for i = 1:1000000
    global ans2
    ans2=ans2+sqrt(i)
end
println(ans2)
println(ans1-ans2)

The output on my computer is as follows:

6.666671664588217e8
6.666671664588418e8
-2.014636993408203e-5

giordano · January 25, 2023, 7:32pm

Relative difference of 1e-13 doesn’t look terribly bad, and yes, it’s expected.

simeonschaub · January 26, 2023, 12:32am

To expand on this a little bit: Floating point addition is not associative ((a + b) + c may return something different than a + (b + c)). Since @turbo is allowed to reorder the loop operations, you may therefore get slightly different results.

NB: Reading from and writing to non-constant global variables in a hot loop is never a good idea if you care about performance, so avoiding this first should give much larger performance improvements than just throwing @turbo on your loops.

Elrod · January 26, 2023, 1:25am

@turbo will at least add a function barrier, so that it should be fast, even at global scope.
That is, you should only pay the cost of a few dynamic dispatches for the code it generates outside the barrier, which is O(1) w/ respect to the number of loop iterations.

FWIW, it should be the more accurate of the two in this example on average.
Try comparing with BigFloat and see which was closer.

pitsianis · January 26, 2023, 3:04pm

To complement @simeonschaub’s answer, a simple demo like this clarifies the issue of non-associativity of floating point addition.

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
0.0

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
0.0

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
9.094947017729282e-13

The most accurate answer will be computed if you add together numbers with similar magnitudes.

Shashank · January 26, 2023, 3:18pm

Thanks a lot. I was not sure it was due to the floating point error. My concern was that using @turbo macro in my code could lead to a large error in certain circumstances. But now that I know it is due to the floating point error, I am not that worried.

Elrod · January 28, 2023, 10:37am

Shashank:

using LoopVectorization
global ans1 = 0.0e0
@turbo for i = 1:1000000
    ans1=ans1+sqrt(i)
end
println(ans1)
    
global ans2=0.0e0

for i = 1:1000000
    global ans2
    ans2=ans2+sqrt(i)
end

julia> using LoopVectorization

julia> global ans1 = 0.0e0
0.0

julia> @turbo for i = 1:1000000
           ans1=ans1+sqrt(i)
       end

julia> println(ans1)
6.666671664588217e8

julia> global ans2=0.0e0
0.0

julia> for i = 1:1000000
           global ans2
           ans2=ans2+sqrt(i)
       end

julia> ans3 = sum(sqrt∘big, 1:1000000) |> Float64;

julia> println(ans1-ans3)
-4.76837158203125e-7

julia> println(ans2-ans3)
1.9669532775878906e-5

julia> (ans2-ans3) / (ans1-ans3)
-41.25

For this particular example on my particular hardware (tiger lake CPU), the in-order sum has 40 times the error as the @turbo version.

Topic		Replies	Views
@turbo macro gives incorrect results Performance loopvectorization	4	458	October 26, 2022
Why is this loop type not supported by LoopVectorization? New to Julia	6	314	August 16, 2023
Inconsistent results using LoopVectorization @turbo with linear indexing Performance	1	239	October 2, 2023
Problems with sequential update with @turbo General Usage loopvectorization	0	48	August 20, 2024
LoopVectorization: @turbo performs worse than @inbounds on trivial loop New to Julia question , simd , loopvectorization	9	2106	August 28, 2021

@turbo macro giving slightly different results

Related topics