@turbo macro giving slightly different results

I am trying to vectorize a loop, and adding @turbo gives slightly different results. The code does not rely on a specific execution order, unlike the example in the post below:

I have pasted an MWE below. Are such small differences expected with @turbo, or am I doing something wrong?

using LoopVectorization
global ans1 = 0.0e0
@turbo for i = 1:1000000
    ans1 = ans1 + sqrt(i)
end
println(ans1)

global ans2 = 0.0e0
for i = 1:1000000
    global ans2
    ans2 = ans2 + sqrt(i)
end
println(ans2)
println(ans1 - ans2)

The output on my computer is as follows:

6.666671664588217e8
6.666671664588418e8
-2.014636993408203e-5

A relative difference on the order of 1e-13 doesn’t look terribly bad, and yes, it’s expected.


To expand on this a little bit: Floating point addition is not associative ((a + b) + c may return something different than a + (b + c)). Since @turbo is allowed to reorder the loop operations, you may therefore get slightly different results.
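
A minimal REPL illustration of that non-associativity (example values of my own choosing, not from the original posts):

julia> (0.1 + 0.2) + 0.3
0.6000000000000001

julia> 0.1 + (0.2 + 0.3)
0.6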

NB: Reading from and writing to non-constant global variables in a hot loop is never a good idea if you care about performance, so avoiding this first should give much larger performance improvements than just throwing @turbo on your loops.
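
As a sketch of what that could look like (sumsqrt is a name I made up for illustration), moving the loop into a function keeps the accumulator local:

using LoopVectorization

# Keeping the accumulator local to a function avoids the type-unstable
# reads/writes of a global; the loop body then compiles to tight vectorized code.
function sumsqrt(n)
    s = 0.0
    @turbo for i = 1:n
        s += sqrt(i)
    end
    return s
end

sumsqrt(1_000_000)  # ≈ 6.666671664588e8, the same reduction as the MWE above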


@turbo will at least add a function barrier, so that it should be fast, even at global scope.
That is, you should only pay the cost of a few dynamic dispatches for the code it generates outside the barrier, which is O(1) w/ respect to the number of loop iterations.
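
If you want to check that on your own machine, a rough sketch (timings will vary by hardware, and the first run of each block includes compilation, so run each twice):

using LoopVectorization

# Plain loop over a global: every iteration pays a dynamic dispatch.
global slow_acc = 0.0
@time for i = 1:1_000_000
    global slow_acc
    slow_acc += sqrt(i)
end

# Same loop under @turbo: the generated kernel sits behind a function barrier,
# so only a constant number of dynamic dispatches remain outside it.
global fast_acc = 0.0
@time @turbo for i = 1:1_000_000
    fast_acc += sqrt(i)
end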

FWIW, it should be the more accurate of the two in this example on average.
Try comparing with BigFloat and see which was closer.


To complement @simeonschaub’s answer, a simple demo like this clarifies the issue of non-associativity of floating point addition.

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
0.0

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
0.0

julia-1.8.5|v1.8> x = rand(10000); sum(x[end:-1:1]) - sum(x)
9.094947017729282e-13

You generally get the most accurate answer when you add numbers of similar magnitude together first.
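
A constructed example (mine, not from the thread) of why magnitude ordering matters: once the running total dwarfs the next term, that term can be rounded away entirely, so accumulating the small terms among themselves first preserves them.

julia> x = [1.0e17; fill(1.0, 10_000)];   # one huge value followed by 10_000 ones

julia> foldl(+, x) - 1.0e17               # huge value first: each 1.0 is rounded away
0.0

julia> foldl(+, reverse(x)) - 1.0e17      # ones summed first, then added to the huge value
10000.0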


Thanks a lot. I was not sure whether it was due to floating-point error. My concern was that using the @turbo macro in my code could lead to a large error in certain circumstances. But now that I know it is just floating-point rounding, I am not that worried.

julia> using LoopVectorization

julia> global ans1 = 0.0e0
0.0

julia> @turbo for i = 1:1000000
           ans1=ans1+sqrt(i)
       end

julia> println(ans1)
6.666671664588217e8

julia> global ans2=0.0e0
0.0

julia> for i = 1:1000000
           global ans2
           ans2=ans2+sqrt(i)
       end

julia> ans3 = sum(sqrt∘big, 1:1000000) |> Float64;

julia> println(ans1-ans3)
-4.76837158203125e-7

julia> println(ans2-ans3)
1.9669532775878906e-5

julia> (ans2-ans3) / (ans1-ans3)
-41.25

For this particular example on my particular hardware (a Tiger Lake CPU), the in-order sum has about 40 times the error of the @turbo version.
