I am trying to vectorize a loop, and adding the @turbo macro gives slightly different results. The code does not rely on a specific execution order, unlike the example in the post below:
I have pasted a MWE below. Are such small differences expected due to @turbo, or am I doing something wrong?
using LoopVectorization
global ans1 = 0.0e0
@turbo for i = 1:1000000
    ans1 = ans1 + sqrt(i)
end
println(ans1)
global ans2 = 0.0e0
for i = 1:1000000
    global ans2
    ans2 = ans2 + sqrt(i)
end
println(ans2)
println(ans1 - ans2)
To expand on this a little bit: Floating point addition is not associative ((a + b) + c may return something different than a + (b + c)). Since @turbo is allowed to reorder the loop operations, you may therefore get slightly different results.
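A quick way to see the non-associativity for yourself, using three ordinary Float64 literals (the variable names are just for illustration):

```julia
# Same three numbers, two different groupings.
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

println(left)           # 0.6000000000000001
println(right)          # 0.6
println(left == right)  # false
```

Each addition rounds to the nearest Float64, so changing the grouping changes which intermediate results get rounded.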
NB: Reading from and writing to non-constant global variables in a hot loop is never a good idea if you care about performance, so avoiding this first should give much larger performance improvements than just throwing @turbo on your loops.
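A minimal sketch of what avoiding the global could look like, assuming you are free to wrap the loop in a function. The name `sum_sqrt` is made up for this example, and the loop is left plain here; @turbo could be applied to it in the same way as in the MWE.

```julia
# Hypothetical helper: the accumulator `acc` is a local variable,
# so the compiler knows its type for the whole loop.
function sum_sqrt(n::Integer)
    acc = 0.0
    for i in 1:n
        acc += sqrt(i)
    end
    return acc
end

total = sum_sqrt(1_000_000)
println(total)
```

The only work left at global scope is a single function call and one assignment.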
@turbo will at least add a function barrier, so the loop itself should be fast even at global scope.
That is, you should only pay for a few dynamic dispatches in the code it generates outside the barrier, and that cost is O(1) with respect to the number of loop iterations.
FWIW, the @turbo version should be the more accurate of the two in this example on average.
Try computing the same sum with BigFloat and see which result is closer.
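Here is a rough sketch of such a comparison that runs without LoopVectorization. Note the assumption: `sum_four_lanes` is only a hand-written stand-in for a reordered sum, not the code @turbo actually generates, and all three function names are invented for this example. To check the real thing, put the @turbo loop from the MWE in its place.

```julia
# Plain left-to-right Float64 sum (same order as the loop without @turbo).
function sum_serial(n)
    acc = 0.0
    for i in 1:n
        acc += sqrt(i)
    end
    return acc
end

# Stand-in for a reordered sum: four independent partial sums that are
# combined at the end. This is NOT what @turbo emits, just an illustration
# of a different summation order.
function sum_four_lanes(n)
    a = b = c = d = 0.0
    i = 1
    while i + 3 <= n
        a += sqrt(i)
        b += sqrt(i + 1)
        c += sqrt(i + 2)
        d += sqrt(i + 3)
        i += 4
    end
    while i <= n  # leftover iterations when n is not a multiple of 4
        a += sqrt(i)
        i += 1
    end
    return (a + b) + (c + d)
end

# High-precision reference, accumulated entirely in BigFloat.
function sum_reference(n)
    acc = BigFloat(0)
    for i in 1:n
        acc += sqrt(BigFloat(i))
    end
    return acc
end

n = 1_000_000
ref = sum_reference(n)
println("serial error:    ", Float64(abs(sum_serial(n) - ref)))
println("four-lane error: ", Float64(abs(sum_four_lanes(n) - ref)))
```

Whichever error is smaller is the more accurate result for this input; both should be tiny compared with the sum itself.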
Thanks a lot. I was not sure whether the difference was due to floating-point rounding. My concern was that using the @turbo macro in my code could lead to a large error in certain circumstances. Now that I know it comes from floating-point rounding, I am not that worried.