```@turbo``` producing different (and wrong) results compared to ```@inbounds @simd```

bremez · March 30, 2023, 5:16pm

I have a use case where an integer array contains only values between 1 and some maxValue, and I would like the histogram of the array values. I found that counts(array, maxValue) was too slow, and as best as I could understand, it scans the array maxValue times to count each possible value separately, instead of scanning the array once.

I thought then to implement this by hand with

using LoopVectorization
function mycounts(array, maxValue)
        count = zeros(Int64, maxValue)
        @turbo for i in eachindex(array)
            count[array[i]] +=  1;
        end
        return count;
 end

However, I found that this produces the wrong results. For example mycounts(ones(Int64, 1024) gives 32 instead of 1024, and in fact increasing the size of the array, I always get 1-32th of the correct value. So perhaps this might have to do with the loop unrolling factor of @turbo? I can confirm that replacing it with @inbounds @simd, I get the correct results. Is there some nuance to the safe operation of @turbo relative to @simd that I am missing, or is this an actual bug that I should report?

(Since I have maxValue of order 100, the speed savings with @simd are already significant; yet the (wrong) implementation with @turbo gains another x5 performance boost compared to @simd which would be nice to have.)

Elrod · March 30, 2023, 5:23pm

This is a known issue. The issue is that @turbo assumes the indices array[i] do not alias one another.
Reading and writing to the same index of count at the same time (because that assumption is wrong) is why the answer is wrong.

bremez · March 30, 2023, 5:28pm

@Elrod Just to be sure, I’ve had instances where I @turbo a loop that, say, adds to the same scalar accumulator, and I didn’t see an issue. So this has to do with the indexing into count? Is there a way to tell when I should and should not use @turbo?

Elrod · March 30, 2023, 5:34pm

Yeah, if you do

count +=

it knows it is the same count, and (if it has to) will create a bunch of copies of count so that it can add them in parallel, and combine them at the end.
With

count[i] +=

it currently just assumes that every i is unique so that it doesn’t have to do that.

If it didn’t assume that, it could in principle do the same thing as it does for count, and create a bunch of parallel accumulators, one for eachindex(count). However, that could be arbitrarily expensive…

vpconflict may be a better option on CPUs with AVX512, but it isn’t too fast, so that a scalar loop may be the way to go.

Topic		Replies	Views
@turbo macro giving slightly different results General Usage loopvectorization	6	502	January 28, 2023
Inconsistent results using LoopVectorization @turbo with linear indexing Performance	1	238	October 2, 2023
LoopVectorization: @turbo performs worse than @inbounds on trivial loop New to Julia question , simd , loopvectorization	9	2098	August 28, 2021
Understanding @turbo behavior to use it right General Usage question	15	1410	September 20, 2022
@turbo macro gives incorrect results Performance loopvectorization	4	458	October 26, 2022

```@turbo``` producing different (and wrong) results compared to ```@inbounds @simd```

Related topics