`@turbo` producing different (and wrong) results compared to `@inbounds @simd`

I have a use case where an integer array contains only values between 1 and some `maxValue`, and I would like the histogram of the array values. I found that `counts(array, maxValue)` was too slow; as best I can tell, it scans the array `maxValue` times to count each possible value separately, instead of scanning the array once.

I then tried implementing this by hand:

```julia
using LoopVectorization

function mycounts(array, maxValue)
    count = zeros(Int64, maxValue)
    @turbo for i in eachindex(array)
        count[array[i]] += 1
    end
    return count
end
```

However, I found that this produces wrong results. For example, `mycounts(ones(Int64, 1024), 1)` gives 32 instead of 1024, and as I increase the size of the array, I always get 1/32nd of the correct value. So perhaps this has to do with the loop-unrolling factor of `@turbo`? I can confirm that replacing it with `@inbounds @simd` gives the correct results. Is there some nuance to the safe operation of `@turbo` relative to `@simd` that I am missing, or is this an actual bug that I should report?
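For reference, a minimal sketch of the `@inbounds @simd` variant that returns correct counts (the function name `mycounts_simd` is illustrative):

```julia
# Safe variant: with @simd the stores to `count` stay scalar, so
# repeated indices in `array` never collide across vector lanes.
function mycounts_simd(array, maxValue)
    count = zeros(Int64, maxValue)
    @inbounds @simd for i in eachindex(array)
        count[array[i]] += 1
    end
    return count
end
```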

(Since my `maxValue` is of order 100, the speed savings with `@simd` are already significant; yet the (wrong) `@turbo` implementation gains another 5x over `@simd`, which would be nice to have.)

This is a known issue: `@turbo` assumes the indices `array[i]` do not alias one another.
When that assumption is wrong, different vector lanes read and write the same index of `count` at the same time, which is why the answer is wrong.


@Elrod Just to be sure: I've had instances where I used `@turbo` on a loop that, say, adds to the same scalar accumulator, and I didn't see an issue. So this has to do with the indexing into `count`? Is there a way to tell when I should and should not use `@turbo`?


Yeah, if you do

```julia
count +=
```

it knows it is the same `count`, and (if it has to) will create a bunch of copies of `count` so that it can add them in parallel and combine them at the end.
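For example, a plain scalar reduction is safe under `@turbo`, because it keeps per-lane partial sums and merges them after the loop (a sketch; the names `mysum` and `s` are illustrative):

```julia
using LoopVectorization

# Scalar accumulator: @turbo recognizes the reduction on `s`,
# creates independent per-lane copies, and combines them at the end.
function mysum(array)
    s = zero(eltype(array))
    @turbo for i in eachindex(array)
        s += array[i]
    end
    return s
end
```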
With

```julia
count[i] +=
```

it currently just assumes that every `i` is unique, so that it doesn't have to do that.

If it didn't assume that, it could in principle do the same thing as it does for a scalar `count`: create a bunch of parallel accumulators, one for each index in `eachindex(count)`. However, that could be arbitrarily expensive…
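The "copies of count" idea can be sketched by hand: keep several independent histogram buffers so that consecutive iterations never touch the same slot, then merge them at the end. This is only an illustration of the memory cost (`maxValue × nlanes` counters), not what `@turbo` actually emits; all names here are illustrative.

```julia
# One histogram buffer per "lane"; iteration i writes only to its
# own lane's column, so no two in-flight increments collide.
function mycounts_lanes(array, maxValue; nlanes = 4)
    counts = zeros(Int64, maxValue, nlanes)
    @inbounds for i in eachindex(array)
        lane = (i - 1) % nlanes + 1
        counts[array[i], lane] += 1
    end
    # Merge the per-lane buffers into a single histogram.
    return vec(sum(counts, dims = 2))
end
```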

`vpconflict` may be a better option on CPUs with AVX512, but it isn't very fast, so a scalar loop may be the way to go.