Why is this loop type not supported by LoopVectorization?

james3 · August 16, 2023, 10:06pm

I am trying to understand why I can’t use @turbo in part of my code. I can reduce the issue to the following example

using LoopVectorization

f = [15, 13, 12, 10, 9, 7, 5, 4, 2, 1]
y0 = zeros(16)
y1 = zeros(16)

for i = 1:10
    y0[f[i]] += 1.0
    y0[f[i] + 1] += 2.0
end

@turbo for i = 1:10
    y1[f[i]] += 1.0
    y1[f[i] + 1] += 2.0
end

I then get

julia> [y0 y1]
16×2 Matrix{Float64}:
 1.0  1.0
 3.0  2.0
 2.0  2.0
 1.0  1.0
 3.0  2.0
 2.0  2.0
 1.0  1.0
 2.0  2.0
 1.0  1.0
 3.0  2.0
 2.0  2.0
 1.0  1.0
 3.0  2.0
 2.0  2.0
 1.0  1.0
 2.0  2.0

So I am not getting the same output when I use @turbo. Why is it not appropriate to use it in this case? It was not obvious to me from reading LoopVectorization.jl limitations, so I suspect I may have misunderstood something. Thanks.

stevengj · August 16, 2023, 10:11pm

From the @turbo documentation:

It assumes that loop iterations are independent.

Here, the loop iterations are somewhat dependent (though they can be executed in any order) because different iterations modify the same y1 element, which means that they cannot be executed in parallel (SIMD is instruction-level parallelism). And you can see that it precisely for these elements (which have value 3.0 after the scalar loop) that the results differ.

jar1 · August 16, 2023, 10:12pm

What is the definition of independence?

stevengj · August 16, 2023, 10:13pm

google “loop independence”

DNF · August 16, 2023, 10:20pm

If you can rearrange the order of the iterations without changing the result of the calculation.

For example, what happens if you run this many times:

for i = shuffle(1:10) 
    y0[f[i]] += 1.0
    y0[f[i] + 1] += 2.0
end

Do you always get the same result?

DNF · August 16, 2023, 10:49pm

I would guess that invariance to shuffling is not enough either. The different iterations should be able to execute simultaneously, so check for possible data races.

stevengj · August 16, 2023, 11:15pm

In this case yes, because addition is commutative. But the iterations cannot execute arbitrarily in parallel without data races.

Topic		Replies	Views
@turbo macro gives incorrect results Performance loopvectorization	4	458	October 26, 2022
@turbo macro giving slightly different results General Usage loopvectorization	6	505	January 28, 2023
Inconsistent results using LoopVectorization @turbo with linear indexing Performance	1	239	October 2, 2023
@turbo speeds routine, slows down everything else Performance loopvectorization	16	2610	June 5, 2021
How to use LoopVectorization.jl? General Usage parallel , loopvectorization	7	1317	September 15, 2023

Why is this loop type not supported by LoopVectorization?

Related topics