Experiments with LoopVectorization and convolutions

Yes, hence also should work with 1.11

Tullio.jl would use LV if it was loaded, otherwise no.

1 Like

But Loopvectorization.jl does work with 1.11. Or, at least I had that impression.

2 Likes

Oh yeah, seems like the 1.11 issues have been solved:

It’s future is still unclear but at least there is some hope!

2 Likes

Yes, but the benchmarks for Tullio are not great without LV. If I look at my example, I see that

@bs (@tullio threads=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

gives essentially the same results of f0!, because Tullio is using LV to do the work. However, if I disable LV

@bs (@tullio threads=false avx=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

then Tullio is almost ten times slower (therefore, worse than my naive implementation). Therefore, in my very specific use case, Tullio does not seem to provide any advantage. I will check, however, if it helps with GPU (although the GPU support in Tullio is experimental).

For my own reference, and with the hope that it can help other people, I am writing here what I just found from an unrelated question replied to by @Elrod.

With the help of @turbo_debug one can know what LV is doing. Redefine the function as

function f0_debug!(out::AbstractArray{T,N}, A::AbstractArray{S,N}, kernel::AbstractArray{K,N}) where {T,S,K,N}
    LoopVectorization.@turbo_debug for J ∈ CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

and then call

julia> ls = f0_debug!(out, A, kernel);

julia> LoopVectorization.choose_order(ls)
([Symbol("J#2#"), Symbol("J#1#"), Symbol("I#2#"), Symbol("I#1#")], Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#"), 1, 6)

This means that

  • LV is using the following loop orders, from the outermost to the innermost: J[2], J[1], I[2], I[1]
  • The next two symbols, I[1] and J[2], are the unrolled loops. The first is unrolled by 1, the second by 6 (the two integers values at the end of the list).
  • J[1] is the vectorized loop

There is also a related function LoopVectorization.choose_order_cost, which also provides some β€œcost” for the loop and a boolean indicating if LV will inline.

6 Likes