Experiments with LoopVectorization and convolutions

Yes, hence also should work with 1.11

Tullio.jl would use LV if it was loaded, otherwise no.

But Loopvectorization.jl does work with 1.11. Or, at least I had that impression.

Oh yeah, seems like the 1.11 issues have been solved:

It’s future is still unclear but at least there is some hope!

Yes, but the benchmarks for Tullio are not great without LV. If I look at my example, I see that

@bs (@tullio threads=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

gives essentially the same results of f0!, because Tullio is using LV to do the work. However, if I disable LV

@bs (@tullio threads=false avx=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

then Tullio is almost ten times slower (therefore, worse than my naive implementation). Therefore, in my very specific use case, Tullio does not seem to provide any advantage. I will check, however, if it helps with GPU (although the GPU support in Tullio is experimental).

For my own reference, and with the hope that it can help other people, I am writing here what I just found from an unrelated question replied to by @Elrod.

With the help of @turbo_debug one can know what LV is doing. Redefine the function as

function f0_debug!(out::AbstractArray{T,N}, A::AbstractArray{S,N}, kernel::AbstractArray{K,N}) where {T,S,K,N}
    LoopVectorization.@turbo_debug for J ∈ CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

and then call

julia> ls = f0_debug!(out, A, kernel);

julia> LoopVectorization.choose_order(ls)
([Symbol("J#2#"), Symbol("J#1#"), Symbol("I#2#"), Symbol("I#1#")], Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#"), 1, 6)

This means that

  • LV is using the following loop orders, from the outermost to the innermost: J[2], J[1], I[2], I[1]
  • The next two symbols, I[1] and J[2], are the unrolled loops. The first is unrolled by 1, the second by 6 (the two integers values at the end of the list).
  • J[1] is the vectorized loop

There is also a related function LoopVectorization.choose_order_cost, which also provides some “cost” for the loop and a boolean indicating if LV will inline.