Experiments with LoopVectorization and convolutions

roflmaostc · December 3, 2024, 5:47am

Yes, hence also should work with 1.11

Tullio.jl would use LV if it was loaded, otherwise no.

DNF · December 3, 2024, 6:28am

But Loopvectorization.jl does work with 1.11. Or, at least I had that impression.

roflmaostc · December 3, 2024, 6:36am

Oh yeah, seems like the 1.11 issues have been solved:

It’s future is still unclear but at least there is some hope!

Marco_Lombardi · December 3, 2024, 7:40pm

Yes, but the benchmarks for Tullio are not great without LV. If I look at my example, I see that

@bs (@tullio threads=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

gives essentially the same results of f0!, because Tullio is using LV to do the work. However, if I disable LV

@bs (@tullio threads=false avx=false $out[i, j] = $A[i-l, j-k] * $kernel[l, k]) seconds=1

then Tullio is almost ten times slower (therefore, worse than my naive implementation). Therefore, in my very specific use case, Tullio does not seem to provide any advantage. I will check, however, if it helps with GPU (although the GPU support in Tullio is experimental).

Marco_Lombardi · December 3, 2024, 8:35pm

For my own reference, and with the hope that it can help other people, I am writing here what I just found from an unrelated question replied to by @Elrod.

With the help of @turbo_debug one can know what LV is doing. Redefine the function as

function f0_debug!(out::AbstractArray{T,N}, A::AbstractArray{S,N}, kernel::AbstractArray{K,N}) where {T,S,K,N}
    LoopVectorization.@turbo_debug for J ∈ CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

and then call

julia> ls = f0_debug!(out, A, kernel);

julia> LoopVectorization.choose_order(ls)
([Symbol("J#2#"), Symbol("J#1#"), Symbol("I#2#"), Symbol("I#1#")], Symbol("I#1#"), Symbol("J#2#"), Symbol("J#1#"), 1, 6)

This means that

LV is using the following loop orders, from the outermost to the innermost: J[2], J[1], I[2], I[1]
The next two symbols, I[1] and J[2], are the unrolled loops. The first is unrolled by 1, the second by 6 (the two integers values at the end of the list).
J[1] is the vectorized loop

There is also a related function LoopVectorization.choose_order_cost, which also provides some “cost” for the loop and a boolean indicating if LV will inline.

Topic		Replies	Views
LoopVectorization: @turbo performs worse than @inbounds on trivial loop New to Julia question , simd , loopvectorization	9	2094	August 28, 2021
LoopVectorization almost doubles execution time? Performance loopvectorization	6	655	July 9, 2021
ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses Performance	16	2156	March 17, 2021
PaddedViews very slow Performance simd , tullio , paddedviews	7	604	August 25, 2021
Performance of naive convolution against Python Numpy New to Julia performance , loopvectorization	20	2879	February 4, 2022

Experiments with LoopVectorization and convolutions

Related topics