Experiments with LoopVectorization and convolutions

I have been experimenting with the great LoopVectorization package for convolutions, and I must admit that, despite my efforts to manually refactor my code, I have been unable to even approach the speed this package delivers.

Consider the following code:

using LoopVectorization, OffsetArrays, PrettyChairmarks

A = OffsetArray(rand(150, 130))    # image, axes 1:150 × 1:130
kernel = OffsetArray(rand(31, 21)) # kernel, axes 1:31 × 1:21
out = zeros(32:151, 22:131)        # offset output, so that J - I always falls within A's axes

function f0!(out, A, kernel)
    @turbo for J ∈ CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kernel)
            @inbounds tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

When I test the speed of f0!, I get

julia> @bs f0!($out, $A, $kernel) seconds=1
Chairmarks.Benchmark: 1192 samples with 1 evaluation.
 Range (min … max):  695.264 μs …  1.077 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     780.096 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   796.765 μs ± 83.091 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆   ▇   █▃  ▂▆▁▁ ▄█▂   ▃▃    ▃▁     ▂                      ▃
  █▇▅██▅▆▆█████████████▆▆██▅▆▆▆██▆▄▅▅▄█▄▅▁▁▄▁█▅▁▅▇▁▁▁▅▄▄▄▄▅▄▁█ █
  695 μs        Histogram: log(frequency) by time      1.07 ms <

 Memory estimate: 0.0 bytes, allocs estimate: 0.

After a few hours and a few headaches, this is the best manually optimized convolution code I could write:

function f1!(out, A, kernel)
    @simd for J ∈ CartesianIndices(out)
        @inbounds out[J] = zero(eltype(out))
    end
    for I ∈ CartesianIndices(kernel)
        @inbounds k = kernel[I]
        @simd ivdep for J ∈ CartesianIndices(out)
            @inbounds out[J] = muladd(A[J-I], k, out[J])
        end
    end
end

However, it is at least a factor of 3 slower than the LV version:

julia> @bs f1!($out, $A, $kernel) seconds=1
Chairmarks.Benchmark: 388 samples with 1 evaluation.
 Range (min … max):  2.255 ms …   4.595 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.430 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.429 ms ± 173.565 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆ █        ▄▂        ▅▆▁      ▁  ▂▁         ▆▄
  █▆█▆▁▁▄▇▇▆▇██▆██▇█▆▆▄███▆▁▇▁█▄█▆▄██▄▇█▇▄█▁█▇███▇▁▄▁▁▄▁▁▁▁▆▄ ▇
  2.26 ms      Histogram: log(frequency) by time      2.73 ms <

 Memory estimate: 0.0 bytes, allocs estimate: 0.

Note that this has nothing to do with loop unrolling: @turbo unroll=-1 produces exactly the same results, and with @turbo unroll=4 I obtain even better execution times. Also, using the SIMD package directly did not help in any way (I just reproduced the same speed as f1!).
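
For reference, the unroll factor is passed as a keyword to @turbo. A minimal sketch (same loop as f0!; only the name f0_unroll! is new):

```julia
using LoopVectorization

# Same loop as f0!, with an explicit unroll hint for @turbo:
# unroll=-1 disables LV's unrolling heuristic, unroll=4 requests a factor of 4.
function f0_unroll!(out, A, kernel)
    @turbo unroll=4 for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end
```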

What is LV doing under the hood? Can I hope to obtain a similar speed?

You can try LLVMLoopInfo.jl to communicate loop metadata to LLVM.

To rule out scalar tails, you can make all sizes a multiple of a power of two (no tails, no worry).
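
As an illustration of the no-tails idea, here is a sketch in plain Julia (the width of 8 Float64 lanes is an assumed AVX-512 value, not something queried from the CPU):

```julia
# Round a size up to the next multiple of the assumed SIMD width, so that a
# vectorized loop covers the array exactly and leaves no scalar tail.
const W = 8  # assumed number of Float64 lanes per vector register (AVX-512)

padded(n, w=W) = w * cld(n, w)

# e.g. the benchmark's 150x130 image would be padded to 152x136
padded.((150, 130))
```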

I guess @fastmath could also make a difference here.

@sumiya11 thank you for your suggestions. I tried a few options of LLVMLoopInfo and unfortunately none of them, as far as I can tell, helps in any way.

Also, @fastmath is of no help: muladd has no “fast” version, and replacing it with a multiplication and an addition does not reduce the execution time.

What is LV doing under the hood?

Perhaps @macroexpand can be helpful to check what @turbo is doing.
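
For anyone who wants to try, a minimal invocation could look like this (the toy loop is only an illustration; expect a long expansion, and note that the heavy lifting still happens later inside LoopVectorization's generated functions):

```julia
using LoopVectorization

x = rand(100)
y = similar(x)

# Show what @turbo expands to at macro-expansion time; the actual loop
# transforms are performed later, inside LoopVectorization._turbo_!.
expansion = @macroexpand @turbo for i in eachindex(x, y)
    y[i] = 2 * x[i]
end
println(expansion)
```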

Is LV.jl compatible with Julia v1.11?

For completeness, we should mention that for kernels this large it is advantageous to compute the convolution in the Fourier domain. Of course, your intention here is to test loop vectorization.
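
For reference, a Fourier-domain version might be sketched as follows (assuming FFTW.jl and plain 1-based arrays; fftconv is a hypothetical name returning the full (m+p-1)×(n+q-1) linear convolution):

```julia
using FFTW

# Linear 2-D convolution via the convolution theorem: zero-pad both inputs
# to the full output size, multiply their (real) FFTs, transform back.
function fftconv(A::AbstractMatrix{<:Real}, kernel::AbstractMatrix{<:Real})
    m, n = size(A)
    p, q = size(kernel)
    M, N = m + p - 1, n + q - 1
    Ap = zeros(M, N); Ap[1:m, 1:n] .= A
    Kp = zeros(M, N); Kp[1:p, 1:q] .= kernel
    return irfft(rfft(Ap) .* rfft(Kp), M)
end
```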

@pitsiannis true, FFT is faster (not by much: a factor of ~30% over LV on my laptop), but my point was to understand what LV is doing behind my back.

@photor there is a deprecation warning, which is part of my problem… I am currently using Julia v1.10 to avoid any possible issues. Incidentally, I hope someone knowledgeable enough takes over the maintenance of LV, as it is really a fantastic package.

No, that does not help much, as LV does a lot of work in internal (generated) functions, and it is (at least for me) very difficult to follow all these internal routines. Anyway, if you wish to try to find any hints, I would really appreciate it.

I tried @code_llvm instead on the LV-based f0! function, and, from what I can tell, there are no (obvious) SIMD instructions in the output, which surprises me!

An intermediate-level (no pun intended) tool between @macroexpand and @code_llvm would be @code_warntype/@code_typed. Those will include the IR that all those generated functions output. IIRC LoopVectorization can do some fancy transforms like loop reordering, and these reflection macros would let you catch that.

Actually, LV does a lot of deep magic and does not just generate Julia code; AFAIK it compiles parts of the code by itself. So @code_warntype and the like are not helpful. Here is an excerpt of the output for the function from above (with some line breaks added by me):

julia> @code_typed f0!(rand(10,10), rand(10,10), rand(10,10))
...
%72 = $(Expr(:gc_preserve_begin, Core.Argument(3), Core.Argument(4), Core.Argument(2)))
│          invoke LoopVectorization._turbo_!(
$(QuoteNode(Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 0x0000000000000001, 1, true)}()))::Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 0x0000000000000001, 1, true)}, 
$(QuoteNode(Val{(:numericconstant, Symbol("###zero###6###"),
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), 
:I, :I, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0002, 0x0000), 
:J, :J, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0003, 0x0000), 
:LoopVectorization, :-, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000030002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0004, 0x0000), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0006, 0x0002), 
:LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000500060001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
:LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0007, 0x0000), 
:LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0008, 0x0003))}()))::Val{(:numericconstant, Symbol("###zero###6###"),
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), 
:I, :I, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0002, 0x0000), 
:J, :J, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0003, 0x0000), 
:LoopVectorization, :-, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000030002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0004, 0x0000), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0006, 0x0002), 
:LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000500060001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
:LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0007, 0x0000), 
:LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0008, 0x0003))}, 
$(QuoteNode(Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x00000000000000000000000000000002, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), 
LoopVectorization.ArrayRefStruct{:kernel, Symbol("##vptr##_kernel")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), 
LoopVectorization.ArrayRefStruct{:out, Symbol("##vptr##_out")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}()))::Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x00000000000000000000000000000002, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:kernel, Symbol("##vptr##_kernel")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:out, Symbol("##vptr##_out")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, 
$(QuoteNode(Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}()))::Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, $(QuoteNode(Val{(:J, :I)}()))::Val{(:J, :I)}, $(QuoteNode(Val{Tuple{Tuple{CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}, CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2), (1, 2)), ((1, 2), (3, 4), (5, 6)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Vararg{Static.StaticInt{1}, 4}}}}}}()))::Val{Tuple{Tuple{CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}, CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2), (1, 2)), ((1, 2), (3, 4), (5, 6)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Vararg{Static.StaticInt{1}, 4}}}}}}, %4::Int64, %6::Int64, %10::Int64, %12::Int64, %71::Ptr{Float64}, %63::Ptr{Float64}, %67::Ptr{Float64}, %62::Int64, %66::Int64, %70::Int64)::Any

Yes, indeed. From the LV manual, I understand that one of the first passes LV performs is to convert the (nested) loops into an internal representation based on the OperationStructs we see in the excerpt that @abroemer provided. So, unless one knows or can figure out LV's internals, it is probably more instructive to look at the LLVM representation (which, however, is quite low-level…).

It looks like the bigger problem is that the function doing the lion’s share of the work (_turbo_!) is not inlined. As such, you’d have to use Cthulhu.jl to get to the Julia or LLVM IR for most of the loop code.

That doesn’t make the massive amount of IR any easier to sift through, but it’s more informative than just seeing the setup code without the actual loops. For example, we can see that LoopVectorization.jl is emitting custom LLVM intrinsics (documented in the LLVM Language Reference Manual) while the code in f1! is not. Also, only the type information from those OperationStructs seems to be used in _turbo_!.
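
If someone wants to go down that path, the descent might start like this (interactive; from Cthulhu's call menu one would then step into the LoopVectorization._turbo_! invocation):

```julia
using Cthulhu, LoopVectorization, OffsetArrays

# The f0! from the top of the thread.
function f0!(out, A, kernel)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

A = OffsetArray(rand(150, 130))
kernel = OffsetArray(rand(31, 21))
out = zeros(32:151, 22:131)

# Interactively descend through f0!'s typed IR; selecting the _turbo_! call
# in Cthulhu's menu reveals the loop body that @code_llvm alone does not show.
@descend f0!(out, A, kernel)
```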

You may find Optimizing Direct 1D Convolution Code and Optimizing Direct 2D Convolution Code useful.

Thank you @ToucheSir, I was independently reaching the same conclusions! And I now believe SIMD is actually used, via LLVM intrinsics, within _turbo_! (for example, one of the first steps is to call LoopVectorization.pick_vector_width, which computes the SIMD vector width for my processor). I will now try Cthulhu (very good point, thank you!).
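
For instance, the chosen width can be queried directly (pick_vector_width returns a Static.StaticInt, e.g. 4 for Float64 on an AVX2 machine):

```julia
using LoopVectorization

# Number of SIMD lanes LV will use for Float64 on the host CPU.
W = LoopVectorization.pick_vector_width(Float64)
println(Int(W))
```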

Thank you @RoyiAvital; correct me if I am wrong, but these two posts just conclude that @turbo gives the best performance, without further analysis of the reasons (i.e., what is happening behind @turbo?).