Experiments with LoopVectorization and convolutions

I have been experimenting with the great LoopVectorization package for convolutions, and I must admit that, despite my efforts to manually refactor my code, I have been unable to even approach the speed this package delivers.

Consider the following code:

using LoopVectorization, OffsetArrays, PrettyChairmarks

A = OffsetArray(rand(150, 130))    # image, axes 1:150 × 1:130
kernel = OffsetArray(rand(31, 21)) # kernel, axes 1:31 × 1:21
out = zeros(32:151, 22:131)        # offset output, so that J - I always falls within A's axes

function f0!(out, A, kernel)
    @turbo for J ∈ CartesianIndices(out)
        tmp = zero(eltype(out))
        for I ∈ CartesianIndices(kernel)
            @inbounds tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

When I test the speed of f0!, I get

julia> @bs f0!($out, $A, $kernel) seconds=1
Chairmarks.Benchmark: 1192 samples with 1 evaluation.
 Range (min … max):  695.264 μs …  1.077 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     780.096 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   796.765 μs ± 83.091 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆   ▇   █▃  ▂▆▁▁ ▄█▂   ▃▃    ▃▁     ▂                      ▃
  █▇▅██▅▆▆█████████████▆▆██▅▆▆▆██▆▄▅▅▄█▄▅▁▁▄▁█▅▁▅▇▁▁▁▅▄▄▄▄▅▄▁█ █
  695 μs        Histogram: log(frequency) by time      1.07 ms <

 Memory estimate: 0.0 bytes, allocs estimate: 0.

After a few hours and a few headaches, this is the best manually optimized convolution code I could write:

function f1!(out, A, kernel)
    @simd for J ∈ CartesianIndices(out)
        @inbounds out[J] = zero(eltype(out))
    end
    for I ∈ CartesianIndices(kernel)
        @inbounds k = kernel[I]
        @simd ivdep for J ∈ CartesianIndices(out)
            @inbounds out[J] = muladd(A[J-I], k, out[J])
        end
    end
end

However, it is at least a factor of 3 slower than the LV version:

julia> @bs f1!($out, $A, $kernel) seconds=1
Chairmarks.Benchmark: 388 samples with 1 evaluation.
 Range (min … max):  2.255 ms …   4.595 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.430 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.429 ms ± 173.565 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆ █        ▄▂        ▅▆▁      ▁  ▂▁         ▆▄
  █▆█▆▁▁▄▇▇▆▇██▆██▇█▆▆▄███▆▁▇▁█▄█▆▄██▄▇█▇▄█▁█▇███▇▁▄▁▁▄▁▁▁▁▆▄ ▇
  2.26 ms      Histogram: log(frequency) by time      2.73 ms <

 Memory estimate: 0.0 bytes, allocs estimate: 0.

Note that this has nothing to do with loop unrolling: @turbo unroll=-1 produces exactly the same results, and with @turbo unroll=4 I obtain even better execution times. Also, using the SIMD package directly did not help in any way (I just reproduced the same speed as f1!).
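
For reference, the unroll factor is passed as a keyword to @turbo. A minimal sketch (same loop as f0!; only the name f0_unroll! is new):

```julia
using LoopVectorization

# Same loop as f0!, with an explicit unroll hint for @turbo:
# unroll=-1 disables LV's unrolling heuristic, unroll=4 requests a factor of 4.
function f0_unroll!(out, A, kernel)
    @turbo unroll=4 for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end
```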

What is LV doing under the hood? Can I hope to obtain a similar speed?

You can try LLVMLoopInfo.jl to communicate loop metadata to LLVM.

To rule out scalar tails, you can make all sizes a multiple of a power of two (no tails, no worry).
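
As an illustration of the no-tails idea, here is a sketch in plain Julia (the width of 8 Float64 lanes is an assumed AVX-512 value, not something queried from the CPU):

```julia
# Round a size up to the next multiple of the assumed SIMD width, so that a
# vectorized loop covers the array exactly and leaves no scalar tail.
const W = 8  # assumed number of Float64 lanes per vector register (AVX-512)

padded(n, w=W) = w * cld(n, w)

# e.g. the benchmark's 150x130 image would be padded to 152x136
padded.((150, 130))
```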

I guess @fastmath could also make a difference here.

@sumiya11 thank you for your suggestions. I tried a few options of LLVMLoopInfo and unfortunately none of them, as far as I can tell, helps in any way.

Also, @fastmath is of no help: muladd has no “fast” version, and replacing it with a multiplication and an addition does not reduce the execution time.

What is LV doing under the hood?

Perhaps @macroexpand can be helpful to check what @turbo is doing.
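
For anyone who wants to try, a minimal invocation could look like this (the toy loop is only an illustration; expect a long expansion, and note that the heavy lifting still happens later inside LoopVectorization's generated functions):

```julia
using LoopVectorization

x = rand(100)
y = similar(x)

# Show what @turbo expands to at macro-expansion time; the actual loop
# transforms are performed later, inside LoopVectorization._turbo_!.
expansion = @macroexpand @turbo for i in eachindex(x, y)
    y[i] = 2 * x[i]
end
println(expansion)
```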

Is LV.jl compatible with Julia v1.11?

For completeness, we should mention that for kernels this large it is advantageous to compute the convolution in the Fourier domain. Of course, your intention here is to test loop vectorization.
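
For reference, a Fourier-domain version might be sketched as follows (assuming FFTW.jl and plain 1-based arrays; fftconv is a hypothetical name returning the full (m+p-1)×(n+q-1) linear convolution):

```julia
using FFTW

# Linear 2-D convolution via the convolution theorem: zero-pad both inputs
# to the full output size, multiply their (real) FFTs, transform back.
function fftconv(A::AbstractMatrix{<:Real}, kernel::AbstractMatrix{<:Real})
    m, n = size(A)
    p, q = size(kernel)
    M, N = m + p - 1, n + q - 1
    Ap = zeros(M, N); Ap[1:m, 1:n] .= A
    Kp = zeros(M, N); Kp[1:p, 1:q] .= kernel
    return irfft(rfft(Ap) .* rfft(Kp), M)
end
```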

@pitsiannis true, FFT is faster (not by much: a factor of ~30% over LV on my laptop), but my point was to understand what LV is doing behind my back.

@photor there is a deprecation warning, which is part of my problem… I am currently using Julia v1.10 to avoid any possible issues. Incidentally, I hope someone knowledgeable enough takes over the maintenance of LV, as it is really a fantastic package.

No, that does not help much, as LV does a lot of work in internal (generated) functions, and it is (at least for me) very difficult to follow all these internal routines. Anyway, if you wish to try to find any hints, I would really appreciate it.

I tried @code_llvm instead on the LV-based f0! function, and, from what I can tell, there are no (obvious) SIMD instructions in the output, which surprises me!

An intermediate-level (no pun intended) tool between @macroexpand and @code_llvm would be @code_warntype/@code_typed. Those will include the IR that all those generated functions output. IIRC LoopVectorization can do some fancy transforms like loop reordering, and these reflection macros would let you catch that.

Actually, LV does a lot of deep magic and does not just generate Julia code; AFAIK it compiles parts of the code by itself. So @code_warntype and the like are not helpful. Here is an excerpt of the output for the function from above (with some line breaks added by me):

julia> @code_typed f0!(rand(10,10), rand(10,10), rand(10,10))
...
%72 = $(Expr(:gc_preserve_begin, Core.Argument(3), Core.Argument(4), Core.Argument(2)))
│          invoke LoopVectorization._turbo_!(
$(QuoteNode(Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 0x0000000000000001, 1, true)}()))::Val{(false, 0, 0, 0, false, 4, 32, 15, 64, 0x0000000000000001, 1, true)}, 
$(QuoteNode(Val{(:numericconstant, Symbol("###zero###6###"),
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), 
:I, :I, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0002, 0x0000), 
:J, :J, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0003, 0x0000), 
:LoopVectorization, :-, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000030002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0004, 0x0000), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0006, 0x0002), 
:LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000500060001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
:LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0007, 0x0000), 
:LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0008, 0x0003))}()))::Val{(:numericconstant, Symbol("###zero###6###"),
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x0000), 
:I, :I, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0002, 0x0000), 
:J, :J, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.loopvalue, 0x0003, 0x0000), 
:LoopVectorization, :-, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000030002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0004, 0x0000), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0005, 0x0001), 
:LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0006, 0x0002), 
:LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000500060001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
:LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x0000), 
Symbol("##DROPPED#CONSTANT##"), Symbol("##DROPPED#CONSTANT##"), LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000007, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0007, 0x0000), 
:LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000008, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0008, 0x0003))}, 
$(QuoteNode(Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x00000000000000000000000000000002, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), 
LoopVectorization.ArrayRefStruct{:kernel, Symbol("##vptr##_kernel")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), 
LoopVectorization.ArrayRefStruct{:out, Symbol("##vptr##_out")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}()))::Val{(LoopVectorization.ArrayRefStruct{:A, Symbol("##vptr##_A")}(0x00000000000000000000000000000002, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:kernel, Symbol("##vptr##_kernel")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:out, Symbol("##vptr##_out")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, 
$(QuoteNode(Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}()))::Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, $(QuoteNode(Val{(:J, :I)}()))::Val{(:J, :I)}, $(QuoteNode(Val{Tuple{Tuple{CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}, CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2), (1, 2)), ((1, 2), (3, 4), (5, 6)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Vararg{Static.StaticInt{1}, 4}}}}}}()))::Val{Tuple{Tuple{CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}, CartesianIndices{2, Tuple{Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}, Static.OptionallyStaticUnitRange{Static.StaticInt{1}, Int64}}}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1, 2), (1, 2)), ((1, 2), (3, 4), (5, 6)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64, Static.StaticInt{8}, Int64}, Tuple{Static.StaticInt{0}, Static.StaticInt{0}, Vararg{Static.StaticInt{1}, 4}}}}}}, %4::Int64, %6::Int64, %10::Int64, %12::Int64, %71::Ptr{Float64}, %63::Ptr{Float64}, %67::Ptr{Float64}, %62::Int64, %66::Int64, %70::Int64)::Any

Yes, indeed. From the LV manual, I understand that one of the first passes LV performs is to convert the (nested) loops into an internal representation based on the OperationStructs we see in the excerpt that @abroemer provided. So, unless one knows or can figure out LV's internals, it is probably more instructive to look at the LLVM representation (which, however, is quite low-level…).

It looks like the bigger problem is that the function doing the lion’s share of the work (_turbo_!) is not inlined. As such, you’d have to use Cthulhu.jl to get to the Julia or LLVM IR for most of the loop code.

That doesn’t make the massive amount of IR any easier to sift through, but it’s more informative than just seeing the setup code without the actual loops. For example, we can see that LoopVectorization.jl is emitting custom LLVM intrinsics (documented in the LLVM Language Reference Manual) while the code in f1! is not. Also, only the type information from those OperationStructs seems to be used in _turbo_!.
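
If someone wants to go down that path, the descent might start like this (interactive; from Cthulhu's call menu one would then step into the LoopVectorization._turbo_! invocation):

```julia
using Cthulhu, LoopVectorization, OffsetArrays

# The f0! from the top of the thread.
function f0!(out, A, kernel)
    @turbo for J in CartesianIndices(out)
        tmp = zero(eltype(out))
        for I in CartesianIndices(kernel)
            tmp += A[J-I] * kernel[I]
        end
        out[J] = tmp
    end
end

A = OffsetArray(rand(150, 130))
kernel = OffsetArray(rand(31, 21))
out = zeros(32:151, 22:131)

# Interactively descend through f0!'s typed IR; selecting the _turbo_! call
# in Cthulhu's menu reveals the loop body that @code_llvm alone does not show.
@descend f0!(out, A, kernel)
```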

You may find Optimizing Direct 1D Convolution Code and Optimizing Direct 2D Convolution Code useful.

Thank you @ToucheSir, I was independently reaching the same conclusions! And I now believe SIMD is actually used, via LLVM intrinsics, within _turbo_! (for example, one of the first steps is to call LoopVectorization.pick_vector_width, which computes the SIMD vector width for my processor). I will now try Cthulhu (very good point, thank you!).
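
For instance, the chosen width can be queried directly (pick_vector_width returns a Static.StaticInt, e.g. 4 for Float64 on an AVX2 machine):

```julia
using LoopVectorization

# Number of SIMD lanes LV will use for Float64 on the host CPU.
W = LoopVectorization.pick_vector_width(Float64)
println(Int(W))
```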

Thank you @RoyiAvital; correct me if I am wrong, but these two posts just conclude that @turbo gives the best performance, without further analysis of the reasons (i.e., what is happening behind @turbo?).