How to correctly unroll loop in LoopVectorization?

Ahmed_Salih · February 14, 2024, 11:54pm

Hello!

I have made a toy script here,

Summary

Toy script example

using StaticArrays
using LoopVectorization
using StructArrays
using BenchmarkTools

struct DimensionalData{D, T}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end

function updateV!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVT!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[1][iter] = data.vectors[1][i] - data.vectors[1][j]  # Compute the difference for the d-th dimension
            result.vectors[2][iter] = data.vectors[2][i] - data.vectors[2][j]  # Compute the difference for the d-th dimension
        end
end

# Create a 2D DimensionalData with vectors of length 5
let
    D    = 2
    T    = Float64
    N    = 10000
    NL   = 500000
    data = DimensionalData(rand(N),rand(N))
    I    = rand(1:N, NL)
    J    = rand(1:N, NL)
    P    = DimensionalData{2,Float64}(NL)

    println("Naive:"); display(@benchmark updateV!($P,$data,$I,$J))
    println("Turbo:"); display(@benchmark updateVT!($P,$data,$I,$J))
    println("Turbo Manual Unroll:"); display(@benchmark updateVTManual!($P,$data,$I,$J))
end

to showcase what I am trying to figure out. The issue I have is that I am working with data which can be either 1, 2 or 3 dimensional. The toy script is simply calculating the differences between values. When I benchmark I see the following results:

**Naive:**  
BenchmarkTools.Trial: 2290 samples with 1 evaluation.
 Range (min … max):  1.544 ms …   6.255 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.814 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.170 ms ± 788.018 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▇▆▆▅▅▅▄▂▃▁▂▂▁▂ ▁           ▁▂▂▁▂▁▁                        ▁
  ███████████████████▆█▇▇█▆▅▇▅▆████████▆██▇█▇▆▇▇▆▅▅▄▄▆▅▅▄▅▄▅▆ █
  1.54 ms      Histogram: log(frequency) by time      4.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

**Turbo:**
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min … max):  1.372 ms …  31.307 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.633 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.721 ms ± 626.874 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▆█▆▂▁▂▂
  ▃████████▇▇▇▇▅▇▇▆▅▅▅▅▆▄▆▅▅▆▅▄▃▄▄▃▄▃▃▃▃▃▃▃▂▂▃▂▁▂▂▂▂▁▂▂▁▁▁▁▁▁ ▃
  1.37 ms         Histogram: frequency by time        2.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

**Turbo Manual Unroll:**
BenchmarkTools.Trial: 3856 samples with 1 evaluation.
 Range (min … max):  984.900 μs …   2.480 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.219 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.288 ms ± 238.877 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃▆▆█▆▇▆▅▅▄▃▅▂▃▁ ▁
  ▄██████████████████▇█▇█▆▆▆▇▆▆▅▆▆▅▅▄▅▅▄▄▄▄▄▃▄▃▃▃▃▃▃▃▃▂▃▂▂▁▂▂▂▂ ▅
  985 μs           Histogram: frequency by time         2.07 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

And I see that while @tturbo provides a performance boost in both cases, in the case where I manually unroll the loop compared to looping over the dimension, d, then I lose out on 20ish percent performance.

Is there a way in which I can get the highest speed possible while not having to manually unroll the loop?

Thanks!

abraemer · February 15, 2024, 7:13am

Hi there

I am not sure LoopVectorization can help you much here (but I am not an LV expert!). I suggest you write

function updateV2!(result::DimensionalData, data::DimensionalData, I, J)
    @inbounds for iter ∈ eachindex(I,J)
        i, j = I[iter], J[iter]
        result.V[iter] = data.V[i] - data.V[j]  # Compute the difference for the d-th dimension
    end
end

This effectively unrolls the computation inside the loop because you subtract 2 StaticArrays. You can try and change @inbounds to @tturbo but that gives a warning from LoopVectorization that it does not understand the arguments.

Swapping around the loop which I tried first

@tturbo for iter ∈ eachindex(I,J)
    for d ∈ 1:length(result.vectors)

Also does not work. I think LV has trouble understanding this doubled indexing result.vectors[d][iter] but again I am not a LV expert.

Small comment aside: It is (imo) a bit easier to read if you write NTuple{N,T} instead of Tuple{Vararg{T,N}}. So you could change the fieldtypes to

struct DimensionalData{D, T}
    vectors::NTuple{D, Vector{T}}
    V::StructArray{SVector{D, T}, 1, NTuple{D, Vector{T}}}

Elrod · February 15, 2024, 7:46am

You mean the manual is faster, and you’re asking how to avoid it?
It is faster because it is @tturbo and multithreading once.
Also, did you start Julia with only a single thread?
You’re using @tturbo, which is multithreaded but requires you to start Julia with multiple, e.g. julia -tauto. You can check with Threads.nthreads().
I get

Naive:
BenchmarkTools.Trial: 562 samples with 1 evaluation.
 Range (min … max):  1.593 ms …   5.570 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.721 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.775 ms ± 328.642 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▄█▅                                                        
  ▄▄█████▄▅▁▁▄▄▁▁▁▁▁▁▁▁▆▁▁▁▁▄▅▁▄▁▁▁▄▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▄ ▆
  1.59 ms      Histogram: log(frequency) by time      3.67 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
Turbo:
BenchmarkTools.Trial: 7191 samples with 1 evaluation.
 Range (min … max):  127.183 μs … 334.352 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.464 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.318 μs ±  14.360 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▁  ▁▃▃▂▂▁▁▄▄▃▁ ▁▂▂▁           ▁▁▁                           ▁
  ███▇██████████████████▇▇███▇▇▇▇█████▆▆▇▆▅▇▆▅▃▅▂▃▄▂▄▃▂▃▃▄▂▂▃▄▄ █
  127 μs        Histogram: log(frequency) by time        179 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Turbo Manual Unroll:
BenchmarkTools.Trial: 6883 samples with 1 evaluation.
 Range (min … max):  135.449 μs … 339.788 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     136.725 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   141.212 μs ±   9.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▇▂    ▁▂▂▃▃▃▃▃▃▃▃▂▂▂▂▁                 ▁▁ ▁                 ▁
  ████▇▇▇█████████████████▇███▇▇██▇▇▇▆▄▆▆▇███████▇▇▆▆▅▆▆▆▆▆▇▆▅▆ █
  135 μs        Histogram: log(frequency) by time        166 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

from your example.

It does not understand it, but in this case because the for d loop is outside @tturbo, it considers d a constant w/ respect to the loop and hoists result.vectors[d] out.

You can use Base.Cartesian.@nexpr to manually unroll more conveniently:

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexpr 2 d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
end

and this should work like the Manual while staying DRY.

abraemer · February 15, 2024, 7:53am

For this statement I swapped the loop over d to the inside:

function updateVT_swap!(result::DimensionalData, data::DimensionalData, I, J)
    @tturbo for iter ∈ eachindex(I,J)
        for d ∈ 1:length(result.vectors)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

But this raises an UndefVarError: `d` not defined

Elrod:

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexpr 2 d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
end

Also I think this will not work. If I understand @Ahmed_Salih correctly then the number of dimensions can be 1,2 or 3 while your example hardcodes it to 2. The hand unrolled example is a bit misleading and should only demonstrate that unrolling helps.

Ahmed_Salih · February 15, 2024, 7:54am

I think you might just have saved my data in regards to Julia, I was in the process of flattening vectors trying to get it to work…

As @abraemer is saying, if I can get the “2” automized based on dims 1, 2 or 3, then I am set

In worse case I do multiple dispatch and use the same code, but have that expression hardcoded - that would triple my code base though…

Kind regards

Elrod · February 15, 2024, 7:55am

Yes, it does not understand it. I was saying it worked only because it was outside, and could thus be handled by treating it as a single index.

You can use @generated with Base.Cartesian to interpolate values.

Elrod · February 15, 2024, 7:56am

Something like this should work (I did not test):

@generated function updateVTManual!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexpr $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end

You could also take an approach more like $([:(result.vectors[$d][iter] = data.vectors[$d][i] - data.vectors[$d][j]) for d in 1:D]...) instead of Base.Cartesian, which is a common pattern in StaticArrays.jl, but that’s less readable IMO.

abraemer · February 15, 2024, 7:57am

Ah I see. Thanks for clarification

I would suggest a slight modification as the information is already in the type of DimensionalData:

@generated function updateVTManual!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexpr $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end

Elrod · February 15, 2024, 7:58am

Yes, definitely do that. I’m editing my comment to reflect this (the original version passed a Val{D}, which you can use if you can’t get the parameter from elsewhere).

Ahmed_Salih · February 15, 2024, 8:16am

Thank you to both of you @Elrod and @abraemer !

I got it to work and performed the benchmarks. Similar performance:

And today I learned about @generated understood when it can be needed and similarily with Base.Cartesian.@nexprs - I actually need this for something else as well, so so happy to have seen it.

In the bottom I basically check that the result calculated with the naive solution is equal to this automated one. Full script below:

Naive test script

using StaticArrays
using LoopVectorization
using StructArrays
using BenchmarkTools

struct DimensionalData{D, T}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end

function flatten(data::DimensionalData)
    flatVector = Vector{eltype(data.V)}()  # Initialize an empty vector of the appropriate type
    for vec in data.vectors
        append!(flatVector, vec)  # Append each vector's contents to the flatVector
    end
    return flatVector
end


function updateV!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVT!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[1][iter] = data.vectors[1][i] - data.vectors[1][j]  # Compute the difference for the d-th dimension
            result.vectors[2][iter] = data.vectors[2][i] - data.vectors[2][j]  # Compute the difference for the d-th dimension
        end
end

using LoopVectorization

# Define a generated function to dynamically create expressions based on D
@generated function updateVTAuto!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end


# Test performance
let
    D    = 2
    T    = Float64
    N    = 10000
    NL   = 500000
    data = DimensionalData(rand(N),rand(N))
    I    = rand(1:N, NL)
    J    = rand(1:N, NL)
    P    = DimensionalData{2,Float64}(NL)

    println("Naive:"); display(@benchmark updateV!($P,$data,$I,$J))
    println("Turbo:"); display(@benchmark updateVT!($P,$data,$I,$J))
    println("Turbo Manual Unroll:"); display(@benchmark updateVTManual!($P,$data,$I,$J))
    println("Turbo Auto Unroll:"); display(@benchmark updateVTAuto!($P,$data,$I,$J))
end

# Test correctness
let
    D     = 1
    T     = Float64
    N     = 10000
    NL    = 500000
    data  = DimensionalData(rand(N),rand(N))
    I     = rand(1:N, NL)
    J     = rand(1:N, NL)
    P1    = DimensionalData{2,Float64}(NL)
    P2    = DimensionalData{2,Float64}(NL)

    updateV!(P1,data,I,J)
    updateVTAuto!(P2,data,I,J)

    println("Dimension 1 is: ", P1.vectors[1] ≈ P2.vectors[1])
    println("Dimension 2 is: ", P1.vectors[2] ≈ P2.vectors[2])

    D     = 2
    T     = Float64
    N     = 10000
    NL    = 500000
    data  = DimensionalData(rand(N),rand(N))
    I     = rand(1:N, NL)
    J     = rand(1:N, NL)
    P1    = DimensionalData{2,Float64}(NL)
    P2    = DimensionalData{2,Float64}(NL)

    updateV!(P1,data,I,J)
    updateVTAuto!(P2,data,I,J)

    println("Dimension 1 is: ", P1.vectors[1] ≈ P2.vectors[1])
    println("Dimension 2 is: ", P1.vectors[2] ≈ P2.vectors[2])

    D     = 3
    T     = Float64
    N     = 10000
    NL    = 500000
    data  = DimensionalData(rand(N),rand(N))
    I     = rand(1:N, NL)
    J     = rand(1:N, NL)
    P1    = DimensionalData{2,Float64}(NL)
    P2    = DimensionalData{2,Float64}(NL)

    updateV!(P1,data,I,J)
    updateVTAuto!(P2,data,I,J)

    println("Dimension 1 is: ", P1.vectors[1] ≈ P2.vectors[1])
    println("Dimension 2 is: ", P1.vectors[2] ≈ P2.vectors[2])
end

This is what I enjoy with Julia, the way to express code is fairly intuitive. At the end of the day, even if a newcomer wouldn’t understand @generated and Base.Cartesian.@nexprs instantly, I think that the body of the loop is well understood, which is the nicest part

Kind regards

Ahmed_Salih · February 15, 2024, 8:19am

These timings are amazing!

Can I kindly ask what CPU you are using? I am on an older I7 with I believe 4 physical cores and I have in my VSCode settings the following:

"julia.NumThreads": 4,

And in the terminal it says:

julia> Threads.nthreads()
4

So I think I am starting Julia right, but my CPU is not able to take full advantage it seems - a 10x in time in your timings really makes me start wondering how fast my code actually is hehe

Kind regards

Elrod · February 15, 2024, 9:13am

Also an older CPU (7th gen, September 2017), but one with 18 cores/36 threads:

julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f064 (2024-02-13 20:41 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 36 default, 0 interactive, 18 GC (on 36 virtual cores)

LV will only use up to 1 thread per core, so it’ll top out at 18 even though Threads.nthreads() == 36.

Ahmed_Salih · February 15, 2024, 8:47pm

Thanks for explaining! Seems like I should consider a CPU upgrade when possible, the harder it can work for me, the faster I can fail and try again!

Just wanted to show you the impact of the Base.Cartesian.@nexprs you showed me earlier today. It has helped me make generic code for either 2D or 3D particle simulations. For the 2D case, in the past I had modelled it as 3D out of necessity. Then a timing would look something like:

Going to 2D, reducing every computation by a 3rd, gives as one might expect about 1/3 in performance increase:

So basically going from 130 to 90 seconds by removing the unnecessary extra dimension in 2D!

Kind regards

Topic		Replies	Views
A macro to unroll by hand but not by hand? Performance macros , metaprogramming	24	2863	January 19, 2022
[ANN] LoopVectorization Package Announcements	157	23288	May 27, 2020
Strange performance of a loop Performance	37	3234	July 21, 2018
LoopVectorization multithreading for multidimensional arrays Numerics loopvectorization	24	1233	March 17, 2022
ANN: LoopVectorization 0.12: multithreading and better handling of discontiguous memory accesses Performance	16	2169	March 17, 2021

How to correctly unroll loop in LoopVectorization?

Related topics