How to correctly unroll a loop in LoopVectorization?

Hello!

I have made a toy script here,

Toy script example

using StaticArrays
using LoopVectorization
using StructArrays
using BenchmarkTools

struct DimensionalData{D, T}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end

function updateV!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVT!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[1][iter] = data.vectors[1][i] - data.vectors[1][j]  # Compute the difference for the d-th dimension
            result.vectors[2][iter] = data.vectors[2][i] - data.vectors[2][j]  # Compute the difference for the d-th dimension
        end
end

# Create a 2D DimensionalData with vectors of length 5
let
    D    = 2
    T    = Float64
    N    = 10000
    NL   = 500000
    data = DimensionalData(rand(N),rand(N))
    I    = rand(1:N, NL)
    J    = rand(1:N, NL)
    P    = DimensionalData{2,Float64}(NL)

    println("Naive:"); display(@benchmark updateV!($P,$data,$I,$J))
    println("Turbo:"); display(@benchmark updateVT!($P,$data,$I,$J))
    println("Turbo Manual Unroll:"); display(@benchmark updateVTManual!($P,$data,$I,$J))
end

to showcase what I am trying to figure out. The issue is that I am working with data which can be 1-, 2-, or 3-dimensional. The toy script simply calculates differences between values. When I benchmark, I see the following results:

**Naive:**  
BenchmarkTools.Trial: 2290 samples with 1 evaluation.
 Range (min … max):  1.544 ms …   6.255 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.814 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.170 ms ± 788.018 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  1.54 ms      Histogram: log(frequency) by time      4.84 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

**Turbo:**
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min … max):  1.372 ms …  31.307 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.633 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.721 ms ± 626.874 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  1.37 ms         Histogram: frequency by time        2.54 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

**Turbo Manual Unroll:**
BenchmarkTools.Trial: 3856 samples with 1 evaluation.
 Range (min … max):  984.900 μs …   2.480 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.219 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.288 ms ± 238.877 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  985 μs           Histogram: frequency by time         2.07 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

I see that while @tturbo provides a performance boost in both cases, when I loop over the dimension d instead of manually unrolling, I lose roughly 20% of the performance.

Is there a way in which I can get the highest speed possible while not having to manually unroll the loop?

Thanks!

Hi there :slight_smile:

I am not sure LoopVectorization can help you much here (but I am not an LV expert!). I suggest you write

function updateV2!(result::DimensionalData, data::DimensionalData, I, J)
    @inbounds for iter ∈ eachindex(I,J)
        i, j = I[iter], J[iter]
        result.V[iter] = data.V[i] - data.V[j]  # Subtracting whole SVectors computes all dimensions at once
    end
end

This effectively unrolls the computation inside the loop, because you subtract two StaticArrays. You can try changing @inbounds to @tturbo, but that gives a warning from LoopVectorization that it does not understand the arguments.
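
A minimal sketch of why that is (a and b are throwaway values, just for illustration): an SVector's length is part of its type, so the subtraction compiles to straight-line code with no runtime loop.

using StaticArrays

a = SVector(1.0, 2.0)
b = SVector(0.5, 1.0)
a - b  # SVector(0.5, 1.0); all D components are subtracted without any inner loop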

Swapping the loops around, which I tried first,

@tturbo for iter ∈ eachindex(I,J)
    for d ∈ 1:length(result.vectors)

also does not work. I think LV has trouble understanding the doubled indexing result.vectors[d][iter], but again I am not an LV expert.

Small comment aside: it is (imo) a bit easier to read if you write NTuple{N, T} instead of Tuple{Vararg{T, N}}, so you could change the field types to

struct DimensionalData{D, T}
    vectors::NTuple{D, Vector{T}}
    V::StructArray{SVector{D, T}, 1, NTuple{D, Vector{T}}}
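
Both spellings denote the same type (NTuple{N, T} is just an alias for Tuple{Vararg{T, N}}), so this is purely a readability change:

julia> NTuple{2, Vector{Float64}} === Tuple{Vararg{Vector{Float64}, 2}}
true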

You mean the manual version is faster, and you're asking how to avoid writing it by hand?
It is faster because it enters @tturbo, and thus the threading, only once instead of once per dimension.
Also, did you start Julia with only a single thread?
You're using @tturbo, which is multithreaded but requires you to start Julia with multiple threads, e.g. julia -tauto. You can check with Threads.nthreads().
I get

Naive:
BenchmarkTools.Trial: 562 samples with 1 evaluation.
 Range (min … max):  1.593 ms …   5.570 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.721 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.775 ms ± 328.642 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  1.59 ms      Histogram: log(frequency) by time      3.67 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
Turbo:
BenchmarkTools.Trial: 7191 samples with 1 evaluation.
 Range (min … max):  127.183 μs … 334.352 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     128.464 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   135.318 μs ±  14.360 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  127 μs        Histogram: log(frequency) by time        179 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Turbo Manual Unroll:
BenchmarkTools.Trial: 6883 samples with 1 evaluation.
 Range (min … max):  135.449 μs … 339.788 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     136.725 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   141.212 μs ±   9.045 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  135 μs        Histogram: log(frequency) by time        166 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

from your example.

It does not understand it, but in this case, because the for d loop is outside @tturbo, it treats d as a constant with respect to the loop and hoists result.vectors[d] out.
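
In other words, each pass of the outer d loop is roughly equivalent to this hand-hoisted sketch (rv and dv are illustrative names, not from the original code):

for d ∈ 1:length(result.vectors)
    rv = result.vectors[d]   # loop-invariant with respect to the @tturbo loop, so it gets hoisted
    dv = data.vectors[d]
    @tturbo for iter ∈ eachindex(I, J)
        i, j = I[iter], J[iter]
        rv[iter] = dv[i] - dv[j]
    end
end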

You can use Base.Cartesian.@nexprs to manually unroll more conveniently:

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs 2 d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
end

and this should work like the Manual while staying DRY.
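
To see why this stays DRY: @nexprs N f splices in f(1), ..., f(N) at macro-expansion time, so

Base.Cartesian.@nexprs 2 d -> begin
    result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]
end

expands (up to cosmetic details) into the two hard-coded statements from updateVTManual!:

result.vectors[1][iter] = data.vectors[1][i] - data.vectors[1][j]
result.vectors[2][iter] = data.vectors[2][i] - data.vectors[2][j]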

For this statement I swapped the loop over d to the inside:

function updateVT_swap!(result::DimensionalData, data::DimensionalData, I, J)
    @tturbo for iter ∈ eachindex(I,J)
        for d ∈ 1:length(result.vectors)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

But this raises an UndefVarError: `d` not defined

Also, I think this will not work: if I understand @Ahmed_Salih correctly, the number of dimensions can be 1, 2, or 3, while your example hardcodes it to 2. The hand-unrolled example is a bit misleading and was only meant to demonstrate that unrolling helps.

I think you might just have saved my day in regards to Julia; I was in the process of flattening vectors trying to get it to work… :sweat_smile:

As @abraemer is saying, if I can get the "2" automated based on whether the dimension is 1, 2 or 3, then I am set :slight_smile:

In the worst case I can use multiple dispatch with the same code but the expression hardcoded for each dimension - that would triple my code base though…

Kind regards

Yes, it does not understand it. I was saying it worked only because it was outside, and could thus be handled by treating it as a single index.

You can use @generated with Base.Cartesian to interpolate values.

Something like this should work (I did not test):

@generated function updateVTManual!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end

You could also take an approach more like $([:(result.vectors[$d][iter] = data.vectors[$d][i] - data.vectors[$d][j]) for d in 1:D]...) instead of Base.Cartesian, which is a common pattern in StaticArrays.jl, but that's less readable IMO.
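
For completeness, a sketch of that comprehension-splat approach (untested; updateVTAuto2! is just a placeholder name):

@generated function updateVTAuto2!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    body = [:(result.vectors[$d][iter] = data.vectors[$d][i] - data.vectors[$d][j]) for d in 1:D]
    quote
        @tturbo for iter ∈ eachindex(I, J)
            i, j = I[iter], J[iter]
            $(body...)   # splice the D generated assignments into the loop body
        end
    end
end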

Ah I see. Thanks for the clarification :slight_smile:

I would suggest a slight modification as the information is already in the type of DimensionalData:

@generated function updateVTManual!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end

Yes, definitely do that. I'm editing my comment to reflect this (the original version passed a Val{D}, which you can use if you can't get the parameter from elsewhere).
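
For reference, a sketch of the Val-based variant (hypothetical name updateVTVal!), useful when D cannot be read off an argument's type:

@generated function updateVTVal!(result, data, I, J, ::Val{D}) where {D}
    quote
        @tturbo for iter ∈ eachindex(I, J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs $D d -> begin
                result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]
            end
        end
    end
end

called e.g. as updateVTVal!(P, data, I, J, Val(2)).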

Thank you to both of you @Elrod and @abraemer !

I got it to work and performed the benchmarks; the performance is similar to the manual unroll.

And today I learned about @generated and when it can be needed, and similarly about Base.Cartesian.@nexprs - I actually need this for something else as well, so I am very happy to have seen it.

At the bottom I basically check that the result calculated with the naive solution equals the one from the automated version. Full script below:

Naive test script
using StaticArrays
using LoopVectorization
using StructArrays
using BenchmarkTools

struct DimensionalData{D, T}
    vectors::Tuple{Vararg{Vector{T}, D}}
    V::StructArray{SVector{D, T}, 1, Tuple{Vararg{Vector{T}, D}}}

    # General constructor for vectors
    function DimensionalData(vectors::Vector{T}...) where {T}
        D = length(vectors)
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(Tuple(vectors), V)
    end

    # Constructor for initializing with all zeros, adapting to dimension D
    function DimensionalData{D, T}(len::Int) where {D, T}
        vectors = ntuple(d -> zeros(T, len), D) # Create D vectors of zeros
        V = StructArray{SVector{D, T}}(vectors)
        new{D, T}(vectors, V)
    end
end

function flatten(data::DimensionalData{D, T}) where {D, T}
    flatVector = Vector{T}()  # Flat vector of scalar components (eltype(data.V) would give SVectors)
    for vec in data.vectors
        append!(flatVector, vec)  # Append each dimension's components to the flat vector
    end
    return flatVector
end


function updateV!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVT!(result::DimensionalData, data::DimensionalData, I, J)
    for d ∈ 1:length(result.vectors)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
        end
    end
end

function updateVTManual!(result::DimensionalData, data::DimensionalData, I, J)
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            result.vectors[1][iter] = data.vectors[1][i] - data.vectors[1][j]  # Compute the difference for the d-th dimension
            result.vectors[2][iter] = data.vectors[2][i] - data.vectors[2][j]  # Compute the difference for the d-th dimension
        end
end

# Define a generated function to dynamically create expressions based on D
@generated function updateVTAuto!(result::DimensionalData{D}, data::DimensionalData, I, J) where {D}
    quote
        @tturbo for iter ∈ eachindex(I,J)
            i, j = I[iter], J[iter]
            Base.Cartesian.@nexprs $D d -> begin
              result.vectors[d][iter] = data.vectors[d][i] - data.vectors[d][j]  # Compute the difference for the d-th dimension
            end
        end
    end
end


# Test performance
let
    D    = 2
    T    = Float64
    N    = 10000
    NL   = 500000
    data = DimensionalData(rand(N),rand(N))
    I    = rand(1:N, NL)
    J    = rand(1:N, NL)
    P    = DimensionalData{2,Float64}(NL)

    println("Naive:"); display(@benchmark updateV!($P,$data,$I,$J))
    println("Turbo:"); display(@benchmark updateVT!($P,$data,$I,$J))
    println("Turbo Manual Unroll:"); display(@benchmark updateVTManual!($P,$data,$I,$J))
    println("Turbo Auto Unroll:"); display(@benchmark updateVTAuto!($P,$data,$I,$J))
end

# Test correctness for D = 1, 2 and 3
let
    T  = Float64
    N  = 10000
    NL = 500000

    for D ∈ (1, 2, 3)
        data = DimensionalData(ntuple(_ -> rand(T, N), D)...)
        I    = rand(1:N, NL)
        J    = rand(1:N, NL)
        P1   = DimensionalData{D, T}(NL)
        P2   = DimensionalData{D, T}(NL)

        updateV!(P1, data, I, J)
        updateVTAuto!(P2, data, I, J)

        for d ∈ 1:D
            println("D = $D, dimension $d matches: ", P1.vectors[d] ≈ P2.vectors[d])
        end
    end
end

This is what I enjoy about Julia: the way to express code is fairly intuitive. At the end of the day, even if a newcomer wouldn't understand @generated and Base.Cartesian.@nexprs instantly, I think the body of the loop is easy to understand, which is the nicest part :slight_smile:

Kind regards

These timings are amazing!

Can I kindly ask what CPU you are using? I am on an older i7 with, I believe, 4 physical cores, and I have the following in my VSCode settings:

"julia.NumThreads": 4,

And in the terminal it says:

julia> Threads.nthreads()
4

So I think I am starting Julia correctly, but my CPU does not seem able to take full advantage of it - a 10x difference in your timings really makes me start wondering how fast my code actually is, hehe

Kind regards

Also an older CPU (7th gen, September 2017), but one with 18 cores/36 threads:

julia> versioninfo()
Julia Version 1.10.1
Commit 7790d6f064 (2024-02-13 20:41 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 36 × Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
Threads: 36 default, 0 interactive, 18 GC (on 36 virtual cores)

LV will only use up to 1 thread per core, so it’ll top out at 18 even though Threads.nthreads() == 36.

Thanks for explaining! It seems like I should consider a CPU upgrade when possible - the harder it can work for me, the faster I can fail and try again! :sweat_smile:

Just wanted to show you the impact of the Base.Cartesian.@nexprs you showed me earlier today. It has helped me make generic code for either 2D or 3D particle simulations. In the past I had to model the 2D case as 3D out of necessity. Going to true 2D removes a third of every computation and, as one might expect, gives roughly a one-third performance increase: the simulation goes from about 130 to 90 seconds, just by removing the unnecessary extra dimension.

Kind regards
