# Repeated contractions - Vector-of-StaticArrays vs HybridArrays.jl, discrepancy in performance between Float64 and Float32

Hi,

I am doing some pre-emptive benchmarking for a larger project. As part of it, I am checking the best way to store N pre-computed 3×3 matrices with which we contract N length-3 vectors. I represent the matrix data in two ways: either as a vector-of-static-arrays, or using HybridArrays.jl with the last dimension being the "dynamic" one of length N.

```julia
using StaticArrays, BenchmarkTools, HybridArrays, LinearAlgebra, LoopVectorization, Tullio, InteractiveUtils

function multiplies(M::Vector, V::Vector, N::Int)
    T = eltype(M[1])
    O = Vector{T}(undef, N)
    # unlikely simd does anything here
    for i in 1:N
        @inbounds Lvec = V[i]
        @inbounds O[i] = Lvec' * (M[i] * Lvec)
    end
    return O
end

function multiplies(M::HybridArray, V::HybridArray, N::Int)
    T = eltype(V)
    O = Vector{T}(undef, N)
    # unlikely simd does anything here
    @views for i in 1:N
        @inbounds Lvec = V[:, i]
        @inbounds O[i] = Lvec' * (M[:, :, i] * Lvec)
    end
    return O
end

const N_vectors = 500000
const N_elems = 3

const T = Float32
const random_data = randn(T, (N_elems, N_elems, N_vectors))
const random_vector = randn(T, (N_elems, N_vectors))

const random_sdata = [SMatrix{N_elems,N_elems}(random_data[:, :, i]) for i in 1:N_vectors]
const random_svector = [SVector{N_elems}(random_vector[:, i]) for i in 1:N_vectors]
const random_ddata = HybridArray{Tuple{N_elems,N_elems,StaticArrays.Dynamic()}}(random_data)
const random_dvector = HybridArray{Tuple{N_elems,StaticArrays.Dynamic()}}(random_vector)

destination = Vector{T}(undef, N_vectors)

@inline f(V, M) = V' * (M * V)

@btime multiplies($random_sdata, $random_svector, $N_vectors)
@btime multiplies($random_ddata, $random_dvector, $N_vectors)

O1 = multiplies(random_sdata, random_svector, N_vectors)
O2 = multiplies(random_ddata, random_dvector, N_vectors)

# checking numerical correctness
println(all(O1 .== O2))

println("LoopVectorization ops")
@btime vmap($f, $random_svector, $random_sdata)
@btime vmap!($f, $destination, $random_svector, $random_sdata)
```

For 32-bit floating-point numbers, the benchmark yields (in order: Vector-of-StaticArrays, HybridArrays, vmap, vmap!):

```
582.250 μs (2 allocations: 1.91 MiB)
799.291 μs (2 allocations: 1.91 MiB)
691.875 μs (2 allocations: 1.91 MiB)
691.291 μs (0 allocations: 0 bytes)
```

However, for Float64 I get:

```
795.750 μs (2 allocations: 3.81 MiB)
882.792 μs (2 allocations: 3.81 MiB)
793.959 μs (2 allocations: 3.81 MiB)
807.125 μs (0 allocations: 0 bytes)
```

I.e. it seems that the implementation using HybridArrays.jl does not benefit as much from using Float32 as the Vector-of-StaticArrays one does (a 10% vs. 40% speedup).

I had tried to inspect the function code via `@code_warntype` but didn't see anything red. I wonder what the source of this discrepancy could be. A colleague suggested it could be because of differing memory layouts, but I feel a bit lost!

Instead of this you could use `dot(V, M, V)` (and similarly in your `multiplies` function), but I don't know whether this makes a difference with StaticArrays.jl. With normal Vectors you'd save an intermediate allocation (which is already not present with StaticArrays, as the benchmarks show).

Passing in the length of the Vectors is a bit un-Julian. You could simply use

```julia
for i in eachindex(M, V)
```

That has two advantages: 1) it errors if the lengths don't match, and 2) the compiler can prove the accesses are in bounds, so you don't need to write `@inbounds`.

In fact, you could probably replace the whole call `multiplies(M, V, N)` with just `dot.(V, M, V)`. Or, if you preallocated some output vector `O`, you could do `O .= dot.(V, M, V)`. It should be the same speed, I think.
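For reference, a minimal sketch of that suggestion put together, using plain `Matrix`/`Vector` elements instead of StaticArrays (`multiplies_dot` is a hypothetical name, not from the thread):

```julia
using LinearAlgebra

# eachindex(M, V, O) errors if lengths differ and lets the compiler elide
# bounds checks; three-argument dot avoids spelling out the temporary M[i]*V[i].
function multiplies_dot(M::Vector, V::Vector)
    O = Vector{eltype(first(V))}(undef, length(V))
    for i in eachindex(M, V, O)
        O[i] = dot(V[i], M[i], V[i])
    end
    return O
end

Ms = [randn(3, 3) for _ in 1:10]
Vs = [randn(3) for _ in 1:10]
O = multiplies_dot(Ms, Vs)
@assert O ≈ [v' * (m * v) for (m, v) in zip(Ms, Vs)]
```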


I have added the broadcasted `dot` call to the benchmark as:

```julia
@btime dot.($random_svector, $random_sdata, $random_svector)
```

Interestingly, this is noticeably slower. For vectors-of-statics, it has benchmarks of:

Float32: 2.076 ms (2 allocations: 1.91 MiB)

Float64: 2.341 ms (2 allocations: 3.81 MiB)

Furthermore, it produces slightly different results (maximum difference on the order of 1e-7 for Float32 and 1e-14 for Float64). I am not sure, but it seems that `dot(…)` orders the operations differently.
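That floating-point addition is order sensitive is easy to verify; a small sketch (the exact accumulation order inside `dot` is an internal detail, so only approximate agreement between the two forms should be expected):

```julia
using LinearAlgebra

# The classic demonstration that floating-point addition is not associative:
@assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# So the two evaluations of the quadratic form may differ by rounding,
# while still being approximately equal:
v = Float32[0.1, 0.2, 0.3]
M = Float32[1 2 3; 4 5 6; 7 8 9]
r1 = v' * (M * v)     # matrix-vector product first, then inner product
r2 = dot(v, M, v)     # three-argument dot, possibly a different order
@assert r1 ≈ r2
```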


Sometimes an unnecessary promotion to `Float64` creeps into a code base, because the default float literals are `Float64` (e.g., `0.3`).
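A quick sketch of how such a promotion shows up, and how to avoid it:

```julia
x = 1.0f0                                      # Float32
@assert typeof(x * 0.3) == Float64             # the Float64 literal 0.3 promotes
@assert typeof(x * 0.3f0) == Float32           # a Float32 literal does not
@assert typeof(x * oftype(x, 0.3)) == Float32  # generic code can use oftype
```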


I thought this was the case (still not really sure, to be honest), but `@code_warntype` isn't showing any sign of `Float64` when called with `Float32` data (HybridArrays), nor is the allocation size indicative of double memory usage.

Do you have any advice for trying to debug this further, especially in regards to catching promotions? I am a bit of a beginner.


Profiling is always good for looking into performance issues. It could also help with catching unnecessary promotion: as far as I remember, the flamegraph doesn't show the type signature of an entry (method), but what you can do is profile in VS Code with `@profview`, then locate some basic arithmetic method that's taking a significant amount of time and jump to its source (I think with Ctrl+click). The method definition should be specific to either `Float32` or `Float64`.


The autovectorizer refuses to work, `vmap!` hits a fallback, and these choke on memory bandwidth anyway.

I got it working here:

```julia
julia> random_ddata_p = permutedims(random_ddata, (3, 1, 2));

julia> random_dvector_p = permutedims(random_dvector, (2, 1));

julia> using LinuxPerf; foreachf(f::F, N, args::Vararg{Any,K}) where {F,K} = foreach(_ -> Base.donotdelete(f(args...)), 1:N)

julia> multiplies_li(random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies_turbo2(random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies(random_ddata, random_dvector, N_vectors)
true

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
           foreachf(multiplies_li, 1_000, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.21e+09  100.0%  #  3.3 cycles per ns
┌ instructions             1.10e+10  100.0%  #  3.4 insns per cycle
│ branch-instructions      5.05e+08  100.0%  #  4.6% of insns
└ branch-misses            8.11e+04  100.0%  #  0.0% of branch insns
┌ task-clock               9.69e+08  100.0%  # 968.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.22e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
           foreachf(multiplies_turbo2, 1_000, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+09  100.0%  #  3.1 cycles per ns
┌ instructions             1.11e+09  100.0%  #  0.4 insns per cycle
│ branch-instructions      2.01e+07  100.0%  #  1.8% of insns
└ branch-misses            7.15e+04  100.0%  #  0.4% of branch insns
┌ task-clock               9.73e+08  100.0%  # 972.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              6.33e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Notice that the `@turbo` version executes roughly 10× fewer instructions but takes just as long, because it achieves a pitiful 0.4 instructions per clock cycle.
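A rough back-of-the-envelope for the bandwidth bottleneck (assuming `N_vectors = 500_000` and `Float32`, as in the original benchmark): each call streams all the matrices, vectors, and outputs once, which is far more data than fits in a typical L2 cache:

```julia
N = 500_000                               # N_vectors
bytes = sizeof(Float32) * (9N + 3N + N)   # 3×3 matrices + 3-vectors + scalar outputs
mib = bytes / 2^20
println(mib)   # about 24.8 MiB of traffic per call
```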


Sorry, there were some mistakes in the advice I gave:

It seems that `@profview` in VS Code is currently broken!? Anyway, it works fine for me if I use ProfileCanvas.jl and run from the REPL instead of from VS Code. See the code below.

This, too, was incorrect, sorry. The flamegraph points to these:

β¦ theyβre defined for `IEEEFloat`, which can be either `Float32` or `Float64`:

So, sadly, the flamegraph seems useless for your question. In any case, this is the code for profiling:

```julia
using StaticArrays, HybridArrays, LinearAlgebra, ProfileCanvas

function multiplies(M::HybridArray, V::HybridArray, N::Int)
    T = eltype(V)
    O = Vector{T}(undef, N)
    @views for i in 1:N
        @inbounds Lvec = V[:, i]
        @inbounds O[i] = Lvec' * (M[:, :, i] * Lvec)
    end
    O
end

const N_vectors = 500000
const N_elems = 3

const T = Float32

const random_ddata = let random_data = randn(T, (N_elems, N_elems, N_vectors))
    HybridArray{Tuple{N_elems,N_elems,StaticArrays.Dynamic()}}(random_data)
end;

const random_dvector = let random_vector = randn(T, (N_elems, N_vectors))
    HybridArray{Tuple{N_elems,StaticArrays.Dynamic()}}(random_vector)
end;

# the useless iteration is just here to allow profiling to collect more data
function iterated_multiplies(M, V, n, m)
    r = multiplies(M, V, n)
    for _ ∈ 2:m
        r = multiplies(M, V, n)
    end
    r
end

@profview iterated_multiplies(random_ddata, random_dvector, N_vectors, 10000)  # first get everything compiled
@profview iterated_multiplies(random_ddata, random_dvector, N_vectors, 10000)  # now profile for real
```

FWIW, this is a flamegraph produced by the above code:

FTR, itβs more useful as a Web page than as a static image, because then you can select a subtree.

This was with Julia v1.11.

1 Like

May I ask for the function definitions of `multiplies_turbo`? I assume it is a triple loop, but I am a bit perplexed how changing the array dims helps if we are already looping in (presumably) the correct order. Granted, it's my first time working with HybridArrays.jl.

Sorry, forgot the code:

```julia
julia> O1 = Vector{T}(undef, N_vectors); O2 = similar(O1);

julia> random_ddata_p = permutedims(random_ddata, (3, 1, 2));

julia> random_dvector_p = permutedims(random_dvector, (2, 1));

julia> N_vectors
5000

julia> function multiplies_turbo1!(O, M, V, N::Int)
           @turbo for i in 1:N
               # Lvec = V[:,i]; O[i] = Lvec' * (M[:,:,i] * Lvec), written out:
               Oi = zero(eltype(O))
               for j in axes(V, 1), k in axes(V, 1)
                   Oi += V[j, i] * M[k, j, i] * V[k, i]
               end
               O[i] = Oi
           end
           return O
       end
multiplies_turbo1! (generic function with 1 method)

julia> function multiplies_turbo2!(O, M, V, N::Int)
           # same computation, but for the permuted (vector-index-first) layout
           @turbo for i in 1:N
               Oi = zero(eltype(O))
               for j in axes(V, 2), k in axes(V, 2)
                   Oi += V[i, j] * M[i, k, j] * V[i, k]
               end
               O[i] = Oi
           end
           return O
       end
multiplies_turbo2! (generic function with 1 method)

julia> multiplies_turbo1!(O1, random_ddata, random_dvector, N_vectors) ≈ multiplies_turbo2!(O2, random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies(random_ddata, random_dvector, N_vectors)
true

julia> @btime multiplies_turbo1!($O1, $random_ddata, $random_dvector, $N_vectors);
  8.087 μs (0 allocations: 0 bytes)

julia> @btime multiplies_turbo2!($O2, $random_ddata_p, $random_dvector_p, $N_vectors);
  2.031 μs (0 allocations: 0 bytes)
```

Note, using the permuted memory layout makes it about 4x faster. LinuxPerf gives you a clue as to why:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo1!, 500_000, O1, random_ddata, random_dvector, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.72e+10   33.3%  #  4.0 cycles per ns
┌ instructions             2.51e+10   33.3%  #  1.5 insns per cycle
│ branch-instructions      8.35e+07   33.3%  #  0.3% of insns
└ branch-misses            7.17e+05   33.3%  #  0.9% of branch insns
┌ task-clock               4.31e+09  100.0%  #  4.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             9.36e+03   33.3%  #  5.5% of cache refs
└ cache-references         1.70e+05   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo2!, 500_000, O2, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               4.20e+09   33.3%  #  4.0 cycles per ns
┌ instructions             4.57e+09   33.4%  #  1.1 insns per cycle
│ branch-instructions      2.49e+07   33.4%  #  0.5% of insns
└ branch-misses            3.72e+03   33.4%  #  0.0% of branch insns
┌ task-clock               1.05e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             1.25e+03   33.2%  #  6.2% of cache refs
└ cache-references         2.02e+04   33.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 2.51e10 / 4.57e9
5.49234135667396
```

Permuting the dimensions like I did reduces the number of instructions needed by a factor of about 5.5.

Note that I decreased the size of `N_vectors` to just 5000. With this:

```julia
julia> sizeof(T) * (length(random_ddata) + length(random_dvector) + length(O2)) / (1 << 20)
0.247955322265625
```

The combined size of the arrays is about 0.25 MiB, which is small enough to fit in my CPU's 1 MiB L2 cache.
With the arrays this small, it isn't as memory bottlenecked.
The IPC is still low, though: barely more than 1 in the SIMD case, with more than 80% L1 cache misses.

Decreasing the array size down to 500, so that they fit in the L1 cacheβ¦

```julia
julia> @btime multiplies_turbo1!($O1, $random_ddata, $random_dvector, $N_vectors);
  838.746 ns (0 allocations: 0 bytes)

julia> @btime multiplies_turbo2!($O2, $random_ddata_p, $random_dvector_p, $N_vectors);
  144.273 ns (0 allocations: 0 bytes)
```

Now, v2 (with the permuted arrays) is 14× faster than before, or 1.4× faster relative to the number of elements.
The advantage of the smaller size seems to be non-existent for the already-slower v1, which was bottlenecked on the backend at 1.5 IPC even when the arrays fit in the L2 cache (due to needing a lot of shuffle instructions, which are limited to 1/clock on my desktop); it stayed at 1.5 IPC in the L1 cache.
The permuted version, on the other hand, sped up from 1.1 IPC to 1.8 IPC. Again, it also needed many times fewer instructions to get the job done.

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo1!, 5_000_000, O1, random_ddata, random_dvector, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.76e+10   33.3%  #  4.0 cycles per ns
┌ instructions             2.69e+10   33.3%  #  1.5 insns per cycle
│ branch-instructions      1.55e+08   33.3%  #  0.6% of insns
└ branch-misses            1.08e+04   33.3%  #  0.0% of branch insns
┌ task-clock               4.40e+09  100.0%  #  4.4 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             2.38e+03   33.3%  # 20.1% of cache refs
└ cache-references         1.19e+04   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo2!, 5_000_000, O2, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.03e+09   33.2%  #  4.0 cycles per ns
┌ instructions             5.58e+09   33.4%  #  1.8 insns per cycle
│ branch-instructions      9.47e+07   33.4%  #  1.7% of insns
└ branch-misses            2.85e+03   33.4%  #  0.0% of branch insns
┌ task-clock               7.59e+08  100.0%  # 759.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             4.01e+02   33.2%  # 11.8% of cache refs
└ cache-references         3.41e+03   33.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

So, no, you were not looping in the correct order.
Which order is "correct" is a little more complicated than "columns are fastest".

In computing the quadratic form `Lvec' * Mmat * Lvec`, each operation depends on the previous.
We cannot parallelize the quadratic form.
However, we are doing a huge number of these quadratic forms.
Thus, to speed up the code, the obvious answer is to perform these in parallel via SIMD instructions.

Thus, we want to quickly load multiple `Lvec`s and `Mmat`s at a time from memory.
That means we want each particular element, e.g. `Lvec[i]`, of the 1st, 2nd, 3rd… `Lvec` to be next to each other in memory.
We do not want `Lvec[i]` to be next to `Lvec[i+1]`, as we don't get to use those two elements in parallel anyway.

In v1, `Lvec[i]` is next to `Lvec[i+1]`, even though we can't use these two numbers in parallel. That is why it does a huge amount of work permuting them on the fly into v2's order (and maybe it does that inefficiently).
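A toy illustration of the two layouts (8 vectors; `Vp` plays the role of `random_dvector_p` above):

```julia
V  = randn(Float32, 3, 8)     # original layout: V[:, i] is vector i
Vp = permutedims(V, (2, 1))   # permuted layout: Vp[i, j] is component j of vector i

# In Vp, component 1 of all 8 vectors forms one contiguous column,
# which is exactly the shape a SIMD load wants:
@assert Vp[:, 1] == V[1, :]
@assert strides(Vp) == (1, 8)  # consecutive vectors are adjacent in memory
```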
