# Repeated contractions - Vector-of-StaticArrays vs HybridArrays.jl, discrepancy in performance between Float64 and Float32

Hi,

I am doing some pre-emptive benchmarking for a larger project. As part of it, I am checking the best way to store N pre-computed 3×3 matrices with which we contract N length-3 vectors. I represent the matrix data in two ways: either as a vector-of-static-arrays, or using HybridArrays.jl with the last dimension being the "dynamic" one of length N.

```julia
using StaticArrays, BenchmarkTools, HybridArrays, LinearAlgebra, LoopVectorization, Tullio, InteractiveUtils

function multiplies(M::Vector, V::Vector, N::Int)
    T = eltype(M[1])
    O = Vector{T}(undef, N)
    # unlikely simd does anything here
    for i in 1:N
        @inbounds Lvec = V[i]
        @inbounds O[i] = Lvec' * (M[i] * Lvec)
    end
    return O
end

function multiplies(M::HybridArray, V::HybridArray, N::Int)
    T = eltype(V)
    O = Vector{T}(undef, N)
    # unlikely simd does anything here
    @views for i in 1:N
        @inbounds Lvec = V[:, i]
        @inbounds O[i] = Lvec' * (M[:, :, i] * Lvec)
    end
    return O
end

const N_vectors = 500000
const N_elems = 3

const T = Float32
const random_data = randn(T, (N_elems, N_elems, N_vectors))
const random_vector = randn(T, (N_elems, N_vectors))

const random_sdata = [SMatrix{N_elems,N_elems}(random_data[:, :, i]) for i in 1:N_vectors]
const random_svector = [SVector{N_elems}(random_vector[:, i]) for i in 1:N_vectors]
const random_ddata = HybridArray{Tuple{N_elems,N_elems,StaticArrays.Dynamic()}}(random_data)
const random_dvector = HybridArray{Tuple{N_elems,StaticArrays.Dynamic()}}(random_vector)

destination = Vector{T}(undef, N_vectors)

@inline f(V, M) = V' * (M * V)

@btime multiplies($random_sdata, $random_svector, $N_vectors)
@btime multiplies($random_ddata, $random_dvector, $N_vectors)

O1 = multiplies(random_sdata, random_svector, N_vectors)
O2 = multiplies(random_ddata, random_dvector, N_vectors)

# checking numerical correctness
println(all(O1 .== O2))

println("LoopVectorization ops")
@btime vmap($f, $random_svector, $random_sdata)
@btime vmap!($f, $destination, $random_svector, $random_sdata)
```

For 32-bit floating-point numbers, the benchmark yields (in order: Vector-of-StaticArrays, HybridArrays, vmap, vmap!):

```
582.250 μs (2 allocations: 1.91 MiB)
799.291 μs (2 allocations: 1.91 MiB)
691.875 μs (2 allocations: 1.91 MiB)
691.291 μs (0 allocations: 0 bytes)
```

However, for Float64 I get:

```
795.750 μs (2 allocations: 3.81 MiB)
882.792 μs (2 allocations: 3.81 MiB)
793.959 μs (2 allocations: 3.81 MiB)
807.125 μs (0 allocations: 0 bytes)
```

I.e. it seems that the implementation using HybridArrays.jl does not benefit as much from using Float32 as the Vector-of-StaticArrays one does (a 10% vs. 40% speedup).

I had tried to inspect the function code via `@code_warntype` but didn't see anything red. I wonder what the source of this discrepancy could be. A colleague suggested it could be because of differing memory layouts, but I feel a bit lost!

Instead of this you could use `dot(V, M, V)` (and similarly in your `multiplies` function), but I don't know whether this makes a difference with StaticArrays.jl. With normal Vectors you'd save an intermediate allocation (which is already not present with StaticArrays, as the benchmarks show).

Passing in the length of the Vectors is a bit un-Julian. You could simply use

```julia
for i in eachindex(M, V)
```

That has two advantages: 1) it errors if the lengths don't match, and 2) the compiler can prove the accesses are in bounds, so you don't need to write `@inbounds`.

In fact, you could probably replace the whole call `multiplies(M, V, N)` with just `dot.(V, M, V)`. Or, if you preallocated some output vector `O`, you could do `O .= dot.(V, M, V)`. It should be the same speed, I think.
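For reference, a minimal sketch of that suggestion put together, using plain `Matrix`/`Vector` elements instead of StaticArrays (`multiplies_dot` is a hypothetical name, not from the thread):

```julia
using LinearAlgebra

# eachindex(M, V, O) errors if lengths differ and lets the compiler elide
# bounds checks; three-argument dot avoids spelling out the temporary M[i]*V[i].
function multiplies_dot(M::Vector, V::Vector)
    O = Vector{eltype(first(V))}(undef, length(V))
    for i in eachindex(M, V, O)
        O[i] = dot(V[i], M[i], V[i])
    end
    return O
end

Ms = [randn(3, 3) for _ in 1:10]
Vs = [randn(3) for _ in 1:10]
O = multiplies_dot(Ms, Vs)
@assert O ≈ [v' * (m * v) for (m, v) in zip(Ms, Vs)]
```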


I have added the broadcasted `dot` call to the benchmark as:

```julia
@btime dot.($random_svector, $random_sdata, $random_svector)
```

Interestingly, this is noticeably slower. For vectors-of-statics, it has benchmarks of:

Float32: 2.076 ms (2 allocations: 1.91 MiB)

Float64: 2.341 ms (2 allocations: 3.81 MiB)

Furthermore, it produces slightly different results (maximum difference on the order of 1e-7 for Float32 and 1e-14 for Float64). I am not sure, but it seems that `dot(…)` orders the operations differently.
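That floating-point addition is order sensitive is easy to verify; a small sketch (the exact accumulation order inside `dot` is an internal detail, so only approximate agreement between the two forms should be expected):

```julia
using LinearAlgebra

# The classic demonstration that floating-point addition is not associative:
@assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# So the two evaluations of the quadratic form may differ by rounding,
# while still being approximately equal:
v = Float32[0.1, 0.2, 0.3]
M = Float32[1 2 3; 4 5 6; 7 8 9]
r1 = v' * (M * v)     # matrix-vector product first, then inner product
r2 = dot(v, M, v)     # three-argument dot, possibly a different order
@assert r1 ≈ r2
```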


Sometimes an unnecessary promotion to `Float64` creeps into a code base, because the default float literals are `Float64` (e.g., `0.3`).
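A quick sketch of how such a promotion shows up, and how to avoid it:

```julia
x = 1.0f0                                      # Float32
@assert typeof(x * 0.3) == Float64             # the Float64 literal 0.3 promotes
@assert typeof(x * 0.3f0) == Float32           # a Float32 literal does not
@assert typeof(x * oftype(x, 0.3)) == Float32  # generic code can use oftype
```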


I thought this was the case (still not really sure, to be honest), but `@code_warntype` isn't showing any sign of `Float64` when called with `Float32` data (HybridArrays), nor is the allocation size indicative of double memory usage.

Do you have any advice for trying to debug this further, especially in regards to catching promotions? I am a bit of a beginner.


Profiling is always good for looking into performance issues. It could also help with catching unnecessary promotion: as far as I remember, the flamegraph doesn't show the type signature of an entry (method), but what you can do is profile in VS Code with `@profview`, then locate some basic arithmetic method that's taking a significant amount of time and jump to its source (I think with Ctrl+click). The method definition should be specific to either `Float32` or `Float64`.


The autovectorizer refuses to work, `vmap!` hits a fallback, and these choke on memory bandwidth anyway.

I got it working here:

```julia
julia> random_ddata_p = permutedims(random_ddata, (3, 1, 2));

julia> random_dvector_p = permutedims(random_dvector, (2, 1));

julia> using LinuxPerf; foreachf(f::F, N, args::Vararg{Any,K}) where {F,K} = foreach(_ -> Base.donotdelete(f(args...)), 1:N)

julia> multiplies_li(random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies_turbo2(random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies(random_ddata, random_dvector, N_vectors)
true

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
           foreachf(multiplies_li, 1_000, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.21e+09  100.0%  #  3.3 cycles per ns
┌ instructions             1.10e+10  100.0%  #  3.4 insns per cycle
│ branch-instructions      5.05e+08  100.0%  #  4.6% of insns
└ branch-misses            8.11e+04  100.0%  #  0.0% of branch insns
┌ task-clock               9.69e+08  100.0%  # 968.9 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              1.22e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults)" begin
           foreachf(multiplies_turbo2, 1_000, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.98e+09  100.0%  #  3.1 cycles per ns
┌ instructions             1.11e+09  100.0%  #  0.4 insns per cycle
│ branch-instructions      2.01e+07  100.0%  #  1.8% of insns
└ branch-misses            7.15e+04  100.0%  #  0.4% of branch insns
┌ task-clock               9.73e+08  100.0%  # 972.7 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              6.33e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

Notice that the `@turbo` version executes roughly 10× fewer instructions but takes just as long, because it achieves a pitiful 0.4 instructions per clock cycle.
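A rough back-of-the-envelope for the bandwidth bottleneck (assuming `N_vectors = 500_000` and `Float32`, as in the original benchmark): each call streams all the matrices, vectors, and outputs once, which is far more data than fits in a typical L2 cache:

```julia
N = 500_000                               # N_vectors
bytes = sizeof(Float32) * (9N + 3N + N)   # 3×3 matrices + 3-vectors + scalar outputs
mib = bytes / 2^20
println(mib)   # about 24.8 MiB of traffic per call
```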


Sorry, there were some mistakes in the advice I gave:

It seems that `@profview` in VS Code is currently broken!? Anyway, it works fine for me if I use ProfileCanvas.jl and run from the REPL instead of from VS Code. See the code below.

This, too, was incorrect, sorry. The flamegraph points to these:

β¦ theyβre defined for `IEEEFloat`, which can be either `Float32` or `Float64`:

So, sadly, the flamegraph seems useless for your question. In any case, this is the code for profiling:

```julia
using StaticArrays, HybridArrays, LinearAlgebra, ProfileCanvas

function multiplies(M::HybridArray, V::HybridArray, N::Int)
    T = eltype(V)
    O = Vector{T}(undef, N)
    @views for i in 1:N
        @inbounds Lvec = V[:, i]
        @inbounds O[i] = Lvec' * (M[:, :, i] * Lvec)
    end
    O
end

const N_vectors = 500000
const N_elems = 3

const T = Float32

const random_ddata = let random_data = randn(T, (N_elems, N_elems, N_vectors))
    HybridArray{Tuple{N_elems,N_elems,StaticArrays.Dynamic()}}(random_data)
end;

const random_dvector = let random_vector = randn(T, (N_elems, N_vectors))
    HybridArray{Tuple{N_elems,StaticArrays.Dynamic()}}(random_vector)
end;

# the useless iteration is just here to allow profiling to collect more data
function iterated_multiplies(M, V, n, m)
    r = multiplies(M, V, n)
    for _ ∈ 2:m
        r = multiplies(M, V, n)
    end
    r
end

@profview iterated_multiplies(random_ddata, random_dvector, N_vectors, 10000)  # first get everything compiled
@profview iterated_multiplies(random_ddata, random_dvector, N_vectors, 10000)  # now profile for real
```

FWIW, this is a flamegraph produced by the above code:

FTR, itβs more useful as a Web page than as a static image, because then you can select a subtree.

This was with Julia v1.11.

1 Like

May I ask for the function definitions of `multiplies_turbo`? I assume it is a triple loop, but I am a bit perplexed how changing the array dims helps if we are already looping in (presumably) the correct order. Granted, it's my first time working with HybridArrays.jl.

Sorry, forgot the code:

```julia
julia> O1 = Vector{T}(undef, N_vectors); O2 = similar(O1);

julia> random_ddata_p = permutedims(random_ddata, (3, 1, 2));

julia> random_dvector_p = permutedims(random_dvector, (2, 1));

julia> N_vectors
5000

julia> function multiplies_turbo1!(O, M, V, N::Int)
           @turbo for i in 1:N
               # Lvec = V[:,i]; O[i] = Lvec' * (M[:,:,i] * Lvec), written out:
               Oi = zero(eltype(O))
               for j in axes(V, 1), k in axes(V, 1)
                   Oi += V[j, i] * M[k, j, i] * V[k, i]
               end
               O[i] = Oi
           end
           return O
       end
multiplies_turbo1! (generic function with 1 method)

julia> function multiplies_turbo2!(O, M, V, N::Int)
           # same computation, but for the permuted (vector-index-first) layout
           @turbo for i in 1:N
               Oi = zero(eltype(O))
               for j in axes(V, 2), k in axes(V, 2)
                   Oi += V[i, j] * M[i, k, j] * V[i, k]
               end
               O[i] = Oi
           end
           return O
       end
multiplies_turbo2! (generic function with 1 method)

julia> multiplies_turbo1!(O1, random_ddata, random_dvector, N_vectors) ≈ multiplies_turbo2!(O2, random_ddata_p, random_dvector_p, N_vectors) ≈ multiplies(random_ddata, random_dvector, N_vectors)
true

julia> @btime multiplies_turbo1!($O1, $random_ddata, $random_dvector, $N_vectors);
  8.087 μs (0 allocations: 0 bytes)

julia> @btime multiplies_turbo2!($O2, $random_ddata_p, $random_dvector_p, $N_vectors);
  2.031 μs (0 allocations: 0 bytes)
```

Note, using the permuted memory layout makes it about 4x faster. LinuxPerf gives you a clue as to why:

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo1!, 500_000, O1, random_ddata, random_dvector, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.72e+10   33.3%  #  4.0 cycles per ns
┌ instructions             2.51e+10   33.3%  #  1.5 insns per cycle
│ branch-instructions      8.35e+07   33.3%  #  0.3% of insns
└ branch-misses            7.17e+05   33.3%  #  0.9% of branch insns
┌ task-clock               4.31e+09  100.0%  #  4.3 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             9.36e+03   33.3%  #  5.5% of cache refs
└ cache-references         1.70e+05   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo2!, 500_000, O2, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               4.20e+09   33.3%  #  4.0 cycles per ns
┌ instructions             4.57e+09   33.4%  #  1.1 insns per cycle
│ branch-instructions      2.49e+07   33.4%  #  0.5% of insns
└ branch-misses            3.72e+03   33.4%  #  0.0% of branch insns
┌ task-clock               1.05e+09  100.0%  #  1.1 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             1.25e+03   33.2%  #  6.2% of cache refs
└ cache-references         2.02e+04   33.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 2.51e10 / 4.57e9
5.49234135667396
```

Permuting the dimensions like I did reduces the number of instructions needed by a factor of about 5.5.

Note that I decreased the size of `N_vectors` to just 5000. With this:

```julia
julia> sizeof(T) * (length(random_ddata) + length(random_dvector) + length(O2)) / (1 << 20)
0.247955322265625
```

The combined size of the arrays is about 0.25 MiB, which is small enough to fit in my CPU's 1 MiB L2 cache.
With the arrays this small, it isn't as memory bottlenecked.
The IPC is still low, though: barely more than 1 in the SIMD case, with more than 80% L1 cache misses.

Decreasing the array size down to 500, so that they fit in the L1 cacheβ¦

```julia
julia> @btime multiplies_turbo1!($O1, $random_ddata, $random_dvector, $N_vectors);
  838.746 ns (0 allocations: 0 bytes)

julia> @btime multiplies_turbo2!($O2, $random_ddata_p, $random_dvector_p, $N_vectors);
  144.273 ns (0 allocations: 0 bytes)
```

Now, v2 (with the permuted arrays) is 14× faster than before, or 1.4× faster relative to the number of elements.
The advantage of the smaller size seems to be non-existent for the already-slower v1, which was bottlenecked on the backend at 1.5 IPC even when the arrays fit in the L2 cache (due to needing a lot of shuffle instructions, which are limited to 1/clock on my desktop); it stayed at 1.5 IPC in the L1 cache.
The permuted version, on the other hand, sped up from 1.1 IPC to 1.8 IPC. Again, it also needed many times fewer instructions to get the job done.

```julia
julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo1!, 5_000_000, O1, random_ddata, random_dvector, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               1.76e+10   33.3%  #  4.0 cycles per ns
┌ instructions             2.69e+10   33.3%  #  1.5 insns per cycle
│ branch-instructions      1.55e+08   33.3%  #  0.6% of insns
└ branch-misses            1.08e+04   33.3%  #  0.0% of branch insns
┌ task-clock               4.40e+09  100.0%  #  4.4 s
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             2.38e+03   33.3%  # 20.1% of cache refs
└ cache-references         1.19e+04   33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads),(iTLB-load-misses,iTLB-loads),(cache-misses,cache-references)" begin
           foreachf(multiplies_turbo2!, 5_000_000, O2, random_ddata_p, random_dvector_p, N_vectors)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               3.03e+09   33.2%  #  4.0 cycles per ns
┌ instructions             5.58e+09   33.4%  #  1.8 insns per cycle
│ branch-instructions      9.47e+07   33.4%  #  1.7% of insns
└ branch-misses            2.85e+03   33.4%  #  0.0% of branch insns
┌ task-clock               7.59e+08  100.0%  # 759.3 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ cache-misses             4.01e+02   33.2%  # 11.8% of cache refs
└ cache-references         3.41e+03   33.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

So, no, you were not looping in the correct order.
Which order is "correct" is a little more complicated than "columns are fastest".

In computing the quadratic form `Lvec' * Mmat * Lvec`, each operation depends on the previous.
We cannot parallelize the quadratic form.
However, we are doing a huge number of these quadratic forms.
Thus, to speed up the code, the obvious answer is to perform these in parallel via SIMD instructions.

Thus, we want to quickly load multiple `Lvec`s and `Mmat`s at a time from memory.
That means we want each particular element, e.g. `Lvec[i]`, of the 1st, 2nd, 3rd… `Lvec` to be next to each other in memory.
We do not want `Lvec[i]` to be next to `Lvec[i+1]`, as we don't get to use those two elements in parallel anyway.

In v1, `Lvec[i]` is next to `Lvec[i+1]`, even though we can't use these two numbers in parallel. That is why it does a huge amount of work permuting them on the fly into v2's order (and maybe it does that inefficiently).
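A toy illustration of the two layouts (8 vectors; `Vp` plays the role of `random_dvector_p` above):

```julia
V  = randn(Float32, 3, 8)     # original layout: V[:, i] is vector i
Vp = permutedims(V, (2, 1))   # permuted layout: Vp[i, j] is component j of vector i

# In Vp, component 1 of all 8 vectors forms one contiguous column,
# which is exactly the shape a SIMD load wants:
@assert Vp[:, 1] == V[1, :]
@assert strides(Vp) == (1, 8)  # consecutive vectors are adjacent in memory
```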
