High performance vector/matrix/tensor linear algebra operations

Hi, I have a number of vectors (say 20 of them) of the same size (say 256 Float64 each).
I want to add two of them and store the result in a third one, with the indices of the operands and the destination given as input.

TL;DR: what is the fastest way to perform linear algebra operations on such vectors on CPU?

Sample code for adding two vectors:

vectors1 = Vector{Vector{Float64}}([zeros(Float64, 256) for _ = 1:20]);
@inline function add_vec(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end

When I check the performance, it is unsatisfactory; I do this operation a lot, so the cost adds up.

using BenchmarkTools
@benchmark add_vec(vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.430 ns … 95.037 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     31.739 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   31.967 ns Β±  1.484 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

        β–β–‚β–„β–…β–‡β–ˆβ–ˆβ–ˆβ–‡β–…β–„β–„β–ƒβ–‚                                        β–‚
  β–‚β–‚β–‚β–‡β–‡β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–…β–„β–‚β–„β–…β–„β–†β–„β–…β–„β–„β–…β–„β–‚β–‚β–„β–…β–†β–†β–‡β–‡β–ˆβ–‡β–‡β–‡β–‡β–‡β–†β–†β–†β–…β–„β–…β–…β–„β–… β–ˆ
  30.4 ns      Histogram: log(frequency) by time      36.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Of course it can always be worse; this is what I started with:

function add_vec_unoptimized(mem, out, in1, in2)
    # allocates a fresh result vector and rebinds mem[out] to it,
    # instead of writing into the existing one in place
    mem[out] = mem[in1] + mem[in2]
end

@benchmark add_vec_unoptimized(vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 681 evaluations.
 Range (min … max):  164.420 ns …   4.327 ΞΌs  β”Š GC (min … max):  0.00% … 86.74%
 Time  (median):     214.507 ns               β”Š GC (median):     0.00%
 Time  (mean Β± Οƒ):   304.479 ns Β± 341.656 ns  β”Š GC (mean Β± Οƒ):  17.15% Β± 13.98%

  β–„β–ˆβ–…β–ƒβ–‚β–        β–ƒβ–‚                                              ▁
  β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‡β–‡β–†β–†β–†β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–‡β–†β–†β–…β–β–ƒβ–ƒβ–β–β–β–β–β–β–β–β–β–„β–ƒβ–β–ƒβ–ƒβ–„β–…β–„β–…β–†β–†β–†β–†β–†β–†β–†β–†β–†β–„β–…β–…β–…β–†β–†β–…β–‡β–†β–‡ β–ˆ
  164 ns        Histogram: log(frequency) by time       2.15 ΞΌs <

 Memory estimate: 2.12 KiB, allocs estimate: 1.

When I look at the underlying code with code_warntype and code_native, I see it doing a lot of work internally: checking dimensions, materialize, broadcast (in the .+ version), axes, shape checks, convert, alias checks, etc.
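Concretely, this is the kind of inspection I mean (both macros come from InteractiveUtils, which the REPL loads by default):

# inspect the type-inferred and the native code for the broadcast version
@code_warntype add_vec(vectors1, 3, 1, 2)
@code_native add_vec(vectors1, 3, 1, 2)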

I also tried StaticArrays.jl to give the compiler a hint about the vector size, but it was slower.
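For context, that attempt was along these lines (a rough sketch, not my exact code; the names are illustrative):

using StaticArrays

# Rough sketch of the StaticArrays attempt: each vector becomes a length-256
# MVector so the size is a compile-time constant.
mvectors = [zeros(MVector{256, Float64}) for _ = 1:20]

@inline function add_vec_static(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end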

What is the correct way to do this?
I'm asking about this one operation for the sake of simplicity and analysis, but I'm looking for a solution that covers all linear algebra operations.
Preferably single-threaded on the CPU.

Thanks!

Loops are going to be reliable. FastBroadcast.jl should also shave off a little overhead and match the loop.

The loop would also make it easy to pass in static size information.
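For example, here is a minimal sketch of a loop that carries the (assumed) common length as a compile-time constant through a Val argument; the name and signature are illustrative, not an established API:

# Minimal sketch: pass the assumed common length N as a compile-time constant.
# The @inbounds is only safe if N really is the length of all three vectors.
function add_vec_sized!(mem, out, in1, in2, ::Val{N}) where {N}
    mo, mi1, mi2 = mem[out], mem[in1], mem[in2]
    @inbounds for i = 1:N
        mo[i] = mi1[i] + mi2[i]
    end
    return mo
end

# Usage, assuming the vectors1 from the original post:
# add_vec_sized!(vectors1, 3, 1, 2, Val(256))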

What sort of performance are you expecting? 256 additions in 30ns seems really fast to me. Maybe there’s still more to be had, but probably not a lot.

Multi-threading a single vector addition won’t help for such short time spans, but you could parallelize different vector operations.
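Something along these lines could spawn independent additions as separate tasks (a sketch only, reusing add_vec from the original post; the index triples are made up):

using Base.Threads: @spawn

# Sketch: run several independent vector additions as separate tasks.
# Assumes the (out, in1, in2) triples write to disjoint output vectors.
function add_many!(mem, triples)
    @sync for (out, in1, in2) in triples
        @spawn add_vec(mem, out, in1, in2)
    end
    return mem
end

# e.g. add_many!(vectors1, [(3, 1, 2), (6, 4, 5)])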

Just for the record: if you interpolate the container with $ in your benchmarks, the reported time drops by about a third (without $, the benchmark also measures the overhead of accessing a non-constant global):

julia> @btime add_vec(vectors1, 3, 1, 2);
  46.584 ns (0 allocations: 0 bytes)

julia> @btime add_vec($vectors1, 3, 1, 2);
  31.934 ns (0 allocations: 0 bytes)

FWIW

julia> vectors1 = Vector{Vector{Float64}}([zeros(Float64, 256) for _ = 1:20]);

julia> @inline function add_vec_loop(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           for i = eachindex(mo)
               @inbounds mo[i] = mi1[i] + mi2[i]
           end
       end
add_vec_loop (generic function with 1 method)

julia> @benchmark add_vec_loop($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.189 ns … 32.116 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     13.191 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   13.088 ns Β±  0.500 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

                            β–‡  β–ˆβ–
  β–‚β–‚β–…β–†β–ƒβ–ƒβ–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–ƒβ–ƒβ–ƒβ–‚β–‚β–ƒβ–„β–‚β–‚β–„β–ˆβ–ƒβ–„β–ˆβ–ˆβ–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  12.2 ns         Histogram: frequency by time        14.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> using FastBroadcast  # provides the @.. macro used below

julia> @inline function add_vec_fb(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           @.. mo = mi1 + mi2
       end
add_vec_fb (generic function with 1 method)

julia> @benchmark add_vec_fb($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.191 ns … 39.547 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     12.462 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   12.492 ns Β±  0.364 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

         β–‚β–‚ β–„β–ˆ   ▁
  β–‚β–ƒβ–ƒβ–ƒβ–„β–…β–…β–ˆβ–ˆβ–†β–ˆβ–ˆβ–‡β–ƒβ–…β–ˆβ–†β–‚β–‚β–β–‚β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚β–‚ β–ƒ
  12.2 ns         Histogram: frequency by time        13.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @inline function add_vec2(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           @. mo = mi1 + mi2
       end
add_vec2 (generic function with 1 method)

julia> @benchmark add_vec2($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  14.466 ns … 33.422 ns  β”Š GC (min … max): 0.00% … 0.00%
 Time  (median):     15.247 ns              β”Š GC (median):    0.00%
 Time  (mean Β± Οƒ):   15.318 ns Β±  0.419 ns  β”Š GC (mean Β± Οƒ):  0.00% Β± 0.00%

   ▁         β–β–ƒβ–†β–ˆβ–ˆβ–‡β–‡β–…β–‚                       ▁                β–‚
  β–ˆβ–ˆβ–†β–β–ƒβ–β–β–β–β–‡β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–…β–β–β–β–β–ƒβ–ƒβ–β–ƒβ–„β–ƒβ–†β–‡β–ˆβ–ˆβ–ˆβ–†β–…β–„β–β–„β–†β–†β–ˆβ–ˆβ–…β–†β–…β–ƒβ–ƒβ–ƒβ–…β–†β–…β–„β–…β–†β–…β–… β–ˆ
  14.5 ns      Histogram: log(frequency) by time      17.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I get a decent speedup, from 15 ns to 13 ns, when switching from @. to FastBroadcast.@.. or a loop.


Coming back to post an update on the solution I ended up with.

Using LoopVectorization.jl

Example:
https://docs.juliahub.com/LoopVectorization/4TogI/0.12.147/examples/matrix_multiplication/
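For the addition above, the pattern from those docs translates to something like this (a sketch; the function name is just illustrative):

using LoopVectorization

# Sketch following the docs linked above: a plain loop annotated with @turbo.
@inline function add_vec_turbo!(mem, out, in1, in2)
    mo, mi1, mi2 = mem[out], mem[in1], mem[in2]
    @turbo for i in eachindex(mo)
        mo[i] = mi1[i] + mi2[i]
    end
    return mo
end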

One thing I forgot to mention: in the above code you construct a Vector{Vector{Float64}} and then pass it to a Vector{Vector{Float64}} constructor. That is redundant, since [zeros(Float64, 256) for _ = 1:20] is already a Vector{Vector{Float64}}.

So you can just write

vectors1 = [zeros(Float64, 256) for _ = 1:20]

This is considered idiomatic.

Thanks for the reminder about the data structure.

An update on that as well: after some investigation, I get much better performance from one big two-dimensional MArray that I subset, rather than from a vector of vectors.
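Roughly the layout I mean (a sketch; the sizes and names are illustrative):

using StaticArrays

# Sketch: one 256×20 mutable static matrix, one column per vector.
big = zeros(MMatrix{256, 20, Float64})

# Add column in1 and column in2, writing into column out.
function add_col!(big, out, in1, in2)
    @inbounds for i in 1:size(big, 1)
        big[i, out] = big[i, in1] + big[i, in2]
    end
    return big
end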

That is a bit surprising. MArrays are specifically intended for small arrays, and their performance degrades as their size grows; going much beyond 100 elements is normally not worthwhile. Compilation time also gets gradually worse.

It's true that performance degrades after some size, but in my case that point seems to be well beyond 100 elements.

Probably worth checking on a case-by-case basis.