High performance vector/matrix/tensor linear algebra operations

FireCrumb · October 14, 2022, 9:30pm

Hi, I have vectors (say 20 of them) of the same size (say 256 Float64).
Now I want to add two of them and put the result into the third one, with the index of which one given as input.

TL;DR: what is the fastest way to perform linear algebra operations on such vectors on CPU?

Sample code for adding two vectors:

vectors1 = Vector{Vector{Float64}}([zeros(Float64, 256) for _ = 1:20]);
@inline function add_vec(mem, out, in1, in2)
    @inbounds mem[out] .= mem[in1] .+ mem[in2]
end

When I check the performance, it’s unsatisfactory – because I’m doing this operation a lot so it adds up.

using BenchmarkTools
@benchmark add_vec(vectors1, 3, 1, 2)

BenchmarkTools.Trial: 10000 samples with 994 evaluations.
 Range (min … max):  30.430 ns … 95.037 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.739 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   31.967 ns ±  1.484 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▁▂▄▅▇███▇▅▄▄▃▂                                        ▂
  ▂▂▂▇▇▇███████████████▇▆▅▄▂▄▅▄▆▄▅▄▄▅▄▂▂▄▅▆▆▇▇█▇▇▇▇▇▆▆▆▅▄▅▅▄▅ █
  30.4 ns      Histogram: log(frequency) by time      36.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Although obviously can always be worse, this is what I started with:

function add_vec_unoptimized(mem, out, in1, in2)
    mem[out] = mem[in1] + mem[in2]
end

@benchmark add_vec_unoptimized(vectors1, 3, 1, 2)

BenchmarkTools.Trial: 10000 samples with 681 evaluations.
 Range (min … max):  164.420 ns …   4.327 μs  ┊ GC (min … max):  0.00% … 86.74%
 Time  (median):     214.507 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   304.479 ns ± 341.656 ns  ┊ GC (mean ± σ):  17.15% ± 13.98%

  ▄█▅▃▂▁        ▃▂                                              ▁
  ███████▇▇▆▆▆▇████▇▆▆▅▁▃▃▁▁▁▁▁▁▁▁▁▄▃▁▃▃▄▅▄▅▆▆▆▆▆▆▆▆▆▄▅▅▅▆▆▅▇▆▇ █
  164 ns        Histogram: log(frequency) by time       2.15 μs <

 Memory estimate: 2.12 KiB, allocs estimate: 1.

When looking at the underlying code using code_warntype and code_native, I see it’s doing lots of things internally – checking dimensions, materialize, broadcast (in the .+ version), axes, shape, convert, alias etc.

Tried also using StaticArrays.jl to hint the compiler about the vector size, but it was slower.

What is the correct way to do this?
I’m asking for this one operation for sake of simplicity and analysis, but I’m looking for a solution for all linear algebra operations.
Preferably single-threaded CPU.

Thanks!

Elrod · October 14, 2022, 10:21pm

Loops are going to be reliable. Fastbroadcast.jl should also shave off a little overhead and match the loop.

The loop would also make it easy to pass in static size information.

DNF · October 14, 2022, 10:54pm

What sort of performance are you expecting? 256 additions in 30ns seems really fast to me. Maybe there’s still more to be had, but probably not a lot.

Multi-threading a single vector addition won’t help for such short time spans, but you could parallelize different vector operations.

matthias314 · October 15, 2022, 1:31am

Just for the record: If you use $ in your benchmarks, then the time is reduced by 1/3:

julia> @btime add_vec(vectors1, 3, 1, 2);
  46.584 ns (0 allocations: 0 bytes)

julia> @btime add_vec($vectors1, 3, 1, 2);
  31.934 ns (0 allocations: 0 bytes)

Elrod · October 15, 2022, 2:23am

FWIW

julia> vectors1 = Vector{Vector{Float64}}([zeros(Float64, 256) for _ = 1:20]);

julia> @inline function add_vec_loop(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           for i = eachindex(mo)
               @inbounds mo[i] = mi1[i] + mi2[i]
           end
       end
add_vec_loop (generic function with 1 method)

julia> @benchmark add_vec_loop($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.189 ns … 32.116 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.191 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.088 ns ±  0.500 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                            ▇  █▁
  ▂▂▅▆▃▃▂▂▂▂▂▂▂▂▁▂▃▃▃▂▂▃▄▂▂▄█▃▄██▂▂▁▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.2 ns         Histogram: frequency by time        14.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @inline function add_vec_fb(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           @.. mo = mi1 + mi2
       end
add_vec_fb (generic function with 1 method)

julia> @benchmark add_vec_fb($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  12.191 ns … 39.547 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     12.462 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   12.492 ns ±  0.364 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▂▂ ▄█   ▁
  ▂▃▃▃▄▅▅██▆██▇▃▅█▆▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  12.2 ns         Histogram: frequency by time        13.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @inline function add_vec2(mem, out, in1, in2)
           @inbounds begin
             mo = mem[out]; mi1 = mem[in1]; mi2 = mem[in2];
           end
           @. mo = mi1 + mi2
       end
add_vec2 (generic function with 1 method)

julia> @benchmark add_vec2($vectors1, 3, 1, 2)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  14.466 ns … 33.422 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.247 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   15.318 ns ±  0.419 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁         ▁▃▆██▇▇▅▂                       ▁                ▂
  ██▆▁▃▁▁▁▁▇██████████▅▁▁▁▁▃▃▁▃▄▃▆▇███▆▅▄▁▄▆▆██▅▆▅▃▃▃▅▆▅▄▅▆▅▅ █
  14.5 ns      Histogram: log(frequency) by time      17.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

I get a decent speedup from 15 → 13ns when switching from @. to FastBroadcast.@.. or a loop.

FireCrumb · January 20, 2023, 8:29am

Just back to update on the solution

Using LoopVectorization.jl

Example:
https://docs.juliahub.com/LoopVectorization/4TogI/0.12.147/examples/matrix_multiplication/

DNF · January 20, 2023, 8:58am

One thing I forgot to mention. In the above code you are taking a Vector{Vector{Float64}} and passing it into a Vector{Vector{Float64}} constructor. That is redundant, since [zeros(Float64, 256) for _ = 1:20] is already a Vector{Vector{Float64}}.

So you can just write

vectors1 = [zeros(Float64, 256) for _ = 1:20]

This is considered idiomatic.

FireCrumb · January 20, 2023, 9:01am

Thanks for the reminder about the data structure.

So also update about that – after investigation, much better performance is by using one big two dimensional MArray and subsetting it, rather than using a vector of vectors.

DNF · January 20, 2023, 9:06am

That is a bit surprising. MArrays are specifically intended for small arrays, and their performance degrades as their sizes grow – normally much beyond 100 elements is not worthwhile. Also, compilation time gets gradually worse.

FireCrumb · January 20, 2023, 9:12am

It’s true that after some size the performance degrades, but it seems to be way beyond 100 elements.

Probably worth checking on case by case basis.

Topic		Replies	Views
Multiply many-matrices by many-vectors Performance matlab , parallel , multithreading , tensors	33	6852	December 14, 2018
Summing matrix elements is >1000X slower than summing vector elements General Usage performance	8	1354	April 17, 2017
Vector of matrices vs. multidimensional arrays Performance	13	10829	January 17, 2019
Optimizing iteration over slices of multiple matrices Performance question , array	14	1941	December 30, 2018
@inbounds: is the compiler now so smart that this is no longer necessary? Performance	33	3012	July 16, 2018

High performance vector/matrix/tensor linear algebra operations

Related topics