Performance of Matrix{T} vs Vector{Vector{T}}

I come from a background in Python and C++ and I’m currently working on a project in Julia for performance comparisons. My task is to re-implement a scheme from Python in Julia and perform a performance analysis. In the Python scheme, most of the operations involve the numpy.ndarray type: operations such as row extraction, cross products, and inner products. However, I’ve noticed that Julia uses a column-major layout for Arrays, which has raised some doubts about the best way to lay out the data used in the original scheme.

For example: in the Python version, an array of coordinates for n points is used, with shape (n, 3). Specific point coordinates are frequently extracted, and operations like addition, subtraction, and cross products are performed on them. This is straightforward in Python due to numpy’s row-major layout. In Julia, however, I’m unsure about the best approach. Should I use a matrix of shape (n, 3), or perhaps (3, n)? Is Vector{Vector{T}} much slower to work with than a matrix if I’m not doing many actual matrix operations?

After some initial testing, I found it much easier to work with Vector{Vector{T}} in Julia for these cases, since the syntax for row extraction and ranges is really close to the Python one. However, I’m concerned that this approach may come with a significant performance cost. I couldn’t find much information about this topic online or in the documentation, so I’m hoping someone here can shed some light on the issue.

Thank you very much.

The canonical way is (3, n), or using an inner datatype:

using StaticArrays

data = rand(SVector{3, Float64}, 100)  # 100 points, each a static 3-vector
data[10][3]                            # third coordinate of the tenth point

Since Julia can store immutable structs inline, this is effectively the same as (3, n), with the added benefit of semantic information (each element is a 3-vector that represents a point). You could also write your own struct Point.
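
For illustration, a bare-bones hand-rolled version might look like the sketch below (Point3 is a hypothetical name; every operation has to be defined by hand, which is why an SVector is usually the more convenient choice):

# A minimal hand-rolled point type (sketch).
struct Point3{T}
    x::T
    y::T
    z::T
end

# Each operation must be defined manually:
Base.:+(a::Point3, b::Point3) = Point3(a.x + b.x, a.y + b.y, a.z + b.z)
Base.:-(a::Point3, b::Point3) = Point3(a.x - b.x, a.y - b.y, a.z - b.z)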

Vector{Vector{T}} is indeed quite a bit slower, since it involves one additional dereference for every access.
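
A quick way to see that cost yourself (an illustrative micro-benchmark; exact numbers depend on your machine):

using StaticArrays, BenchmarkTools

vv = [rand(3) for _ in 1:10_000]         # Vector{Vector{Float64}}: every element is a separate heap object
vs = rand(SVector{3, Float64}, 10_000)   # elements stored inline and contiguously

sumnorm2(v) = sum(x -> sum(abs2, x), v)  # a small reduction that touches every element

@btime sumnorm2($vv)  # follows a pointer per element
@btime sumnorm2($vs)  # contiguous memory, SIMD-friendly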

Lastly, sometimes it is interesting to experiment with different data layouts, and the Julia package StructArrays lets you do a transparent transform from an Array-of-Structs to a Struct-of-Arrays representation.

using StaticArrays
using StructArrays

data = rand(SVector{3, Float64}, 100) # AoS
data = StructArray(data) # now in SoA form

That’s very interesting! Is this StaticArray type compatible with broadcasted “.” operations and with linear algebra operations like cross and inner products?

Broadcast operations, yes. Linear algebra operations over the whole matrix? That’s a bit trickier and needs some “reshape/reinterpret”:

A = rand(SVector{3, Float64}, 100)                    # 100 static 3-vectors
reshape(reinterpret(Float64, A), 3, 100) * rand(100)  # view A as a 3×100 Matrix{Float64}, then matrix-vector

I double-checked, and that mat-vec indeed ends up calling BLAS.

See also HybridArrays.jl (https://github.com/JuliaArrays/HybridArrays.jl), which provides arrays with both statically and dynamically sized axes and supports things like linear algebra.
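
If I read its README correctly, usage is along these lines (an untested sketch; check the package docs for the exact API):

using HybridArrays, StaticArrays

A = HybridArray{Tuple{3, StaticArrays.Dynamic()}}(rand(3, 100))  # first axis static, second dynamic

A[:, 10]       # a column with statically known length
A * rand(100)  # ordinary matrix-vector multiplication still works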

Yes.

julia> using StaticArrays, LinearAlgebra

julia> x = SVector(1,2,3)
3-element SVector{3, Int64} with indices SOneTo(3):
 1
 2
 3

julia> y = SVector(0,0,1)
3-element SVector{3, Int64} with indices SOneTo(3):
 0
 0
 1

julia> cross(x, y)
3-element SVector{3, Int64} with indices SOneTo(3):
  2
 -1
  0

julia> x'*y
3

Not so! Many operations still work; you just need the outer Array to have a compatible size (notice I inserted a leading dimension into A):

julia> A = rand(SVector{3, Float64}, 1, 100); b = rand(100);

julia> A*b
1-element Vector{SVector{3, Float64}}:
 [24.835637295742075, 22.17836643473468, 21.530191642823027]

julia> only(A*b)
3-element SVector{3, Float64} with indices SOneTo(3):
 24.835637295742075
 22.17836643473468
 21.530191642823027

Alternatively (avoiding the allocation):

julia> A = rand(SVector{3, Float64}, 100); b = rand(100);

julia> b'*A
3-element SVector{3, Float64} with indices SOneTo(3):
 23.455389909261946
 23.468785998190274
 22.176127262115138

Other operations like block-matrix multiplication (i.e., between Matrix{SMatrix{3,3,Float64,9}}) also work.
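
For example (a small sketch of the block-matrix case):

julia> using StaticArrays

julia> M = rand(SMatrix{3,3,Float64,9}, 4, 4); N = rand(SMatrix{3,3,Float64,9}, 4, 4);

julia> M*N isa Matrix{SMatrix{3,3,Float64,9}}  # each element-wise multiply is a 3×3 static matmul
true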

I do have some problems still: apparently, broadcasting over rows or columns of Julia matrices is really slow. So if I want to compute something like the column-wise cross product of two (3, n) matrices A and B, cross.(A, B) won’t work, and mapslices appears to be really slow on the CPU (and way harder to use than simply broadcasting over Vector{Vector{T}} or Vector{SVector{3, T}}, for that matter).

What are my options then? Is my best option, for both convenience and performance, to use matrices and loop over the columns with a for loop? I’ve also noticed that construction of SVector and SMatrix is awfully slow, and the performance gains weren’t that large. Am I doing something terribly wrong?

Operations over a (3,N) matrix in Julia should be essentially identical in performance to those on numpy arrays with (N,3) layout.

SArrays are fast; maybe you are experiencing some TTFX (time to first execution, i.e., compilation latency)?
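
To separate compilation latency from steady-state runtime, you can compare the first and second call (illustrative):

using StaticArrays

f(n) = rand(SVector{3, Float64}, n)

@time f(10^6)  # first call: timing includes compilation (TTFX)
@time f(10^6)  # second call: pure runtime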

I think we can help better if you provide us a code example of what you are seeing.

If you are contemplating actions like taking the cross product between matching columns of entire arrays, then what you really have is an array of arrays and not a matrix. Julia makes it much easier to use arrays-of-arrays than other tools like MATLAB and NumPy, which shoehorn these into matrices. And small SVectors are virtually unbeatable.
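
For instance, the column-wise cross product from the previous post becomes a plain broadcast once each point is an SVector (a small sketch):

using StaticArrays, LinearAlgebra

A = rand(SVector{3, Float64}, 1000)
B = rand(SVector{3, Float64}, 1000)

C = cross.(A, B)  # element-wise cross products; no slicing or mapslices needed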

Can you show us what you’ve tried? Constructing an SArray is usually very fast, so there’s likely some type instability or another aspect of your usage that could be easily improved if you can show what’s slow.

That is true, but you will end up in the generic fallback instead of calling out to a fast BLAS library:

julia> b = rand(1024);

julia> A = rand(SVector{3, Float64}, 1, 1024);

julia> @benchmark $A*$b
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.813 μs …  3.632 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.817 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.821 μs ± 31.963 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▂▂▂▁                                                    ▁
  ████████▇██▇▆▇█▆▆▄▆▅▄▄▅▄▅▄▅▅▅▁▄▄▄▅▅▃▄▃▄▄▁▅▄▁▃▄▁▄▃▃▃▄▄▁▁▄▁▃ █
  2.81 μs      Histogram: log(frequency) by time     2.93 μs <

 Memory estimate: 80 bytes, allocs estimate: 1.

julia> @benchmark $(reshape(reinterpret(Float64, A),3,1024)) *$b
BenchmarkTools.Trial: 10000 samples with 166 evaluations.
 Range (min … max):  640.482 ns …   9.991 μs  ┊ GC (min … max): 0.00% … 90.81%
 Time  (median):     644.096 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   652.145 ns ± 159.082 ns  ┊ GC (mean ± σ):  0.41% ±  1.57%

    ▄▇█▆▅▄▃▂           ▂▃▃▄▃▃▃▂▂▃▂▂▂▁                           ▂
  ▅██████████▇▆▅▅▅▅▄▃▅▇██████████████████▇▇▇▇▅▆▆▅▆▇▆▆▆▅▆▅▄▅▄▅▄▅ █
  640 ns        Histogram: log(frequency) by time        683 ns <

 Memory estimate: 80 bytes, allocs estimate: 1.

I hope that in the long-run we can have a fast parallel BLAS implementation that “just works”, but we are not there yet.

One size does not fit all:

julia> using StaticArrays

julia> using BenchmarkTools

julia> for n in (10,100,1000,10000,100000)
               b = randn(n);
               A = randn(SVector{3,Float64},size(b));
               @show n
               @btime *($b',$A)
               @btime *(reinterpret(reshape,Float64,$A),$b)
       end
n = 10
  12.012 ns (0 allocations: 0 bytes)
  67.111 ns (1 allocation: 80 bytes)
n = 100
  95.263 ns (0 allocations: 0 bytes)
  122.783 ns (1 allocation: 80 bytes)
n = 1000
  914.583 ns (0 allocations: 0 bytes)
  787.500 ns (1 allocation: 80 bytes)
n = 10000
  9.100 μs (0 allocations: 0 bytes)
  1.967 μs (1 allocation: 80 bytes)
n = 100000
  89.500 μs (0 allocations: 0 bytes)
  17.900 μs (1 allocation: 80 bytes)

(by the way, this is an excellent place for the reinterpret(reshape,...) method).

On my machine, up to roughly 500 SVector{3,Float64} the Vector{SVector} approach is faster (and up to ~100 it can be much faster). But you are correct that for large arrays the parallelism of BLAS enabled by reinterpret eventually overtakes the specialized SVector kernels, plateauing around 5x faster on my machine.

My point was mostly that reinterpret gymnastics are not strictly necessary. It’s possible to use many operations on the Array{SArray} form and sometimes this may be attractive for clarity reasons (and performance reasons, at modest sizes).

So, after some testing, I’ve realized that only the first call of a huge SArray construction was slow, which presumably was the compilation latency (TTFX) mentioned above.
Well, in retrospect, the best option so far in my performance testing has been a Vector{SVector} for the manipulations I need in my project. It turned out to be very convenient as well, and even faster than the NumPy equivalent, so I’m satisfied. Thanks a lot for all the answers, they were really helpful! And seeing such an active community has made me enjoy the language a tad more.

An important point about StaticArrays, which it seems that you’ve caught on to: One reason they are fast is because most functions on them are compiled for precisely the size they are and with full loop unrolling. Note that this means that every differently sized StaticArray must be compiled individually. This is the most performant runtime option (mostly… very large code sizes resulting from large StaticArray operations can eventually hurt performance), but also means that large StaticArrays require a lot of work from the compiler.

A rough rule of thumb is that an array up to ~100 elements is a good candidate for StaticArrays, but for large arrays (especially >500 elements) the compiler effort becomes substantial (crushing, eventually) while the runtime benefit simultaneously vanishes. I would elaborate to say that this guideline should be more like 100 “primitive” elements (such as individual numbers), and that array-like elements might better be counted as their full size.

This is the reason that a bunch of 3-element vectors work well as a Vector{SVector{3,T}}. The compiler can hyper-optimize all the little SVector{3,T} operations while the outer Vector allows for arbitrary-length looping with minimal compiler effort and code.

One cool thing is that, with StaticArrays, you can use

struct Point{T} <: FieldVector{3,T}
  x::T
  y::T
  z::T
end

v = [ rand(Point{Float64}) for _ in 1:1000 ]

and besides having all the facilities of SVectors, you can additionally access the coordinates by name, i.e., v[1].x, etc.
