I come from a background in Python and C++ and I'm currently working on a project in Julia for performance comparisons. My task is to re-implement a scheme from Python in Julia and perform a performance analysis. In this Python scheme, most of the operations involve the numpy.ndarray type: operations such as row extraction, cross products, and inner products. However, I've noticed that Julia uses a column-major layout for Arrays, which has raised some doubts about the best way to lay out the data used in the original scheme.
For example: in the Python version, an array of coordinates for n points is used, with the array having a shape of (n, 3). Frequently, specific point coordinates are extracted, and operations like addition, subtraction, and cross products are performed on them. This is straightforward in Python due to numpy's row-major nature. However, in Julia, I'm unsure about the best approach. Should I use a matrix of shape (n, 3), or maybe one of (3, n)? Is Vector{Vector{T}} much slower to deal with than matrices if I'm not doing a lot of specific matrix operations?
After conducting some initial testing, I found it much easier to work with Vector{Vector{T}} in Julia for these cases, since the syntax for row extraction and ranges is really close to the Python one. However, I'm concerned that this approach may come with a significant performance cost. I couldn't find much information about this topic online or in the documentation, so I'm hoping that someone here can shed some light on the issue.
The canonical way is (3, n) or using an inner datatype:
using StaticArrays
data = rand(SVector{3, Float64}, 100)  # 100 points, each a static 3-vector
data[10][3]                            # third coordinate of the 10th point
Since Julia can store immutable structs inline, this is effectively the same as (3, n), with the added benefit of carrying semantic information (each element is a 3-vector that represents a point). You could also write your own struct Point.
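For instance, a minimal sketch of such a custom type (the name Point and its fields are just illustrative); subtyping FieldVector from StaticArrays gives it the usual static-vector behavior:

using StaticArrays

# Hypothetical custom point type; subtyping FieldVector makes it act like an SVector.
struct Point{T} <: FieldVector{3, T}
    x::T
    y::T
    z::T
end

points = [Point(rand(), rand(), rand()) for _ in 1:100]
points[10].z           # named-field access to the third coordinate
points[1] + points[2]  # the usual static-vector arithmetic still applies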
Vector{Vector{T}} is indeed quite a bit slower, since it involves one additional dereference for everything: each inner Vector is a separate heap-allocated object, so the coordinates are not stored contiguously.
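To make that concrete, here's an illustrative micro-benchmark sketch (the exact numbers will vary by machine):

using StaticArrays, BenchmarkTools

vv  = [rand(3) for _ in 1:10_000]       # Vector{Vector{Float64}}: one heap object per point
vsv = rand(SVector{3,Float64}, 10_000)  # Vector{SVector}: a single contiguous buffer

# The same reduction over both layouts; the SVector version avoids the
# per-element pointer chase.
@btime sum(p -> p[1] + p[2] + p[3], $vv)
@btime sum(p -> p[1] + p[2] + p[3], $vsv)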
Lastly, it is sometimes interesting to experiment with different data layouts, and the Julia package StructArrays lets you do a transparent transform from an Array-of-Structs to a Struct-of-Arrays representation.
using StaticArrays
using StructArrays
data = rand(SVector{3, Float64}, 100) # AoS
data = StructArray(data) # now in SoA form
That's very interesting! Is this StaticArray type compatible with broadcasted "." operations and linear algebra operations like cross and inner products?
julia> using StaticArrays, LinearAlgebra
julia> x = SVector(1,2,3)
3-element SVector{3, Int64} with indices SOneTo(3):
1
2
3
julia> y = SVector(0,0,1)
3-element SVector{3, Int64} with indices SOneTo(3):
0
0
1
julia> cross(x, y)
3-element SVector{3, Int64} with indices SOneTo(3):
2
-1
0
julia> x'*y
3
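Broadcasting works too: each SVector counts as a single element of the outer array, so element-wise operations apply directly. A small sketch:

using StaticArrays, LinearAlgebra

xs = rand(SVector{3,Float64}, 100)
ys = rand(SVector{3,Float64}, 100)

cross.(xs, ys)                     # one cross product per pair of points
dot.(xs, ys)                       # one inner product per pair
xs .+ Ref(SVector(1.0, 0.0, 0.0))  # Ref treats the offset as a scalar under broadcasting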
Not so! Many operations still work. You just need the outer Array to have a compatible size (notice I added a leading dimension to A):
julia> A = rand(SVector{3, Float64}, 1, 100); b = rand(100);
julia> A*b
1-element Vector{SVector{3, Float64}}:
[24.835637295742075, 22.17836643473468, 21.530191642823027]
julia> only(A*b)
3-element SVector{3, Float64} with indices SOneTo(3):
24.835637295742075
22.17836643473468
21.530191642823027
Alternatively (avoiding the allocation):
julia> A = rand(SVector{3, Float64}, 100); b = rand(100);
julia> b'*A
3-element SVector{3, Float64} with indices SOneTo(3):
23.455389909261946
23.468785998190274
22.176127262115138
Other operations, like block-matrix multiplication (i.e., between two Matrix{SMatrix{3,3,Float64,9}} values), also work.
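For example (a quick sketch with arbitrary block counts):

using StaticArrays

# Block matrices: every element is itself a static 3x3 block.
A = rand(SMatrix{3,3,Float64,9}, 4, 5)
B = rand(SMatrix{3,3,Float64,9}, 5, 2)
C = A * B  # 4x2 result; generic matmul multiplies and sums the 3x3 blocks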
I do still have some problems, though:
Apparently, broadcasting over rows or columns of Julia Matrices is really slow. So if I want to do something like the column-wise cross product of two (3, n) matrices A and B, doing cross.(A, B) won't work (it broadcasts element-wise over the scalars), and mapslices appears to be really slow on the CPU (and way harder to use than simply broadcasting over Vector{Vector{T}} or Vector{SVector{T}}, for that matter).
What are my options, then? Is my best option for convenience and performance to use matrices and loop over the columns? I've also noticed that construction of SVector and SMatrix is awfully slow, and the performance gains weren't that significant. Am I doing something terribly wrong?
If you are contemplating actions like taking the cross product between matching columns of entire arrays, then what you really have is an array of arrays, not a matrix. Julia makes it much easier to use arrays-of-arrays than other tools like MATLAB and NumPy, which shoehorn these into matrices. And small SVectors are virtually unbeatable.
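As a sketch of that framing, assuming your data starts life as a 3×n Matrix, you can reinterpret it as an array of SVectors without copying and then just broadcast:

using StaticArrays, LinearAlgebra

M, N = rand(3, 1000), rand(3, 1000)

# Reinterpret each column as a static 3-vector (no copy is made):
A = reinterpret(reshape, SVector{3,Float64}, M)
B = reinterpret(reshape, SVector{3,Float64}, N)

C = cross.(A, B)  # column-wise cross products via plain broadcasting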
Can you show us what you've tried? Constructing an SArray is usually very fast, so there's likely some type instability or another aspect of your usage that could be easily improved if you can show us what's slow.
julia> using StaticArrays

julia> using BenchmarkTools

julia> for n in (10, 100, 1000, 10000, 100000)
           b = randn(n);
           A = randn(SVector{3,Float64}, size(b));
           @show n
           @btime *($b', $A)                                # Vector{SVector} kernel
           @btime *(reinterpret(reshape, Float64, $A), $b)  # 3xn Matrix via BLAS
       end
n = 10
12.012 ns (0 allocations: 0 bytes)
67.111 ns (1 allocation: 80 bytes)
n = 100
95.263 ns (0 allocations: 0 bytes)
122.783 ns (1 allocation: 80 bytes)
n = 1000
914.583 ns (0 allocations: 0 bytes)
787.500 ns (1 allocation: 80 bytes)
n = 10000
9.100 μs (0 allocations: 0 bytes)
1.967 μs (1 allocation: 80 bytes)
n = 100000
89.500 μs (0 allocations: 0 bytes)
17.900 μs (1 allocation: 80 bytes)
(by the way, this is an excellent place for the reinterpret(reshape,...) method).
On my machine, up to roughly 500 SVector{3,Float64} elements the Vector{SVector} approach is faster (and up to ~100 it can be much faster). But you are correct that for large arrays the BLAS parallelism enabled by reinterpret eventually overtakes the specialized SVector kernels, plateauing at around a 5x advantage on my machine.
My point was mostly that reinterpret gymnastics are not strictly necessary. It’s possible to use many operations on the Array{SArray} form and sometimes this may be attractive for clarity reasons (and performance reasons, at modest sizes).
So, after some testing, I've realized that only the first call of a huge SArray construction was slow. Could that indicate the cost was compilation rather than the construction itself?
Well, in retrospect, the best option so far in my performance testing was to use a Vector{SVector} type for the manipulations I need in my project. It turned out to be very convenient as well, and even faster than the NumPy equivalent, so I'm satisfied. Thanks a lot for all the answers; they were really helpful! And seeing such an active community has made me enjoy the language a tad more.
An important point about StaticArrays, which it seems you've caught on to: one reason they are fast is that most functions on them are compiled for precisely the size they are, with full loop unrolling. Note that this means every differently sized StaticArray must be compiled individually. This is the most performant runtime option (mostly... very large code sizes resulting from large StaticArray operations can eventually hurt performance), but it also means that large StaticArrays require a lot of work from the compiler.
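You can see this directly (an illustrative sketch; timings vary by machine):

using StaticArrays

# Each distinct static size is a distinct type, so each size compiles anew:
@time SVector{100}(rand(100));  # first call: dominated by compilation
@time SVector{100}(rand(100));  # second call: the actual runtime cost
@time SVector{101}(rand(101));  # a new size is a new type, so it compiles again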
A rough rule of thumb is that an array of up to ~100 elements is a good candidate for StaticArrays, but for larger arrays (especially >500 elements) the compiler effort becomes substantial (crushing, eventually) while the runtime benefit simultaneously vanishes. I would elaborate by saying that this guideline should be more like 100 "primitive" elements (such as individual numbers), and that array-like elements might better be counted at their full size.
This is the reason that a bunch of 3-element vectors works well as a Vector{SVector{3,T}}: the compiler can hyper-optimize all the little SVector{3,T} operations, while the outer Vector allows arbitrary-length looping with minimal compiler effort and code.
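Putting it together, a minimal sketch of that recommended layout:

using StaticArrays, LinearAlgebra

# Outer Vector: arbitrary length, nearly free for the compiler.
points = rand(SVector{3,Float64}, 10_000)

# Inner SVector{3}: each operation below is a fully unrolled, fixed-size kernel.
centroid = sum(points) / length(points)
spread   = sum(p -> norm(p - centroid)^2, points) / length(points)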