Memory best practices in Julia with Arrays vs Vectors

ckawa · August 17, 2024, 1:59pm

I’m transitioning from MATLAB to Julia and I’m wondering if there are any best-practices for optimizing for memory and computational time when writing simulations.

Specifically, I’m designing a simple verlet integration for the motion of N particles where current_positions is a N by 3 matrix where each row represents particle n and each column represents the x,y,z position respectively.

A simplified form looks like this:

# Initialize a 3D array to hold number_of_steps slices of positions
number_of_steps = # some integer value
N = # number of particles
positions = Array{Float64}(undef, N, 3, number_of_steps)
current_positions = # N by 3 matrix of initial positions
steps = [] 
for current_step = 1:number_of_steps
   
   # Update the positions
   current_positions =  # some scheme of "current_positions"

   # Update the list of positions and steps
   steps = push!(steps, current_step)
   positions[ :, :, current_step] = current_positions
end

What’s left is an array of N by 3 matricies where each “slice” is the data at a tilmestep. This works for what I need, as my intent is to have access to all the positions of all the particles for each time step. But I’m wondering if there is a way I could optimize the way I store the positions at each step?

The advantage to this is that I can slice along any dimension to get the positions at any step with simple indexing, but I assume that as my simulation gets large, the array can become very large and consume a significant amount of memory. This is because, as I understand it, an array is contiguous in memory.

The other options I could think of is making a vector of matrices Vector{Matrix{Float64}}() where each matrix in that vector is not continuous in memory. So things would prevent a single address in memory from become too large?

Thank you for your patience with educating me on this.

Tarny_GG_Channie · August 17, 2024, 2:27pm

Julia is column-major, meaning that the memory is contiguous on the columns, so A[1,x] and A[2,x] would be adjacent to one another. With that, you could design the system as appropriate. Since the xyz are fixed, you could also make a struct or use a static array for it.

ckawa · August 17, 2024, 2:53pm

Thanks for the reply!

Since Julia is column-major, a 3x3 matrix A in Julia:

A = [x1 y1 z1; 
     x2 y2 z2; 
     x3 y3 z3]

would be stored in memory in column-major order:

Memory Layout:  x1, x2, x3, y1, y2, y3, z1, z2, z3

Please correct my understanding if I’m wrong: if my interaction of the data is primarily calling the (x, y, z) position of particle “n”, I would want those values next to each other in memory. So I’d want to organize “A” as

A = [x1 x2 x3; 
     y1 y2 y3; 
     z1 z2 z3]

So the memory becomes:

Memory Layout:  x1, y1, z1, x2, y2, z2, x3, y3, z3

So all the (x,y,z) are adjacent in memory?

mrufsvold · August 17, 2024, 3:02pm

You are correct!

Also,

struct Position{T}
    x::T
    y::T 
    Z::T 
end

positions = Position{Float64}[
    Position(1.0,2.0,3.0),
# etc
]

This will have x,y,z next to each other in memory. You also get the benefit of indexing with positions[n].y.

DNF · August 17, 2024, 3:32pm

Also check out StaticArrays.jl

eldee · August 17, 2024, 3:37pm

Chances are you’re already aware of this, but if not, take a look at the Performance Tips in the documentation.

Another way to think about this (which at least for me helps with higher-dimensional Arrays), is that you want to arrange the axes of an Array in the order of decreasing increment frequency. For example, with x = rand(100, 200, 300),

for i = axes(x, 1)  # (1:100)
    for j = axes(x, 2)  # (1:200)
        for k = axes(x, 3)  # (1:300)
            x[i, j, k] *= 2
        end
    end
end

where the right-most index k changes most frequently, is slower than

for k = axes(x, 3)  # (1:300)
    for j = axes(x, 2)  # (1:200)
        for i = axes(x, 1)  # (1:100)
            x[i, j, k] *= 2
        end
    end
end

where the left-most index i changes the fastest. On my machine the timings are 20.844 ms versus 2.664 ms.
In your case that probably means you want positions = Array{Float64}(undef, 3, N, number_of_steps). Or, as already highlighted by Tarny_GG_Channie, mrufsvold and DNF, you could use Array{Position{Float64}}(undef, N, number_of_steps) or Array{SVector{3, Float64}}(undef, N, number_of_steps). All these variants have the same memory layout. Note that Position and SVector are not mutable. If mutability is desired, you can use mutable struct Position or MVector, though this might come with a performance penalty.

You might have just misworded your thoughts here, but each Matrix{Float64} would still be contiguous in memory. Adjacent matrices (in terms of index in the Vector) need not be adjacent in memory though.

DNF · August 17, 2024, 4:08pm

So is Matlab, actually.

(Actually, Julia Array is column major. There’s nothing intrinsically column-major about JuliaLang.)

IlianPihlajamaa · August 17, 2024, 4:56pm

In my own MD code, I store the particle positions (and velocities) as a Vector{SVector{d, Float}}, where d is the dimensionality. I think Molly.jl does something similar. I found this is the best compromise between ease of use and performance.

Then, every time i want the data saved (which definitely is not every time step), I save them to disk (not memory) in an HDF5 file with JLD2.jl. This circumvents the large memory use issue you mention, and, if you save infrequently enough, should also not impact performance.

eldee · August 17, 2024, 5:04pm

Note that you don’t need the steps = , and more importantly, that in your simplified code this boils down to steps = collect(1:number_of_steps), i.e. at the end you get steps = [1, 2, ..., number_of_steps] . If in the actual code you do something more involved, presumably you could still preallocate steps using number_of_steps.

ckawa · August 17, 2024, 5:08pm

You are correct!

This option would very nice to access the (x,y,z) coordinates once the loop (simulation) is finished. Thank you!

mkitti · August 17, 2024, 5:36pm

This might be a problem as it creates a Any[]. You might want

steps = Int[]

Notice the difference of types below.

julia> steps = []
Any[]

julia> push!(steps, 1)
1-element Vector{Any}:
 1

julia> push!(steps, 2)
2-element Vector{Any}:
 1
 2

julia> typeof(steps)
Vector{Any} (alias for Array{Any, 1})

julia> steps = Int[]
Int64[]

julia> push!(steps, 1)
1-element Vector{Int64}:
 1

julia> push!(steps, 2)
2-element Vector{Int64}:
 1
 2

julia> typeof(steps)
Vector{Int64} (alias for Array{Int64, 1})

ckawa · August 17, 2024, 7:25pm

That’s interesting! So each SVector{d,Float} is the (x,y,z,vx,vy,vz) for each particle? and the “outer” vector in Vector{SVector{d,Float}} would be the vector of positions for some particle across all time?

IlianPihlajamaa · August 17, 2024, 8:09pm

So the vector r would contain the positions of all particles at the current point in time. For example,

r = [
SVector(x1, y2),
SVector(x2, y2),
SVector(x3, y3),
]

would represent three particles in two dimensions. I would have similar vectors v and F for the velocities and forces. Using SVectors gives you very convenient syntax. For example, the kinetic energy of particle 2 is 0.5m*sum(v[2].^2), and this would be a very performant implementation.

By choosing to save to disk instead of to some big trajectory variable I lose the ability to access directly the positions of particles at some earlier time during the simulation. Usually that is not necessary. Then, when you need to do some analysis afterwards, you read everything from disk and construct this big array containing the trajectories of all particles at all time frames.

Edit: of course, if you do prefer it, you could also have a vector of vectors of SVectors that you push these to and keep them in memory, but in that case you must be sure you have the required memory availability to store all of it.

ckawa · August 17, 2024, 11:23pm

Thanks for sharing! I must be missing something about SVector’s because I did a very simple bench test:

M = [
SVector(1, 2),
SVector(1.1, 2.2),
SVector(1.11, 2.22)
]

n_steps = 1E5

function bench_svector()
    for step in 1:n_steps
        delta_M = step / n_steps
        out =  M .+ (SVector(delta_M, delta_M),)
    end
end

and I get something around 106.523 ms (1099490 allocations: 38.14 MiB)

Then I try

M = [
    1.0  2.0;
    1.1  2.2;
    1.11 2.22
]
n_steps = 1E5

function bench_matrix()
    for step in 1:n_steps
        delta_M = step / n_steps
        out = M .+ delta_M
    end
end

and I get something like 13.652 ms (599490 allocations: 21.35 MiB)

Does the benefit of SVector’s come from larger scales?

Oscar_Smith · August 17, 2024, 11:28pm

Don’t benchmark with non-const global variables.

function bench_svector(M, n_steps)
    for step in 1:n_steps
        delta_M = step / n_steps
        out =  M .+ (SVector(delta_M, delta_M),)
    end
end
function bench_matrix(M, n_steps)
    for step in 1:n_steps
        delta_M = step / n_steps
        out = M .+ delta_M
    end
end

julia> @btime bench_svector(M, 10^5)
  2.988 ms (100000 allocations: 10.68 MiB)
julia> @btime bench_matrix(M, 10^5)
  3.254 ms (100000 allocations: 10.68 MiB)

ckawa · August 17, 2024, 11:34pm

Makes sense thank you!

PeterSimon · August 17, 2024, 11:41pm

Also, for future reference, note that the above produces a Float64 quantity:

julia> n_steps = 1E5
100000.0

julia> typeof(n_steps)
Float64

You can use the following instead to create an Int:

julia> n_steps = 10^5
100000

lmiq · August 17, 2024, 11:42pm

You may find this notebook useful depending on the type of simulation you are planning to do: https://m3g.github.io/2021_FortranCon/

DNF · August 18, 2024, 6:20am

Any advantage to using SVectors will completely drown in the enormous amount of array allocations going on here. (And I wouldn’t expect this particular example to benefit much from StaticArrays anyway, since a scalar is at least as fast as an SVector).

eldee · August 18, 2024, 8:29am

To give a simple example where a Vector{SVector{3, Float64}} of length N is much faster than a 3 \times N Matrix, consider

using StaticArrays, BenchmarkTools

# Translate each point in in_v over the common translation vector x_v, 
# and store the result in the appropriate location of out_v
function bench_vec_svector!(out_v, in_v, x_v)
    for i = eachindex(out_v)
        out_v[i] = in_v[i] .+ x_v  # (simply + also works, and is equally fast)
    end
end

# Same, but points are stored in columns of matrices in_m and out_m
function bench_matrix!(out_m, in_m, x_m)
    for i = axes(out_m, 2)
        out_m[:, i] .= @view(in_m[:, i]) .+ x_m
    end
end


N = 10^5

in_v = rand(SVector{3, Float64}, N)
out_v = similar(in_v)
x_v = rand(SVector{3, Float64})

@btime bench_vec_svector!(out_v, in_v, x_v);
#   86.700 μs (0 allocations: 0 bytes)

in_m = rand(3, N)
out_m = similar(in_m)
x_m = rand(3)

@btime bench_matrix!(out_m, in_m, x_m);
#   690.000 μs (0 allocations: 0 bytes)

Because the compiler knows the size of each SVector it can unroll the inner broadcasting loop, reducing overhead. The equivalent Matrix version would be

function bench_matrix_unrolled!(out_m, in_m, x_m)
    for i = axes(out_m, 2)
        out_m[1, i] = in_m[1, i] + x_m[1]
        out_m[2, i] = in_m[2, i] + x_m[2]
        out_m[3, i] = in_m[3, i] + x_m[3]
    end
end

@btime bench_matrix_unrolled!(out_m, in_m, x_m);
#  120.900 μs (0 allocations: 0 bytes)

I’m not sure why the SVector version is faster still, but hey, it sure does illustrate the point that Vector{SVector} can indeed be much more performant than Matrix.

Topic		Replies	Views
Learning to optimize Performance	12	1229	March 15, 2021
Design of N dimensional structure for Particle General Usage	4	499	January 9, 2021
Julia style? Arrays/matrix vs structs General Usage	19	3880	December 18, 2019
Quick Longitudinal Access to arrays General Usage	34	2261	March 3, 2018
Usage of arrays of static arrays New to Julia performance , staticarrays	16	1097	February 22, 2023

Memory best practices in Julia with Arrays vs Vectors

Related topics