Memory best practices in Julia with Arrays vs Vectors

I’m transitioning from MATLAB to Julia and I’m wondering if there are any best-practices for optimizing for memory and computational time when writing simulations.

Specifically, I’m designing a simple verlet integration for the motion of N particles where current_positions is a N by 3 matrix where each row represents particle n and each column represents the x,y,z position respectively.

A simplified form looks like this:

# Initialize a 3D array to hold number_of_steps slices of positions
number_of_steps = # some integer value
N = # number of particles
positions = Array{Float64}(undef, N, 3, number_of_steps)
current_positions = # N by 3 matrix of initial positions
steps = [] 
for current_step = 1:number_of_steps
   
   # Update the positions
   current_positions =  # some scheme of "current_positions"

   # Update the list of positions and steps
   steps = push!(steps, current_step)
   positions[ :, :, current_step] = current_positions
end

What’s left is an array of N by 3 matricies where each “slice” is the data at a tilmestep. This works for what I need, as my intent is to have access to all the positions of all the particles for each time step. But I’m wondering if there is a way I could optimize the way I store the positions at each step?

The advantage to this is that I can slice along any dimension to get the positions at any step with simple indexing, but I assume that as my simulation gets large, the array can become very large and consume a significant amount of memory. This is because, as I understand it, an array is contiguous in memory.

The other options I could think of is making a vector of matrices Vector{Matrix{Float64}}() where each matrix in that vector is not continuous in memory. So things would prevent a single address in memory from become too large?

Thank you for your patience with educating me on this.

2 Likes

Julia is column-major, meaning that the memory is contiguous on the columns, so A[1,x] and A[2,x] would be adjacent to one another. With that, you could design the system as appropriate. Since the xyz are fixed, you could also make a struct or use a static array for it.

1 Like

Thanks for the reply!

Since Julia is column-major, a 3x3 matrix A in Julia:

A = [x1 y1 z1; 
     x2 y2 z2; 
     x3 y3 z3]

would be stored in memory in column-major order:

Memory Layout:  x1, x2, x3, y1, y2, y3, z1, z2, z3

Please correct my understanding if I’m wrong: if my interaction of the data is primarily calling the (x, y, z) position of particle “n”, I would want those values next to each other in memory. So I’d want to organize “A” as

A = [x1 x2 x3; 
     y1 y2 y3; 
     z1 z2 z3]

So the memory becomes:

Memory Layout:  x1, y1, z1, x2, y2, z2, x3, y3, z3

So all the (x,y,z) are adjacent in memory?

You are correct!

Also,

struct Position{T}
    x::T
    y::T 
    Z::T 
end

positions = Position{Float64}[
    Position(1.0,2.0,3.0),
# etc
]

This will have x,y,z next to each other in memory. You also get the benefit of indexing with positions[n].y.

2 Likes

Also check out StaticArrays.jl

Chances are you’re already aware of this, but if not, take a look at the Performance Tips in the documentation.

Another way to think about this (which at least for me helps with higher-dimensional Arrays), is that you want to arrange the axes of an Array in the order of decreasing increment frequency. For example, with x = rand(100, 200, 300),

for i = axes(x, 1)  # (1:100)
    for j = axes(x, 2)  # (1:200)
        for k = axes(x, 3)  # (1:300)
            x[i, j, k] *= 2
        end
    end
end

where the right-most index k changes most frequently, is slower than

for k = axes(x, 3)  # (1:300)
    for j = axes(x, 2)  # (1:200)
        for i = axes(x, 1)  # (1:100)
            x[i, j, k] *= 2
        end
    end
end

where the left-most index i changes the fastest. On my machine the timings are 20.844 ms versus 2.664 ms.
In your case that probably means you want positions = Array{Float64}(undef, 3, N, number_of_steps). Or, as already highlighted by Tarny_GG_Channie, mrufsvold and DNF, you could use Array{Position{Float64}}(undef, N, number_of_steps) or Array{SVector{3, Float64}}(undef, N, number_of_steps). All these variants have the same memory layout. Note that Position and SVector are not mutable. If mutability is desired, you can use mutable struct Position or MVector, though this might come with a performance penalty.

You might have just misworded your thoughts here, but each Matrix{Float64} would still be contiguous in memory. Adjacent matrices (in terms of index in the Vector) need not be adjacent in memory though.

1 Like

So is Matlab, actually.

(Actually, Julia Array is column major. There’s nothing intrinsically column-major about JuliaLang.)

2 Likes

In my own MD code, I store the particle positions (and velocities) as a Vector{SVector{d, Float}}, where d is the dimensionality. I think Molly.jl does something similar. I found this is the best compromise between ease of use and performance.

Then, every time i want the data saved (which definitely is not every time step), I save them to disk (not memory) in an HDF5 file with JLD2.jl. This circumvents the large memory use issue you mention, and, if you save infrequently enough, should also not impact performance.

Note that you don’t need the steps = , and more importantly, that in your simplified code this boils down to steps = collect(1:number_of_steps), i.e. at the end you get steps = [1, 2, ..., number_of_steps] . If in the actual code you do something more involved, presumably you could still preallocate steps using number_of_steps.

You are correct!

This option would very nice to access the (x,y,z) coordinates once the loop (simulation) is finished. Thank you!

1 Like

This might be a problem as it creates a Any[]. You might want

steps = Int[]

Notice the difference of types below.

julia> steps = []
Any[]

julia> push!(steps, 1)
1-element Vector{Any}:
 1

julia> push!(steps, 2)
2-element Vector{Any}:
 1
 2

julia> typeof(steps)
Vector{Any} (alias for Array{Any, 1})

julia> steps = Int[]
Int64[]

julia> push!(steps, 1)
1-element Vector{Int64}:
 1

julia> push!(steps, 2)
2-element Vector{Int64}:
 1
 2

julia> typeof(steps)
Vector{Int64} (alias for Array{Int64, 1})
2 Likes

That’s interesting! So each SVector{d,Float} is the (x,y,z,vx,vy,vz) for each particle? and the “outer” vector in Vector{SVector{d,Float}} would be the vector of positions for some particle across all time?

So the vector r would contain the positions of all particles at the current point in time. For example,

r = [
SVector(x1, y2),
SVector(x2, y2),
SVector(x3, y3),
]

would represent three particles in two dimensions. I would have similar vectors v and F for the velocities and forces. Using SVectors gives you very convenient syntax. For example, the kinetic energy of particle 2 is 0.5m*sum(v[2].^2), and this would be a very performant implementation.

By choosing to save to disk instead of to some big trajectory variable I lose the ability to access directly the positions of particles at some earlier time during the simulation. Usually that is not necessary. Then, when you need to do some analysis afterwards, you read everything from disk and construct this big array containing the trajectories of all particles at all time frames.

Edit: of course, if you do prefer it, you could also have a vector of vectors of SVectors that you push these to and keep them in memory, but in that case you must be sure you have the required memory availability to store all of it.

Thanks for sharing! I must be missing something about SVector’s because I did a very simple bench test:

M = [
SVector(1, 2),
SVector(1.1, 2.2),
SVector(1.11, 2.22)
]

n_steps = 1E5

function bench_svector()
    for step in 1:n_steps
        delta_M = step / n_steps
        out =  M .+ (SVector(delta_M, delta_M),)
    end
end

and I get something around 106.523 ms (1099490 allocations: 38.14 MiB)

Then I try

M = [
    1.0  2.0;
    1.1  2.2;
    1.11 2.22
]
n_steps = 1E5

function bench_matrix()
    for step in 1:n_steps
        delta_M = step / n_steps
        out = M .+ delta_M
    end
end

and I get something like 13.652 ms (599490 allocations: 21.35 MiB)

Does the benefit of SVector’s come from larger scales?

Don’t benchmark with non-const global variables.

function bench_svector(M, n_steps)
    for step in 1:n_steps
        delta_M = step / n_steps
        out =  M .+ (SVector(delta_M, delta_M),)
    end
end
function bench_matrix(M, n_steps)
    for step in 1:n_steps
        delta_M = step / n_steps
        out = M .+ delta_M
    end
end

julia> @btime bench_svector(M, 10^5)
  2.988 ms (100000 allocations: 10.68 MiB)
julia> @btime bench_matrix(M, 10^5)
  3.254 ms (100000 allocations: 10.68 MiB)

Makes sense thank you!

Also, for future reference, note that the above produces a Float64 quantity:

julia> n_steps = 1E5
100000.0

julia> typeof(n_steps)
Float64

You can use the following instead to create an Int:

julia> n_steps = 10^5
100000

You may find this notebook useful depending on the type of simulation you are planning to do: https://m3g.github.io/2021_FortranCon/

Any advantage to using SVectors will completely drown in the enormous amount of array allocations going on here. (And I wouldn’t expect this particular example to benefit much from StaticArrays anyway, since a scalar is at least as fast as an SVector).

1 Like

To give a simple example where a Vector{SVector{3, Float64}} of length N is much faster than a 3 \times N Matrix, consider

using StaticArrays, BenchmarkTools

# Translate each point in in_v over the common translation vector x_v, 
# and store the result in the appropriate location of out_v
function bench_vec_svector!(out_v, in_v, x_v)
    for i = eachindex(out_v)
        out_v[i] = in_v[i] .+ x_v  # (simply + also works, and is equally fast)
    end
end

# Same, but points are stored in columns of matrices in_m and out_m
function bench_matrix!(out_m, in_m, x_m)
    for i = axes(out_m, 2)
        out_m[:, i] .= @view(in_m[:, i]) .+ x_m
    end
end


N = 10^5

in_v = rand(SVector{3, Float64}, N)
out_v = similar(in_v)
x_v = rand(SVector{3, Float64})

@btime bench_vec_svector!(out_v, in_v, x_v);
#   86.700 μs (0 allocations: 0 bytes)

in_m = rand(3, N)
out_m = similar(in_m)
x_m = rand(3)

@btime bench_matrix!(out_m, in_m, x_m);
#   690.000 μs (0 allocations: 0 bytes)

Because the compiler knows the size of each SVector it can unroll the inner broadcasting loop, reducing overhead. The equivalent Matrix version would be

function bench_matrix_unrolled!(out_m, in_m, x_m)
    for i = axes(out_m, 2)
        out_m[1, i] = in_m[1, i] + x_m[1]
        out_m[2, i] = in_m[2, i] + x_m[2]
        out_m[3, i] = in_m[3, i] + x_m[3]
    end
end

@btime bench_matrix_unrolled!(out_m, in_m, x_m);
#  120.900 μs (0 allocations: 0 bytes)

I’m not sure why the SVector version is faster still, but hey, it sure does illustrate the point that Vector{SVector} can indeed be much more performant than Matrix.

1 Like