Reducing allocations

I have a function that is going to be called many hundreds of thousands of times, so I want it as optimized as possible. To reduce allocations, I decided I would create a struct that stores some buffer arrays, and then assign values to them "in-place" to avoid repeated allocations. Here is the function:

function propose_shear_step(atoms::ase.atoms,stepsize::Float64; buffers::Union{NS_buffers,Nothing} = nothing)
    rmv_vec = rand(1:3)
#  rmv_vec = rand(Xoshiro(),1:3)
    # Don't allocate a vector of indices, a tuple is better for memory management.
    remaining_vecs = rmv_vec == 1 ? (2,3) : rmv_vec == 2 ? (1,3) : (1,2)
    new_cell = buffers !== nothing ? buffers.new_cell : MMatrix{3,3,Float64,9}(undef)
    new_cell .= atoms.lVecs
    
    v1 = SVector{3,Float64}(atoms.lVecs[:,remaining_vecs[1]])
    v2 = SVector{3,Float64}(atoms.lVecs[:,remaining_vecs[2]])
    norm_v1 = sqrt(dot(v1,v1))
    v1 = v1 / norm_v1
    v2 = v2 -  v1 * dot(v1, v2)  # Make v2 orthogonal to v1
    norm_v2 = sqrt(dot(v2,v2))
    v2 = v2/norm_v2

    if abs(dot(v1, v2)) > 1e-4
        println("v1 and v2 are not orthogonal, something is wrong.")
    end
    rv = MVector{2,Float64}(undef)
    rand!(Xoshiro(), Normal(0, stepsize), rv)
    rv1, rv2 = rv
#  rv1,rv2 = rand(Normal(0,stepsize),2)
    new_cell[:,rmv_vec] = atoms.lVecs[:,rmv_vec] + rv1 * v1 + rv2 * v2
    if buffers !== nothing
        mul!(buffers.transform, new_cell, buffers.inv_lVecs)
        return (1.0, SMatrix{3,3,Float64,9}(buffers.transform))
    else
        inv_lVecs = inv(SMatrix{3,3,Float64,9}(atoms.lVecs))
        transform = new_cell * inv_lVecs
        return (1.0,SMatrix{3,3,Float64,9}(transform))
    end
    
end

and here is what ase.atoms and NS_buffers look like:

struct NS_buffers
    new_cell :: MMatrix{3,3,Float64,9}
    transform :: MMatrix{3,3,Float64,9}
    inv_lVecs :: SMatrix{3,3,Float64,9}
    
end

mutable struct atoms
    title::String
    latpar::Float64
    lVecs::MMatrix{3,3,Float64,9}
    coordSys::Vector{String} # 'D' for Direct or 'C' for cartesian
    positions:: Vector{SVector{3,Float64}}  # List of all the atomic basis vector separated by type 
    velocities:: Vector{SVector{3,Float64}} # Velocities of the atoms
    masses:: Vector{Float64}
    atomTypes:: Vector{Int64} # Integers representing the type for each basis atom
    species:: Vector{String}  # Species of the atoms 
    FP_total_energy:: Float64  # First principles total or peratom energy
    model_energy:: Float64  # Energy that should be used in the fitting process
    lj_vec::Vector{Float64} # Vector of 1/r^6 and 1/r^12 values for each unique pair of atom types
end

But the results of @btime indicate that the performance is worse when I send in the optional buffers as compared to not sending buffers in at all. I’ve always had a hard time identifying allocations. Any help here? Thanks in advance

These lines allocate memory. The first one can be solved by @views, the second one by

@views @. new_cell[:,rmv_vec] = atoms.lVecs[:,rmv_vec] + rv1 * v1 + rv2 * v2

I’ve always had a hard time identifying allocations.

The allocation profiler (@profview_allocs) in the Julia VS Code plugin is very helpful.
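For quick line-by-line checks, the built-in @allocated also works, as long as you wrap the code in a function and warm it up first. A small sketch (f_slice/f_view are made-up names) showing that a plain column slice copies while a @views slice does not:

```julia
f_slice(m) = sum(m[:, 1])            # m[:, 1] copies the column -> allocates
f_view(m)  = @views sum(m[:, 1])     # a view -> no copy, no allocation

m = rand(3, 3)
f_slice(m); f_view(m)                # warm up so compilation isn't counted
@allocated f_slice(m)                # nonzero
@allocated f_view(m)                 # 0
```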

I am curious what effect a String field has on a struct's efficiency.

mutable struct atoms
    title::String

Does this force the whole struct to be on the heap, making it less efficient?

Just the fact that atoms is mutable means it has to be on the heap.
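A quick way to see the difference with toy types (the names P and Q are just for illustration):

```julia
struct P           # immutable
    x::Float64
    y::Float64
end

mutable struct Q   # mutable
    x::Float64
    y::Float64
end

isbitstype(P)   # true:  P values can live on the stack or inline in arrays
isbitstype(Q)   # false: Q instances are heap-allocated references
```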

What if it were a struct rather than a mutable struct?

I might be able to make that work, but there are float fields in there that would become immutable, and I need to be able to change them on the fly.

Maybe store FP_total_energy and model_energy in a Vector and just remember which one slots in first and which one second?
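Something along these lines (a sketch only; AtomsImmutable is a made-up name and most of your fields are trimmed out). The struct itself is immutable, but the MMatrix and Vector fields still have mutable contents:

```julia
using LinearAlgebra, StaticArrays

struct AtomsImmutable
    title::String
    lVecs::MMatrix{3,3,Float64,9}   # MMatrix contents can still be mutated
    energies::Vector{Float64}       # slot 1: FP_total_energy, slot 2: model_energy
end

a = AtomsImmutable("example", MMatrix{3,3,Float64,9}(1.0I), [0.0, 0.0])
a.energies[1] = -3.7                # update the energy without a mutable struct
a.lVecs[1, 1] = 2.0                 # the lattice vectors stay editable too
```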

Maybe it would be better if I just stepped back and asked the question: I have this function that is going to be called many, many times (I have others that will do the same, but if someone can help me see what to look for, I can apply it to the other ones too). What are some opportunities for efficiency improvements that you can see? Generally speaking, how can I know how close I am to optimal efficiency?

Originally, I thought that the repeated creation of new_cell would be a bottleneck and if I could use a pre-allocated buffer that would help things out. Not convinced that is the case now.

function propose_shear_step(atoms::ase.atoms,stepsize::Float64)
    rmv_vec = rand(1:3)
    remaining_vecs = rmv_vec == 1 ? (2,3) : rmv_vec == 2 ? (1,3) : (1,2)
    new_cell = MMatrix{3,3,Float64,9}(undef)
    new_cell .= atoms.lVecs
    
    v1 = atoms.lVecs[:,remaining_vecs[1]]
    v2 = atoms.lVecs[:,remaining_vecs[2]]
    norm_v1 = sqrt(dot(v1,v1))
    v1 = v1 / norm_v1
    v2 = v2 -  v1 * dot(v1, v2)  # Make v2 orthogonal to v1
    norm_v2 = sqrt(dot(v2,v2))
    v2 = v2/norm_v2

    if abs(dot(v1, v2)) > 1e-4
        println("v1 and v2 are not orthogonal, something is wrong.")
    end

    rv1,rv2 = rand(Normal(0,stepsize),2)
    new_cell[:,rmv_vec] = atoms.lVecs[:,rmv_vec] + rv1 * v1 + rv2 * v2
    inv_lVecs = inv(SMatrix{3,3,Float64,9}(atoms.lVecs))
    transform = new_cell * inv_lVecs
    return (1.0,SMatrix{3,3,Float64,9}(transform))

end

Did you profile it to see precisely where the bottlenecks are?

I have not had any luck using profilers and understanding the output. I’ve tried using @allocated line by line with some amount of success.

If the sizes are fixed the way they seem to be, you could simply do it all with SVectors and SMatrices without bothering with any buffers.

function propose_shear_step(atoms, stepsize::Float64)
    rmv_vec = rand(1:3)
    remaining_vecs = rmv_vec == 1 ? (2,3) : rmv_vec == 2 ? (1,3) : (1,2)

    l = SMatrix(atoms.lVecs)
    
    v1 = l[:, remaining_vecs[1]]
    v2 = l[:, remaining_vecs[2]]
    norm_v1 = sqrt(dot(v1,v1))
    v1 = v1 / norm_v1
    v2 = v2 -  v1 * dot(v1, v2)  # Make v2 orthogonal to v1
    norm_v2 = sqrt(dot(v2, v2))
    v2 = v2/norm_v2

    if abs(dot(v1, v2)) > 1e-4
        println("v1 and v2 are not orthogonal, something is wrong.")
    end

    rv1 = rand(Normal(0, stepsize))
    rv2 = rand(Normal(0, stepsize))
    
    new_vec = l[:, rmv_vec] + rv1 * v1 + rv2 * v2
    l1 = l[:, 1]
    l2 = l[:, 2]
    l3 = l[:, 3]
    if rmv_vec == 1
        new_cell = hcat(new_vec, l2, l3)
    elseif rmv_vec == 2
        new_cell = hcat(l1, new_vec, l3)
    elseif rmv_vec == 3
        new_cell = hcat(l1, l2, new_vec)
    else
        error("BAD")
    end

    inv_lVecs = inv(l)
    transform = new_cell * inv_lVecs
    return 1.0, transform

end

No allocations for this one. I think I followed the logic of your code, but I'm not 100% sure. Regardless, I think something like this should be achievable.

I removed the type annotation for atoms so I could send in a named tuple rather than a struct, but I don’t think that should matter.

I like it. I’m surprised you didn’t allocate when you created new_cell. Was that just because it was constructed out of SVectors?

atoms.lVecs is an MMatrix and if I just copy it over to new_cell and then modify it, I can get it down to 1 allocation for the copy.

I think I just need to learn a little bit more about SVectors/SMatrices. For example, why doesn't this line produce allocations:

new_vec = l[:, rmv_vec] + rv1 * v1 + rv2 * v2

Is it just because everything on the right hand side is a static array/vector?

Also, does

l = SMatrix(atoms.lVecs)

create a copy of atoms.lVecs, or is it by reference? I suppose it doesn't matter since l never gets modified, but…

Yes, new_cell does not allocate because the output is a 3x3 SMatrix, and because, by using an SVector as each argument to hcat, the output type (and size) of the SMatrix is entirely inferrable from the inputs.

The new_vec line does not allocate because everything is an SVector.

The l = SMatrix(atoms.lVecs) creates a copy, but it does not allocate because it’s an SMatrix, and the type of the SMatrix can be inferred from the type of the MMatrix.

It's possible that Julia in the future might be able to elide the allocation used for temporary MMatrices that do not escape the function in which they are used, but it seems to be a bit hit and miss at the moment.

Edit: I just realized that this may not have been clear, but generally, creating an SVector will not allocate any memory, while creating an MVector will. To avoid allocations when creating the SVector, though, one must make sure that its size is known at compile time.
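To illustrate the SVector/MVector difference (make_s/make_m are made-up helper names):

```julia
using StaticArrays

make_s() = SVector{3,Float64}(1.0, 2.0, 3.0)
make_m() = MVector{3,Float64}(1.0, 2.0, 3.0)

make_s(); make_m()       # warm up so compilation isn't measured
@allocated make_s()      # 0: an SVector is a plain immutable value
@allocated make_m()      # usually nonzero: the returned MVector escapes, so it's heap-allocated
```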

Do these help?

This surely does not do what you intend. For one, Xoshiro is a mutable struct, so this possibly causes a heap allocation:

julia> dump(Xoshiro)
mutable struct Xoshiro <: AbstractRNG
  s0::UInt64
  s1::UInt64
  s2::UInt64
  s3::UInt64
  s4::UInt64

More importantly, the semantics are surely not what you want. Instead of creating a new random-generator state each time you call rand!, why not just let it use the default random generator by omitting the argument? It might make sense to provide an RNG explicitly, but then you would want to take it as an argument to propose_shear_step as well.
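I.e., something along these lines (a sketch; draw_shear_steps is a made-up helper name, and the function is trimmed down to just the random draws):

```julia
using Random, Distributions

function draw_shear_steps(rng::AbstractRNG, stepsize::Float64)
    # Reuse the caller's RNG instead of constructing a fresh Xoshiro() per call
    rv1 = rand(rng, Normal(0, stepsize))
    rv2 = rand(rng, Normal(0, stepsize))
    return rv1, rv2
end

rng = Xoshiro(1234)          # create and seed once, reuse on every call
draw_shear_steps(rng, 0.1)
```

This also makes the sampling reproducible: the same seed gives the same sequence of proposals.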

Why not norm from LinearAlgebra?

Instead of inv and * (multiplying by the inverse), I guess just dividing directly, using /, should be more efficient.
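Roughly like this (a sketch with random 3x3 matrices; for StaticArrays matrices this small the difference may be minor, but / avoids forming the inverse explicitly):

```julia
using LinearAlgebra, StaticArrays

l        = SMatrix{3,3}(rand(3, 3))
new_cell = SMatrix{3,3}(rand(3, 3))

t1 = new_cell * inv(l)   # forms the inverse explicitly
t2 = new_cell / l        # right division: same result without the explicit inverse
t1 ≈ t2                  # true (up to floating-point error)
```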

It should be more efficient to convert to a static array before doing the multiplication/division, instead of after.

I think I've not seen this mentioned yet in the above, but you may want to replace rand(Normal(0, stepsize)) with sqrt(stepsize)*randn(); this avoids needing to create a Normal(0, stepsize) object.

My bad: Normal's second parameter is already the standard deviation, so you'd use stepsize*randn(), no square root!

Didn’t I do that:

inv_lVecs = inv(SMatrix{3,3,Float64,9}(atoms.lVecs))

Turn it into an SMatrix first, then take the inverse?

Ah, sorry, I thought that new_cell was not statically sized. Disregard, then.