Stop allocating memory when replacing structs in for-loops

I’m working on a project where I have to simulate many systems and track some observable across them. When I create new systems inside a for-loop, I end up with O(n_trials) allocations that aren’t necessary. Here is my very contrived MWE:


mutable struct InputData{T}
    couplings::Vector{T}
    constant::Float64
end

mutable struct OutputData{T}
    couplings::Vector{T}
    constant::Float64

    OutputData(N) = new{Float64}(Vector{Float64}(undef, N), 0.0)
end

N = 1_000
input = InputData(rand(N), 0.5)

function solve_system(input::InputData)
    N = length(input.couplings)
    # Make new output 
    out = OutputData(N)
    # To solve the system, out.couplings
    # is the ordered input.couplings plus some randomness
    out.couplings .= (input.couplings .+ rand(N)) #should take about 3μs
    sort!(out.couplings) #should take about 3μs
    # out.constant is the penultimate input coupling
    out.constant = input.couplings[N-1]
    out
end

# Now I would like to generate many systems, 
# and measure the average of the second coupling

function observe_second_coupling(trials, N)
    second_coupling_mean = 0.0
    input = InputData(rand(N), 0.5)
    for trial in 1:trials
        output = solve_system(input) #This is the problematic point 
        second_coupling_mean += output.couplings[2] / trials
    end
    second_coupling_mean
end

using BenchmarkTools
@benchmark solve_system($input)
@benchmark observe_second_coupling(1000, $N)

Solving one system takes about 30μs on my machine. The solving arithmetic adds up to about 6μs, so I think a lot of the execution time comes from allocations. Running the observable code takes about trials × 30μs, so I suspect that I’m paying to reallocate output on every trial. For the full code I’m actually running N = 2000, ntrials = 1e6, and it takes about an hour, which is a pain. I would like to avoid this allocation over the ntrials loop and hopefully get a speed-up.

The problem seems to be the line output = solve_system(input). It is type stable and size stable given the input, so it should be possible to just overwrite the memory used by the old output system and thus avoid any allocations. I don’t know how to do this in Julia yet. Some ideas are:

  1. Use a mutable buffer struct that is passed into a mutating solve_system! function (see the sketch after this list). This seems to work, but gets very messy, and means maintaining mutating and non-mutating versions of solve_system in parallel.
  2. Pass everything around as arrays to be mutated, which seems to be an even less coherent version of 1.
  3. Find a fancy package that fixes the problem - I looked at Accessors.jl and Setfield.jl but I don’t think they work like this.
  4. Find a way of telling Julia enough about the memory requirements of OutputData.new() that it goes “oh this is fine we can reuse this memory”.
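For reference, here is roughly what option 1 looks like (my sketch; solve_system! and observe_second_coupling_buffered are hypothetical names, not existing functions):

function solve_system!(out::OutputData, input::InputData)
    N = length(input.couplings)
    out.couplings .= input.couplings .+ rand(N) # rand(N) still allocates (addressed later in the thread)
    sort!(out.couplings)
    out.constant = input.couplings[N-1]
    return out
end

function observe_second_coupling_buffered(trials, N)
    second_coupling_mean = 0.0
    input = InputData(rand(N), 0.5)
    buffer = OutputData(N) # allocated once, reused on every trial
    for trial in 1:trials
        output = solve_system!(buffer, input)
        second_coupling_mean += output.couplings[2] / trials
    end
    second_coupling_mean
end

This works, but as noted above it means keeping a mutating twin of every solver function.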

I think 4 would be my preferred solution. Does anyone have any ideas? TIA!

Just make InputData and OutputData immutable, and construct new ones (that share couplings). You’ll also need a two-argument constructor for OutputData; see the follow-up below.

function solve_system(input::InputData)
    N = length(input.couplings)
    # Make new output 
    out = OutputData(N)
    # To solve the system, out.couplings
    # is the ordered input.couplings plus some randomness
    out.couplings .= (input.couplings .+ rand(N)) #should take about 3μs
    sort!(out.couplings) #should take about 3μs
    # out.constant is the penultimate input coupling
    out = OutputData(out.couplings, input.couplings[N-1])
    out
end

Well, you can allocate OutputData as a global variable and directly reuse that mutable struct. It’s then your responsibility to ensure that you don’t overwrite the struct while it’s in use…
A simple solution looks like this:

const GlobalCache = OutputData(0)
const isGlobalCacheAllocated = Ref{Bool}(false)
function allocateOne()
    @assert !isGlobalCacheAllocated[]
    isGlobalCacheAllocated[] = true
    return GlobalCache
end

function deallocate()
    @assert isGlobalCacheAllocated[]
    isGlobalCacheAllocated[] = false
end
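
Usage could then look something like this (my sketch; solve_system_cached is a hypothetical name, and it assumes GlobalCache was constructed at the right size, e.g. OutputData(N) rather than OutputData(0)):

function solve_system_cached(input::InputData)
    out = allocateOne() # claim the shared buffer
    out.couplings .= input.couplings .+ rand(length(input.couplings))
    sort!(out.couplings)
    out.constant = input.couplings[end-1]
    return out
end

output = solve_system_cached(input)
second = output.couplings[2] # read what you need...
deallocate()                 # ...then release the buffer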

Thanks for this @Oscar_Smith! This seems to get rid of one of the allocs in the solve step:

struct SInputData{T}
    couplings::Vector{T}
    constant::Float64
end

struct SOutputData{T}
    couplings::Vector{T}
    constant::Float64

    SOutputData(N) = new{Float64}(Vector{Float64}(undef, N), 0.0)
    SOutputData(C, K) = new{Float64}(C, K)
end


N = 1_000
input = SInputData(rand(N), 0.5)

function solve_system(input::SInputData)
    N = length(input.couplings)
    # Make new output 
    out = SOutputData(N)
    # To solve the system, out.couplings
    # is the ordered input.couplings plus some randomness
    out.couplings .= (input.couplings .+ rand(N)) #should take about 3μs
    sort!(out.couplings) #should take about 3μs
    # out.constant is the penultimate input coupling
    out = SOutputData(out.couplings, input.couplings[N-1])
    out
end

But I still get ~2000 allocs for ntrials = 1000, i.e. two per trial: the undef vector inside SOutputData(N) and rand(N).

Your other allocation is rand(N), which you can replace with rand.() to remove the allocation.

You can use a broadcasting trick here that avoids allocations:

out.couplings .= (input.couplings .+ rand.())

Notice the dot in rand.().
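
Putting the suggestions together, the solve step ends up looking something like this (a sketch of the combined changes, not a tested version):

function solve_system(input::SInputData)
    N = length(input.couplings)
    out = SOutputData(N) # the one remaining allocation per call
    # Fused broadcast: rand.() draws one number per element, no temporary array
    out.couplings .= input.couplings .+ rand.()
    sort!(out.couplings)
    SOutputData(out.couplings, input.couplings[N-1])
end

That leaves one allocation per trial (the undef vector inside SOutputData(N)); if that still matters, the buffer-passing idea 1 above removes it too.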