Pre-allocating a structure to reduce memory?

I have a function SimData that I will call in a loop 1000s of times. I will then “call the loop” a number of times in the course of searching for a maxima. So the SimData function will be call perhaps 100,000s of times. So if any memory is allocated within it, as it currently stands, the total memory footprint of the maximization routine is potentially GBs, especially as the size of the data vectors I’m simulating is itself large (maybe 5000 x 5000 matrices).

So - I would like to avoid allocating in a way that takes up memory within SimData as much as possible. I’ve reduced a lot of it through the use of Static Vectors. However, the function uses some Vectors of Static Vectors as intermediate variables and those take up memory.

How would I maybe use a pre-allocated structure passed through the loop to SimData to avoid the loops compounding the memory used? Below is a toy example with just one variable that I would like to pre-allocate but even with just this each iteration of `SimData’ would use ~20-30k of memory.

nobs = 2500 #number of observations
nsims = 1000 # number of simulations per obs
struct SimOutputs
    eps_sim :: Vector{SVector{2,Float64}}

function DrawUnobsData(n,Σ)
    svectorscopy(rand(MvNormal([0.0, 0.0],Σ),n), Val{2}()) 

function SimData(nobs,i)
   eps_sim = Vector{SVector{2,Float64}}(undef,nobs*nobs)
   eps_sim = DrawUnobsData(nobs , [1.0 0.0; 0.0 1.0])
   Simout = SimOutputs(eps_sim)

function LoopSim(nsims,nobs)
    sim_dat = Vector{SimOutputs}(undef,nsims)
    sim_dat = [SimData(nobs,314159+i) for i i eachindex(sim_dat)]


I made nsims and nobs globals here for ease… What I would like to do is pre-allocate eps_sim in a structure (in the real code the structure would contain ~10 similar vectors of svectors) and pass these through in such a way that the memory used doesnt expand with each call to SimDat

eps_sim = Vector{SVector{2,Float64}}(undef,nobs*nobs)
   eps_sim = DrawUnobsData(nobs , [1.0 0.0; 0.0 1.0])

This doesn’t make sense, the first line allocates a vector and the second one overwrites the variable with a different one. Preallocating means to reuse the memory of an array by repeatedly overwriting it in the course of an algorithm instead of requesting new memory every time. So an array needs to be created outside of the loop and then passed in so that its memory can be reused. However it seems to me that you want to store the simulated data of each step, so you do need to allocate space for each single result (nothing can be overwritten here). However, it could potentially be faster to preallocate one big contiguous array (like a Matrix, not a Vector of Vectors) and write all your results into that, this way you would only pay the allocation time once and not in every loop step. That means, it should be much faster to allocate one array with a million elements than a thousand arrays with a thousand elements each.

I was writing kind of the same as @jules

Here is an example with a plain Array which runs already quite fast:

using Random
using Distributions
using BenchmarkTools

p = (; nobs = 2500, nsims = 1000, μ = [0.0, 0.0], Σ = [1.0 0.0; 0.0 1.0])

function simulate!(sim_data, p, sim_seed)
    Random.seed!(1234 + sim_seed)
    distr = MvNormal(p.μ, p.Σ)
    rand!(distr, sim_data[:,:])

function simulate_repetitions(p)
    sim_data = Array{Float64}(undef, 2, p.nobs, p.nsims)
    for i in axes(sim_data, 3)
        simulate!(view(sim_data, :, :, i), p, 314159 + i)

    return sim_data

@btime simulate_repetitions(p);
# 137.980 ms (13002 allocations: 76.95 MiB)

Still, some allocation to cut down so the code is not fully optimized yet :wink:

For comparison, this is the plain runtime of rand! in this setting:

sim_data = Array{Float64}(undef, 2, p.nobs * p.nsims)
@btime rand!($(MvNormal(p.μ, p.Σ)), $sim_data);
#  40.939 ms (1 allocation: 16 bytes)

I will make the central function slightly less trivial to better illustrate what I want. Now it takes two intermediate vectors of Svectors and adds them together and outputs the sum. I’d like to preallocate the intermediate vectors.

nsims = 1000 # number of simulations per obs
struct SimOutputs
    eps_sim :: Vector{SVector{2,Float64}}

function DrawUnobsData(n,Σ)
    svectorscopy(rand(MvNormal([0.0, 0.0],Σ),n), Val{2}()) 

function SimData(nobs,i)
   eps_sim = Vector{SVector{2,Float64}}(undef,nobs*nobs)
   eps_sim = DrawUnobsData(nobs , [1.0 0.0; 0.0 1.0])
   eps_add = Vector{SVector{2,Float64}}(undef,nobs*nobs)
   eps_add = DrawUnobsData(nobs , [1.0 0.0; 0.0 1.0])
   eps_sum .= SVector{2,Float64}}(eps_add .+ eps_sim)

   Simout = SimOutputs(eps_sum)

function LoopSim(nsims,nobs)
    sim_dat = Vector{SimOutputs}(undef,nsims)
    sim_dat = [SimData(nobs,314159+i) for i i eachindex(sim_dat)]


This line will probably cause many allocations:
rand(MvNormal([0.0, 0.0],Σ),n)
since you first create a 2 x n array and then convert it to an Vector{SVector}. Therefore it doesn’t add any advantage.

In my experiments I had the feeling that calling the in-place variant rand! is quite important.

I’ve checked and the allocation lines = Vector{SVector{2,Float64}}(undef,nobs*nobs) add the allocations. The call the function which contains the rand lines do not.

Could it be that the line should be Vector{SVector{2,Float64}}(undef,nobs)?

How about this? Maybe you could post the numbers you get in your case, then it is easier to compare.

using Random
using Distributions
using BenchmarkTools

nobs = 2500 #number of observationsnsims = 1000 # number of simulations per obs
nsims = 1000 # number of simulations per obs

function DrawUnobsData!(data, Σ)
    rand!(MvNormal([0.0, 0.0], Σ), data)

function SimData(nobs,i)

   distr = MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0])
   eps_sim = Array{Float64}(undef,2, nobs)
   DrawUnobsData!(eps_sim , [1.0 0.0; 0.0 1.0])
   eps_add = Array{Float64}(undef,2, nobs)
   DrawUnobsData!(eps_add , [1.0 0.0; 0.0 1.0])
   eps_sum = eps_add .+ eps_sim

   return eps_sum

function LoopSim(nsims,nobs)
    sim_dat = [SimData(nobs,314159+i) for i in 1:nsims]

@btime LoopSim(nsims,nobs);
#   277.337 ms (33002 allocations: 116.13 MiB)

Per-allocating the two vectors could be done like this, but it doesn’t seem to cut down the runtime a lot.

nobs = 2500 #number of observationsnsims = 1000 # number of simulations per obs
nsims = 1000 # number of simulations per obs

function SimData(nobs,i, cache)

   distr = MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0])

   rand!(distr, cache.A)
   rand!(distr, cache.B)
   cache.res .= cache.A .+ cache.B

   return cache.res

function LoopSim(nsims,nobs)
    cache = (;
        A = Array{Float64}(undef,2, nobs), 
        B = Array{Float64}(undef,2, nobs), 
        res = Array{Float64}(undef,2, nobs))

    sim_data = Array{Float64}(undef, 2,nobs, nsims)
    for i in 1:nsims
        sim_data[:,:,i] .= SimData(nobs,314159+i, cache)
    return sim_data

@btime LoopSim(nsims,nobs);
# 238.740 ms (15008 allocations: 39.10 MiB)

One weakness here is that sim_data[:,:] creates a copy, writes random numbers into that copy, then throws it away, so the generated random data just disappears. Remove the [:, :] and the data is written into the view, as is probably wanted.

On every iteration a Vector and a Matrix are needlessly allocated here. Perhaps use staticarrays for parameters, and any way hoist the creation of this distribution object out of the loop, since it appears to be constant for all iterations.


Thanks, yes, removing the unnecessary copy basically led to half the amount of allocations and using StaticArrays for the MvNormal also helped!

@jrhalket Here is my final attempt for the example where the random arrays should be added.

  • With non-changing random seed it would only need 13 allocations, e.g. all the vector operations are now allocation free.
  • With changing random seed, it needs 6 * nsims + 13 allocations, since changing the seed takes some allocations. One could use another random generator which needs less, but the default one seems the fastest.
using Random, Distributions, BenchmarkTools, StaticArrays
function simulate_repetitions(nsims, nobs)
    A = Array{Float64,2}(undef, 2, nobs)
    B = similar(A)
    sim_data = Array{Float64,3}(undef, 2, nobs, nsims)

    rng = Xoshiro(1)
    distr = MvNormal(@SVector([0.0, 0.0]), @SMatrix([1.0 0.0; 0.0 1.0]))
    @inbounds for i in 1:nsims
        Random.seed!(rng, 1234 + 314159 + i)
        rand!(rng, distr, A)
        rand!(rng, distr, B)
        @. sim_data[1:2,1:nobs,i] = A + B

    return sim_data
@btime simulate_repetitions(1000, 2500);
# 84.679 ms (6013 allocations: 38.67 MiB)
# without `Random.seed!(rng, ...)` it only needs 83.658 ms (13 allocations: 38.22 MiB)
# using the global random generator it needs 119.555 ms (7006 allocations: 38.68 MiB)

Thank you all for your careful attention to my query.
I think my “minimal” working example was perhaps too minimal still. I will overshoot in the other direction and provide below a working copy of my actual code at the moment.

The key function for my query is sim_data_like. It is the one that is called n_sim times per call of the likelihood function. As of now, it has “too much” memory allocated within it so that each call of the likelihood function demands more than n_sim times that amount of memory. (The function CompParams currently sets the amount of simulations. Eventually I will want to simulate at least n_sim=1000 times for n_firms=5000 – that will cause memory issues for sure.
I’ve put an @time call both at the bottom on the line that calls the likelihood function and also one one line within sim_data_like. The purpose of the latter is so that you can easily see that it is the allocation of the vector of svectors that is the offending line. The “math” parts of the code are currently lean - they, for the most part, do not ask for much memory. It is just the ~10 instance of allocating vectors of svectors within sim_data_like that adds up to the memory demands of that entire function. If I could reduce those only, I would trim much of the issue perhaps.
Some other notes: I know some other functions (e.g. SimRawData) are also flabby at the moment. Don’t worry about those unless they have a direct bearing on the memory demand of sim_data_like.

#setting up the linear program
# Required Packages
using Distributions, Random, LinearAlgebra, BenchmarkTools, SparseArrays
using StaticArrays, GLPK, Cbc, Ipopt, Optim, StatsBase, KernelDensitySJ
using Assignment, JuMP

function CompParams(;n_firms=50, n_sim=100, trim_percent=5, hmethod=1)
        n_firms = n_firms,
       # number of simulations for each iteration, i.e. n_sim 500 by 500 markets
       trim_percent = trim_percent, #which percentil as cutoff for trim fn
       hmethod = hmethod) #hmethod=1 --> Silverman's RoT. =2 -->   Sheather and Jones (1991) bandwidth
    return p

function StrucParams(b::Vector{Float64})
    Σ_m = exp(b[1])
    Σ_f = exp(b[2])
    Σ_u = [Σ_m 0.0;
           0.0  Σ_f]

    #Set up thetas
    theta_0 = [b[3] b[4];
                 b[5] b[6]]
    theta_1 = [b[7:8]';
    theta_2 = [b[11:12]';

    #set up Cov matrix for shocks to measures

    sigma_1 = exp(b[15])
    sigma_2 = exp(b[17])
    cov12 = -sqrt(sigma_1)sqrt(sigma_2) + 2.0*sqrt(sigma_1)sqrt(sigma_2)*(exp(b[16]))/(1.0+exp(b[16])) #ensuring pos semi def cov matrix

    Σ = [sigma_1 cov12;
          cov12 sigma_2]

    mean_price = b[18]

    rm = b[19:21]
    rf = b[22:24]

    risk = exp(b[25])  #exp to keep risk coefficient positive
    p = (
        Σ_m = Σ_m,   
        Σ_f = Σ_f,
        Σ_u = Σ_u,
        theta_0 = theta_0,
        theta_1 = theta_1,
        theta_2 = theta_2,
        Σ = Σ,
        mean_price = mean_price,
        rm = rm,
        rf = rf,
        risk = risk)  #exp to keep risk coefficient positive
    return p
function DrawUnobsData(n,Σ)
    svectorscopy(rand(MvNormal([0.0, 0.0],Σ),n), Val{2}()) 
function svectorscopy(x::Array{T}, ::Val{N}) where {T,N}
    size(x,1) == N || error("sizes mismatch")
    isbitstype(T) || error("use for bitstypes only")
    copy(reinterpret(SVector{N,T}, vec(x)))
function bstarCalc(ats::SVector{2, Float64},rΣ)
function SCalc(ats::SVector{2, Float64},bs::SVector{2, Float64})    
function eltorows(v::Vector{SVector{dim, T}}) where {dim, T} 
    copy(transpose(reshape(reinterpret(Float64, v), dim, :)))
function SumSVec(v::SVector{2, Float64}) 
function MultSVecs(v1::SVector{dim, T}, v2::SVector{dim, T}) where {dim, T} 
function ConcatSVecs(v1::Float64, v2::Float64) 
function a0abeps(a0::SVector{1, Float64},ab::SVector{2, Float64},eps::SVector{2, Float64}) 
    SVector{2,Float64}(a0 .+ ab .+ eps)
function absum(a::SVector{1, Float64},b::SVector{1, Float64},mp::Float64) 
    SVector{1,Float64}(a .+ b .+ mp)
function BMCalc(b::SVector{2, Float64},m::SVector{2, Float64})    

struct SimOutputs
    down_match :: Vector{SVector{3,Float64}} 
    wages_match :: Vector{SVector{1,Float64}} 
    measures_match :: Vector{SVector{2,Float64}}
function SimRawData(b::Vector{Float64})
    CR = CompParams();
    Raw = StrucParams(b);
    # Types
    # Upstream observed: x_1, x_2
    # Upstream unobs: delta^m
    # Downstream Observed: y_1, y_2
    # Downstream Unobs: delta^f  
    up_data = zeros(2, CR.n_firms);
    up_data[1,:] = rand(Uniform(0.0,1.0), CR.n_firms);
    up_data[2,:] = rand(Uniform(0.0,1.0), CR.n_firms);

    down_data = zeros(2, CR.n_firms);
    down_data[1,:] = rand(Uniform(0.0,1.0), CR.n_firms);
    down_data[2,:] = rand(Uniform(0.0,1.0), CR.n_firms);

    up_all = zeros(3, CR.n_firms);
    down_all = zeros(3, CR.n_firms);
    up_all[1:2,:] = up_data;
    down_all[1:2,:] = down_data;
    up_all[3,:] = rand(Normal(0.0,Raw.Σ_m[1,1]), CR.n_firms);
    down_all[3,:] = rand(Normal(0.0, Raw.Σ_f[1,1]), CR.n_firms);

    eps_measure = rand(MvNormal([0.0, 0.0],Raw.Σ),CR.n_firms);

    alpha_0 = up_all[1:2,:]'*Raw.theta_0*down_all[1:2,:];
    alpha_1 = up_all[1:2,:]'*Raw.theta_1*down_all[1:2,:];
    alpha_2 = up_all[1:2,:]'*Raw.theta_2*down_all[1:2,:];

    alpha_tilde = zeros(2,CR.n_firms,CR.n_firms);
    alpha_tilde[1,:,:] = alpha_0+alpha_1; 
    alpha_tilde[2,:,:] = alpha_0+alpha_2;

    bstar = zeros(2,CR.n_firms,CR.n_firms);
    for n1 in 1:CR.n_firms;
        for n2 in 1:CR.n_firms;
             bstar[:,n1,n2] = (alpha_tilde[:,n1,n2]*ones(1,2).+(Raw.risk*Raw.Σ))\alpha_tilde[:,n1,n2]

    S = zeros(CR.n_firms,CR.n_firms);
    for n1 in 1:CR.n_firms;
        for n2 in 1:CR.n_firms;
            S[n1,n2] = 0.5*alpha_tilde[:,n1,n2]'*bstar[:,n1,n2];
    #set up unobserved heterogeneity rhos:
    #assume rho_m = y(1)*rm(1)*delta_m + y(2)*rm(2)*delta_m + rm(3)*delta_m
    #assume rho_f = x(1)*rf(1)*delta_f + x(2)*rf(2)*delta_f + rf(3)*delta_f

    rho_m = ([down_all[1:2,:]' ones(CR.n_firms,1)]* Raw.rm * up_all[3,:]')';
    rho_f = [up_all[1:2,:]' ones(CR.n_firms,1)]* Raw.rf * down_all[3,:]';

    C = -S .- rho_m .- rho_f; #negative of pairwise quadratic surplus 

    match, up_profit_lp, down_profit_lp = find_best_assignment(C)
    down_match_lp=  Array{Float64, 2}(undef, 3, CR.n_firms);

    for i in 1:CR.n_firms
        down_match_lp[:,i] = down_all[:, match[i][2]];
    down_match_profit_lp =  Array{Float64, 1}(undef, CR.n_firms);
    for i in 1:CR.n_firms
        down_match_profit_lp[i] = down_profit_lp[match[i][2]];

  # minimum downstream profit =0 
    profit_diff = findmin(down_match_profit_lp)[1];
    down_match_profit_lp .= down_match_profit_lp .+ profit_diff;
    up_profit_lp .= up_profit_lp .- profit_diff;

    alpha_0_match = diag(up_all[1:2,:]'*Raw.theta_0*down_match_lp[1:2,:]);

    alpha_1_match = diag(up_all[1:2,:]'*Raw.theta_1*down_match_lp[1:2,:]);

    alpha_2_match = diag(up_all[1:2,:]'*Raw.theta_2*down_match_lp[1:2,:]);

    alpha_tilde_match = zeros(2,CR.n_firms);
    alpha_tilde_match[1,:] = alpha_0_match.+alpha_1_match; 
    alpha_tilde_match[2,:] = alpha_0_match.+alpha_2_match;

    bstar_match = zeros(2,CR.n_firms);
    measures_match = zeros(2,CR.n_firms); 
    for n1 in 1:CR.n_firms;
        bstar_match[:,n1] = (alpha_tilde_match[:,n1]*ones(1,2)+Raw.risk*Raw.Σ)\alpha_tilde_match[:,n1];
        #simulate matched measures
        measures_match[1,n1] = sum(alpha_0_match[n1]*bstar_match[:,n1]) + alpha_1_match[n1]*bstar_match[1,n1] + eps_measure[1,n1];
        measures_match[2,n1] = sum(alpha_0_match[n1]*bstar_match[:,n1]) + alpha_2_match[n1]*bstar_match[2,n1] + eps_measure[2,n1];

    S_match = zeros(CR.n_firms);
    for n1 in 1:CR.n_firms;
        S_match[n1] = 0.5*alpha_tilde_match[:,n1]'*bstar_match[:,n1]
    rho_m_match = diag(([down_match_lp[1:2,:]' ones(CR.n_firms,1)]* Raw.rm * up_all[3,:]')');
    rho_f_match = diag([up_all[1:2,:]' ones(CR.n_firms,1)]* Raw.rf * down_match_lp[3,:]');

    up_valuation = -S_match .+ rho_m_match; #manager gets -S_match + unobs value, net of transfers

    # eq transfers
    up_prices_lp = up_profit_lp .- up_valuation;

    #down_prices_lp= γ'*up_prices_lp;

    #simulate wages:
    wages_match = zeros(CR.n_firms)
    for n1 in 1:CR.n_firms;
        wages_match[n1] = Raw.mean_price + up_prices_lp[n1] + bstar_match[:,n1]'*measures_match[:,n1]
    down_match_ret = down_match_lp[1:2,:]
    out = (
        up_data = up_data, 
        down_match_ret = down_match_ret, 
        wages_match = wages_match, 
        measures_match = measures_match)

    return out
function sim_data_like(up_data_obs_in,down_data_obs_in,p_in::NamedTuple,i)
    CR = CompParams();
    # Types
    # Upstream x_1, x_2, delta^m
    # Downstream y_1, y_2, delta^f

    # Generate simulated x and y data

    up_data_sim  = Matrix{Float64}(undef, 3, CR.n_firms);
    up_data_sim[1:2,:] = up_data_obs_in;

    down_data_sim = similar(up_data_sim);
    down_data_sim[1:2,:] = down_data_obs_in;

    up_data_sim[3,:] = rand(Normal(0.0,p_in.Σ_u[1,1]), CR.n_firms);
    down_data_sim[3,:] = rand(Normal(0.0, p_in.Σ_u[2,2]), CR.n_firms);

    # noise to measures
    eps_measure_sim = Vector{SVector{2,Float64}}(undef,CR.n_firms)
    eps_measure_sim = DrawUnobsData(CR.n_firms , p_in.Σ)

    alpha_0_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}
    alpha_1_sim = similar_type(alpha_0_sim)
    alpha_2_sim = similar_type(alpha_0_sim)
    alpha_0_sim = SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_0)*SMatrix{2,CR.n_firms,Float64}(down_data_obs_in);
    alpha_1_sim = SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_1)*SMatrix{2,CR.n_firms,Float64}(down_data_obs_in);
    alpha_2_sim = SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_2)*SMatrix{2,CR.n_firms,Float64}(down_data_obs_in);

    alpha01 = similar_type(alpha_0_sim)
    alpha02 = similar_type(alpha_0_sim)
    alpha01 = alpha_0_sim + alpha_1_sim
    alpha02 = alpha_0_sim + alpha_2_sim
    alpha_tilde_simVec = Vector{SVector{2,Float64}}(undef,CR.n_firms*CR.n_firms)
    alpha_tilde_simVec .= ConcatSVecs.(vec(alpha01),vec(alpha02))
    bstar_sim = similar(alpha_tilde_simVec)
    rΣ = Matrix{Float64}(undef,2,2)
    rΣ = p_in.risk * p_in.Σ   
    bstar_sim .= bstarCalc.(alpha_tilde_simVec, Ref(rΣ))
    @time S_simvec = Vector{SVector{1,Float64}}(undef, CR.n_firms*CR.n_firms)
    S_simvec .= SCalc.(alpha_tilde_simVec,bstar_sim) 
    S_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}
    S_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}(reshape(eltorows(S_simvec), (CR.n_firms,CR.n_firms)))
    rho_m_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}
    rho_f_sim = similar_type(rho_m_sim)
    rho_m_sim = SMatrix{CR.n_firms,1,Float64}(up_data_sim[3,:]')*SMatrix{1,3,Float64}(p_in.rm)*SMatrix{3,CR.n_firms,Float64}([down_data_obs_in; ones(1, CR.n_firms)]);
    rho_f_sim = SMatrix{CR.n_firms,3,Float64}([up_data_obs_in' ones(CR.n_firms,1)])*SMatrix{3,1,Float64}(p_in.rf')*SMatrix{1,CR.n_firms,Float64}(down_data_sim[3,:]');
    C_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}
    C_sim = -S_sim - rho_m_sim - rho_f_sim; #negative of pairwise quadratic surplus 

    match_sim, up_profit_sim, down_profit_sim = find_best_assignment(C_sim)

    down_match_sim = similar(up_data_sim);

    for i in 1:CR.n_firms
        down_match_sim[:,i] = down_data_sim[:, match_sim[i][2]];

    down_match_profit_sim =  Array{Float64, 1}(undef, CR.n_firms);
    for i in 1:CR.n_firms
        down_match_profit_sim[i] = down_profit_sim[match_sim[i][2]];

    # minimum downstream profit =0 
    profit_diff_sim = findmin(down_match_profit_sim)[1];
    down_match_profit_sim .= down_match_profit_sim .+ profit_diff_sim;
    up_profit_sim .= up_profit_sim .- profit_diff_sim;

    alpha_0_match_sim = SVector{CR.n_firms,Float64}
    alpha_1_match_sim = similar_type(alpha_0_match_sim)
    alpha_2_match_sim = similar_type(alpha_0_match_sim)
    alpha_0_match_sim = diag(SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_0)*SMatrix{2,CR.n_firms,Float64}(down_match_sim[1:2,:]));   
    alpha_1_match_sim = diag(SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_1)*SMatrix{2,CR.n_firms,Float64}(down_match_sim[1:2,:]));
    alpha_2_match_sim = diag(SMatrix{CR.n_firms,2,Float64}(up_data_obs_in')*SMatrix{2,2,Float64}(p_in.theta_2)*SMatrix{2,CR.n_firms,Float64}(down_match_sim[1:2,:]));
    alpha01_match = similar_type(alpha_0_match_sim)
    alpha02_match = similar_type(alpha_0_match_sim)
    alpha01_match = alpha_0_match_sim + alpha_1_match_sim
    alpha02_match = alpha_0_match_sim + alpha_2_match_sim

    alpha_tilde_match_simVec = Vector{SVector{2,Float64}}(undef,CR.n_firms)
    alpha_tilde_match_simVec .= ConcatSVecs.(vec(alpha01_match),vec(alpha02_match))
    alpha_0_match_simVec = copy(reinterpret(SVector{1,Float64}, vec(alpha_0_match_sim)))
    bstar_match_sim = similar(alpha_tilde_match_simVec)
    bstar_match_sim .= bstarCalc.(alpha_tilde_match_simVec, Ref(rΣ))

    sumBstar = Vector{SVector{1,Float64}}(undef, CR.n_firms)
    sumBstar .= SumSVec.(bstar_match_sim)
    a0sumBstar = Vector{SVector{1,Float64}}(undef, CR.n_firms)
    a0sumBstar .= MultSVecs.(alpha_0_match_simVec,sumBstar)
    measures_match_sim = similar(bstar_match_sim)
    a12 = similar(bstar_match_sim)
    a12 .= ConcatSVecs.(vec(alpha_1_match_sim),vec(alpha_2_match_sim))
    ab = similar(bstar_match_sim)
    ab .= MultSVecs.(a12,bstar_match_sim)
    measures_match_sim .= a0abeps.(a0sumBstar,ab,eps_measure_sim)

    S_match_simvec = Vector{SVector{1,Float64}}(undef, CR.n_firms)
    S_match_simvec .= SCalc.(alpha_tilde_match_simVec,bstar_match_sim) 

    rho_m_match_sim = diag(SMatrix{CR.n_firms,1,Float64}(up_data_sim[3,:]')*SMatrix{1,3,Float64}(p_in.rm)*SMatrix{3,CR.n_firms,Float64}([down_match_sim[1:2,:]; ones(1, CR.n_firms)]));
    rho_m_match_simSvec = copy(reinterpret(SVector{1,Float64}, rho_m_match_sim))
    #rho_f_match_sim = diag([up_data_sim[1:2,:]' ones(CR.n_firms,1)]* p_in.rf * down_match_sim[3,:]');

    up_valuation_sim = -S_match_simvec .+ rho_m_match_simSvec; #manager gets -S_match + unobs value, net of transfers
    up_profit_simSvec = copy(reinterpret(SVector{1,Float64}, up_profit_sim))
    # eq transfers
    up_prices_sim = up_profit_simSvec .- up_valuation_sim;
    #down_prices_sim= γ_sim'*up_prices_sim;

    #simulate wages:

    wages_match_sim = Vector{SVector{1,Float64}}(undef, CR.n_firms)
    bm = Vector{SVector{1,Float64}}(undef, CR.n_firms)
    bm .= BMCalc.(bstar_match_sim,measures_match_sim)
    wages_match_sim .=  absum.(up_prices_sim, bm,p_in.mean_price);
    down_match_sim_out = Vector{SVector{3,Float64}}(undef, CR.n_firms)
    down_match_sim_out = svectorscopy(down_match_sim, Val{3}())
    down_match = down_match_sim_out 
    wages_match = wages_match_sim
    measures_match = measures_match_sim
    Simout = SimOutputs(down_match, wages_match, measures_match)
    return Simout

#liklihood function 
function loglikepr(b_est,b_cal,i_cal,up_data_obs,down_data_obs,wages_obs,measures_obs)
    # inputs are:
    # b_est = parameter values of parameters to be estimated
    # b_cal = parameters values of parameters to be calibrated
    # i_cal = index in b for parameters to be calibrated
    # up_data_obs - the observable part of up_data
    # down_data_obs - the obs part of down_data
    # wages_obs - obs wages
    # measures_obs = obs measures
    LLC = CompParams();
    ic = 1;
    ie = 1;
    nb = length(b_est)+length(b_cal);
    b = zeros(nb);
    for ib in 1:nb
        if ic<=length(i_cal)      
            if ib == i_cal[ic];
                b[ib] = b_cal[ic];
                ic += 1;
                b[ib] = b_est[ie]; 
                ie += 1;
            b[ib] = b_est[ie];
            ie += 1;

    LLp = StrucParams(b);

    @time sim_dat = Vector{SimOutputs}(undef,LLC.n_sim)
    sim_dat = [sim_data_like(up_data_obs,down_data_obs,LLp,314159+i) for i in eachindex(sim_dat)]

    # Bandwidths for y, wages, measures, cov of measures
    data = zeros(5,LLC.n_firms*LLC.n_sim);
    for i = 1:LLC.n_sim
        data[1,1+(i-1)*LLC.n_firms:i*LLC.n_firms] = view(down_data_obs,1,:) .- getindex.(sim_dat[i].down_match,1);
        data[2,1+(i-1)*LLC.n_firms:i*LLC.n_firms] = view(down_data_obs,2,:) .- getindex.(sim_dat[i].down_match,2);
        data[3,1+(i-1)*LLC.n_firms:i*LLC.n_firms] = (wages_obs) .- getindex.(sim_dat[i].wages_match,1);
        data[4,1+(i-1)*LLC.n_firms:i*LLC.n_firms] = view(measures_obs,1,:) .- getindex.(sim_dat[i].measures_match,1);
        data[5,1+(i-1)*LLC.n_firms:i*LLC.n_firms] = view(measures_obs,2,:) .- getindex.(sim_dat[i].measures_match,2);
   if (LLC.hmethod==1)
        #Silverman's Rule of thumb:
        Iqrnorm = 1.34;
        A = zeros(5);
        A[1] = min(std(data[1,:]),StatsBase.iqr(data[1,:])/Iqrnorm);
        A[2] = min(std(data[2,:]),StatsBase.iqr(data[2,:])/Iqrnorm);
        A[3] = min(std(data[3,:]),StatsBase.iqr(data[3,:])/Iqrnorm);
        A[4] = min(std(data[4,:]),StatsBase.iqr(data[4,:])/Iqrnorm);
        A[5] = min(std(data[5,:]),StatsBase.iqr(data[5,:])/Iqrnorm);
       h=0.9 * A * LLC.n_firms^(-0.2);
        #Using Sheather and Jones (1991) bandwidth
        h = zeros(5);
        h[1] = KernelDensitySJ.bwsj(data[1,:]); 
        h[2] = KernelDensitySJ.bwsj(data[2,:]);
        h[3] = KernelDensitySJ.bwsj(data[3,:]);
        h[4] = KernelDensitySJ.bwsj(data[4,:]);
        h[5] = KernelDensitySJ.bwsj(data[5,:]);
    like = zeros(LLC.n_firms);
    for i in 1:LLC.n_firms
        for j in 1:LLC.n_sim
                pdf(Normal(),(down_data_obs[1,i] - getindex(sim_dat[j].down_match[i],1))/h[1])
                *pdf(Normal(),(down_data_obs[2,i] - getindex(sim_dat[j].down_match[i],2))/h[2])
                *pdf(Normal(),(wages_obs[i] - getindex(sim_dat[j].wages_match[i],1))/h[3])
                *pdf(Normal(),(measures_obs[1,i] - getindex(sim_dat[j].measures_match[i],1))/h[4])
                *pdf(Normal(),(measures_obs[2,i] - getindex(sim_dat[j].measures_match[i],2))/h[5])
    if LLC.trim_percent > 0
        #set up trimming function following Fermanian and Salanie
        hall = prod(h);
        delta_trim = log(percentile(like,LLC.trim_percent))/log(hall); 
        if isinf(abs(delta_trim)) #if above trim fn yields computational zeros, then we need less smooth trim fn        
            hd = percentile(like,LLC.trim_percent)
            tau_trim =zeros(LLC.n_firms)
            tau_trim .= (like.-hd)./(2.0*hd) 
            tau_trim[like .< hd] .=0.0
            tau_trim[like .> 2.0*hd] .=1.0  
            maxtau = findmax(tau_trim) 
            hd =  hall^delta_trim; #ensure hd, hd3, hd4 are not computationally zero
            hd3 = hd^3.0;
            hd4 = hd^4.0;
            tau_trim = zeros(LLC.n_firms)
            tau_trim .= 4 .*(like.-hd).^3 ./ hd3 .- 3.0 .*(like.-hd).^4.0 ./ hd4;
            tau_trim[like .< hd] .=0.0
            tau_trim[like .> 2.0*hd] .=1.0
            maxtau = findmax(tau_trim)
        tau_trim = zeros(LLC.n_firms)
        maxtau = findmax(tau_trim)

    ll = 0.0 
    if tau_trim==0.0 || isnan(maxtau[1])
        ll = NaN
        for i in 1:LLC.n_firms
            if tau_trim[i]>0.0    
    println("parameter: ", round.(b_est, digits=5), " likelihood: ", ll/LLC.n_firms)
    return ll/LLC.n_firms 

function SetTrueParams()
  # Set up true parameters
  b_true = zeros(25); 
  #unobs heterogeneity
  Σ_m_true = exp(0.0);
  b_true[1] = log(Σ_m_true);
  Σ_f_true = exp(0.0);
  b_true[2] = log(Σ_f_true);

  #Set up thetas

  theta_0_true = [0.5 0.7; 
                0.3 0.1];
  b_true[3]= theta_0_true[1,1];
  b_true[4]= theta_0_true[1,2];
  b_true[5]= theta_0_true[2,1];
  b_true[6]= theta_0_true[2,2];

  theta_1_true = [0.2 0.05; 
                0.2 0.4];
  b_true[7]= theta_1_true[1,1];
  b_true[8]= theta_1_true[1,2];
  b_true[9]= theta_1_true[2,1];
  b_true[10]= theta_1_true[2,2];

  theta_2_true = [0.1 0.0;
                 0.0 0.2];
  b_true[11]= theta_2_true[1,1];
  b_true[12]= theta_2_true[1,2];
  b_true[13]= theta_2_true[2,1];
  b_true[14]= theta_2_true[2,2];
  #set up Cov matrix for shocks to measures

  sigma_1_true = 0.3;
  b_true[15] = log(sigma_1_true);
  sigma_2_true = 0.1;
  b_true[17] = log(sigma_2_true);
  #cov btw 1 &2:
  b_true[16]= 1.02;
  cov12_true = -sqrt(sigma_1_true)sqrt(sigma_2_true) + 2*sqrt(sigma_1_true)sqrt(sigma_2_true)*(exp(b_true[16]))/(1+exp(b_true[16])); #ensuring pos semi def cov matrix
  mean_price_true = 0.0;
  b_true[18] = mean_price_true;

  rm_true = [0.1 -0.3 0.2];
  b_true[19:21] = rm_true;
  rf_true = [0.0 0.2 0.1];
  b_true[22:24] = rf_true;

  risk_true = 0.05;
  b_true[25] = log(risk_true);
    return b_true

b_true = SetTrueParams();
Rawout = SimRawData(b_true);

igrid = Vector{Integer}(1:length(b_true));
ical = deleteat!(igrid,1);
b_cal = b_true[ical];
b_est = [b_true[1]+.01;
@time loglikepr(b_est,b_cal,ical,Rawout.up_data,Rawout.down_match_ret, Rawout.wages_match, Rawout.measures_match)

It seems that combining the DNF’s and jules’ tips would be good first steps.

  • There many examples in the code where arrays are allocated, but then that allocation is ignored since = instead of .= was used. Moreover, it is more performant to do operations in-place, e.g. with for loops.

Now, it is a bit of a recurring theme, but the use of StaticArrays might decrease the allocation counter, but neither SMatrix{50, 50, Float64} nor Vector{SVector{1,Float64}} give you real performance benefits. Such large matrices are like immutable structs with 2500 elements, that’s usually just a lot of work for the compiler and it will actually never work with n_firms = 5000!

There are many small things, but my #1 advice would be to remove all use of StaticArrays and by doing so it will be much easier to simplify and improve the code. Use .= and .+ where needed, ensure that you don’t allocate memory for computations on those square matrices. For complicated operations, use explicit for loops instead of vectorized operations with complicated types.

Now, about S_sim, C_sim and the other matrices of this size:
It could indeed make sense to cache those variables, e.g. create them as normal Matrix before calling the function.

Even with a normal matrix type you can avoid many allocations, for example instead of

rho_m_sim = SMatrix{CR.n_firms,CR.n_firms,Float64}
rho_m_sim = SMatrix{CR.n_firms,1,Float64}(up_data_sim[3,:]')*SMatrix{1,3,Float64}(p_in.rm)*SMatrix{3,CR.n_firms,Float64}([down_data_obs_in; ones(1, CR.n_firms)]);

you can just allocate rho_m_sim once before you call sim_data_like. E.g. with rho_m_sim = Matrix{Float64}(undef, CR.n_firms, CR.n_firms) and then write

for i in 1:CR.n_firms, j in 1:CR.n_firms
    rho_m_sim[i,j] = up_data_sum[3,i] * (p_in.rm[1] * down_data_obs_in[j] + p_in.rm[2] + p_in.rm[3])

This is easier to read and it avoids allocation for intermediate results as well. (In the same way you can rewrite most of the calculations.)
Sure, there are super cool packages which can do that kind of operations fast and in vector notation, but keeping it simple has also some advantages :wink:

You probably know about it, but just in case, there is a nice package called GitHub - KristofferC/TimerOutputs.jl: Formatted output of timed sections in Julia for timing different sections. And something I learned far too late is to use the profiler from VS Code or Juno. I often optimised parts of the code which turned out to be irrelevant, so using the profiler as a first tool is often useful.

Ooh. Yeah, definitely agree. All the static arrays are just causing a lot of chaos here. And there are two central points for StaticArrays: they should be small, and their sizes should be known by the compiler. So this

will not be beneficial at all. Unless I am mistaken, CR.n_firms is a value that is not known by the compiler, and it is not small. So this violates the assumptions that make them fast.

Totally agree with @SteffenPL, remove every mention of StaticArrays inside your functions. Generally, it’s better to write your code so that it will work with different array types, and then you call it with static arrays as inputs to reap performance benefits (though this is a bit too much to demonstrate here.)

In general, your code seems very complicated, and I think it should be possible to simplify it quite a bit. Things like the below are just very wasteful, instantiates Uniform(0, 1) four times, initializes up_data and sets its values to zero, then overwrites those zero values(!). Each rand call allocates a new array, which is then copied over in an inefficient way, along the rows of up_data/down_data.

and can be replaced with:

distr = Uniform(0.0,1.0)  # create distr once only
up_data = rand(distr, 2, CR.n_firms) # first allocated array
down_data = rand(distr, 2, CR.n_firms) # second allocated array

which is much simpler and allocates just two arrays instead of six, and only one distribution object instead of four (though Uniform(0,1) is very lightweight, this is much worse if you have something like MvNormal([...], [...])).

Your code seems to be very row-oriented:

Lots of operations along rows everywhere. But Julia arrays are column-major, so this is sub-optimal. Also each operation data[1, :] etc. allocates a new array, and each line does this twice.

1 Like