Help with parallelizing indirect inference estimation

I’m trying my hand at estimation via indirect inference, and created this script to experiment. I am wondering what is the best way to parallelize the computation. I think the critical functions are simulate and the line that computes θ̃ inside the function obj. I don’t have experience in parallel computing, so I am open to any suggestions.

using Optim
using Statistics
using Distributions
using Random


# true dgp
# y = exp(β ⋅ x) + u
# u ~ 𝒩(0, σ²)

function model(x, β)
    exp.(x * β)

# number of simulations
const M = 10

# size of data
const n = 1_000
const K = 5

# generate data
const σ² = 1.5
const β = rand(K)
const udgp = rand(Normal(0, σ²), n)
const x = hcat(fill(1.0, n), rand(Normal(), (n, K - 1)))
const ydgp = model(x, β) + udgp

# auxiliary model
# y = θ ⋅ x + ε
# ε ~ 𝒩(0, σε²)

# estimator of auxiliary model
function ols(x, y)
    xt = transpose(x)
    inv(xt * x) * xt * y

# generate simulations
const u0 = [randn(n) for i in 1:M]

function simulate(x, β; u0=u0, M=M)
    [model(x, β) + u0[i] for i in 1:M]

# estimate of auxiliary model
const θ̂ = ols(x, ydgp)

# criterion to be minimized
criterion(θ̃; θ̂=θ̂) = sum((θ̂ - θ̃).^2)

# indirect inference objective function
function obj(β)
    ys = simulate(x, β)
    θ̃ = mean([ols(x, ys[i]) for i in 1:M])
    return criterion(θ̃)

# initial values of parameters of true model
β0 = ones(K)

# optimization
opt = optimize(obj, β0, method=ConjugateGradient(), autodiff=:forward)

# indirect inference estimates (of the true model)
β̂ = Optim.minimizer(opt)

I think as of now there are probably a few things to consider before going parallel immediately.
As far as I can tell, you’re right in that the function obj is likely the critical part, but it is necessary to really understand which part you want to do in parallel.
Do you want to speed up the computation that is in here, or do you want to compute several instances of this simulation in parallel? If the latter is the case, its probably easiest to use the pmap function from Distributed.
But actually, I think at this stage your code could likely gain much better benefits from removing memory allocations:

In this code for instance you create several new big matrices and vectors that you don’t have to.
Usually, for performance critical computations, it is recommended to allocate matrices and vectors before the computation and re-use it whenever possible by using function (indicated by an ! after their name) that write directly into them.
For instance, for multiplying matrices, you can use mul!(Res,A,B) from LinearAlgebra which uses the pre-allocated matrix Res to store the result. You admittedly lose a bit of the idiomatic nature of your code but I am sure it will be worth it.

You can check if your code allocates using BenchmarkTools.@btime. If performance of your code is still requiring you to parallelize, after reducing allocations significantly, then it might work to parallelize over this for loop using pmap

    θ̃ = mean([ols(x, ys[i]) for i in 1:M])

This code looks pretty good to me. Nice work! For threads parallelism, you need for loops, and you don’t have them here because of the iterator approach. I have done indirect inference using both threads and iterators, and I found that the iterator approach leads to cleaner code, and ends up being just about as fast. I suppose that threading suffers from cache contention, when the model is complex enough to use some memory. I have yet to learn how to parallelize code that uses the iterator approach, so I will be paying attention if solutions are offered.

If you happen to run Monte Carlo simulations, the iterator approach is quite a bit superior, in my experience, because the natural way to parallelize is across the Monte Carlo replications.

The file Econometrics/EstimateSV.jl at master · mcreel/Econometrics · GitHub does method of simulated moments estimation, which is pretty much the same as indirect inference, at least in terms of parallelization opportunities. You can explore the effect of threads-bases parallelization by adding Threads.@threads before the for loops in the auxstat function in SVlib.jl. I find that going from no threads to 4 threads, I get about a 10% speedup, which, for me, is not worth it. I prefer the iterator style code.

This resource seems to have information about parallelizing iterator based code: A quick introduction to data parallelism in Julia