I played around with this a bit. With M=1000, I get about a 30-40% speedup using 2 threads versus one, and using more threads does not help. My belief is that garbage collection interferes with threaded performance because, from what I have read, any garbage collection stops all threads. So, with more threads, the chance that a collection is triggered and stops all of them goes up. In some experimentation I have found that MPI gives good performance; see "An embarrassingly parallel problem: threads or MPI?". To limit garbage collection when using threads, avoiding allocations is a good strategy, I believe.
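One way to check whether allocations are what hurts is to compare the bytes allocated and GC time reported by `@timed`. The sketch below is only an illustration: `alloc_heavy` and `alloc_light` are hypothetical toy functions, not part of the indirect inference problem, and the sizes are arbitrary.

```julia
using Random

# Hypothetical workload: every iteration allocates a fresh vector.
function alloc_heavy(n, k)
    s = 0.0
    for _ in 1:n
        v = rand(k)        # new heap allocation on each pass
        s += sum(v)
    end
    return s
end

# Same work, but one buffer is allocated up front and reused.
function alloc_light(n, k)
    s = 0.0
    v = zeros(k)
    for _ in 1:n
        rand!(v)           # fill the existing buffer in place
        s += sum(v)
    end
    return s
end

# Warm up first so compilation is not included in the measurement.
alloc_heavy(10, 10); alloc_light(10, 10)

heavy = @timed alloc_heavy(10_000, 1_000)
light = @timed alloc_light(10_000, 1_000)
println("heavy: ", heavy.bytes, " bytes, GC time ", heavy.gctime, " s")
println("light: ", light.bytes, " bytes, GC time ", light.gctime, " s")
```

If the threaded loop shows a large `gctime` relative to total time, that would be consistent with the garbage collection explanation, though it does not prove it.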
The threaded code I used for this problem is:
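# indirect inference objective function
function iiobj(β, θ, x, u0, M)
    ys = simulate(x, β, u0, M)
    # one column per simulation: threads never write to the same memory,
    # which avoids the data race of doing m .+= ... on a shared vector
    ms = zeros(size(x,2), M)
    Threads.@threads for i in 1:M
    #for i in 1:M
        ms[:,i] = ols(x, ys[i])
    end
    m = vec(sum(ms, dims=2)) ./ M
    return sum((θ - m).^2)
end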
For optimization, I set the criterion using an anonymous function:
obj = β -> iiobj(β, θ, x, u0, M)
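Here is a minimal sketch of one way to accumulate across threads without every thread writing to the same vector: the range 1:M is split into chunks, and each task sums into its own private accumulator before the partial sums are combined. The `simulate` and `ols` definitions below are hypothetical stand-ins (the real ones are not shown in this post), and `iiobj_chunked`, `iiobj_serial`, and the `ntasks` keyword are names I made up for the example.

```julia
using LinearAlgebra   # least-squares solve x \ y
using Random

# Hypothetical stand-ins for simulate and ols; u0 is assumed to be n × M.
simulate(x, β, u0, M) = [x * β .+ u0[:, i] for i in 1:M]
ols(x, y) = x \ y

function iiobj_chunked(β, θ, x, u0, M; ntasks = Threads.nthreads())
    ys = simulate(x, β, u0, M)
    k = size(x, 2)
    # split 1:M into contiguous chunks, roughly one per task
    chunks = Iterators.partition(1:M, cld(M, ntasks))
    tasks = map(chunks) do idx
        Threads.@spawn begin
            acc = zeros(k)            # private to this task, so no data race
            for i in idx
                acc .+= ols(x, ys[i])
            end
            acc
        end
    end
    m = sum(fetch.(tasks)) ./ M       # combine the partial sums
    return sum((θ .- m) .^ 2)
end

# Serial reference for checking the threaded result.
function iiobj_serial(β, θ, x, u0, M)
    ys = simulate(x, β, u0, M)
    m = sum(ols(x, y) for y in ys) ./ M
    return sum((θ .- m) .^ 2)
end

Random.seed!(1)
n, k, M = 200, 3, 1000
x = randn(n, k); u0 = randn(n, M); θ = ones(k)
obj = β -> iiobj_chunked(β, θ, x, u0, M)
println(obj(ones(k)), " vs ", iiobj_serial(ones(k), θ, x, u0, M))
```

This only removes the shared accumulator; each `ols` call still allocates its result, so whether it actually reduces garbage collection pressure for the real problem would need to be measured.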