Improving performance (stochastic path + Ipopt/JuMP optimization on 32 cores/132RAM)

Hello! I’ve created a gist with some Julia code. Basically, I generate a path of a stochastic process (an array of size 1000), use it to estimate one of the parameters (via nonlinear optimization of the maximum-likelihood surface), and repeat both steps (generation and estimation) 500 times for each combination of parameters. I’ve implemented some naive strategies to run it in parallel (32 CPUs) and improve performance. I noticed that the memory allocation is high (a couple of TiB). Could someone suggest modifications and/or general advice that could improve the performance?

I don't think parallelizing the simulation of a single path is useful, because the process is sequential: each step depends on the previous one. Is it possible to parallelize over the 500 replicates instead?
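A minimal sketch of parallelizing over replicates with Threads.@threads. Since the gist isn't reproduced here, it assumes (hypothetically) an AR(1) process and uses its closed-form conditional MLE as a stand-in for the JuMP/Ipopt step, so it does not address the Ipopt segfault. The names simulate_and_estimate and run_replicates are made up for illustration. Julia has to be started with threads, e.g. `julia --threads 32`.

```julia
using Random
using Base.Threads

# Hypothetical stand-in for one replicate: simulate an AR(1) path
# x[t] = ϕ * x[t-1] + σ * ε[t] and return the conditional MLE of ϕ.
# In the real code, this is where the JuMP/Ipopt estimation would go.
function simulate_and_estimate(ϕ, σ, n, rng)
    x = zeros(n)
    for t in 2:n
        x[t] = ϕ * x[t-1] + σ * randn(rng)
    end
    num = 0.0
    den = 0.0
    for t in 2:n
        num += x[t-1] * x[t]
        den += x[t-1]^2
    end
    return num / den
end

# Run `nrep` independent replicates in parallel, one loop iteration per replicate.
function run_replicates(ϕ, σ; n = 1000, nrep = 500)
    estimates = Vector{Float64}(undef, nrep)
    @threads for r in 1:nrep
        rng = Xoshiro(r)   # per-replicate RNG, seeded from the replicate index
        estimates[r] = simulate_and_estimate(ϕ, σ, n, rng)
    end
    return estimates
end

est = run_replicates(0.5, 1.0)
println(sum(est) / length(est))   # should be close to 0.5
```

Giving each replicate its own RNG seeded from its index keeps the results reproducible no matter how the iterations are distributed across threads.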

If you want to improve performance, start with a single-threaded version and find out which function or which line of code is slow. Profiling helps with this, and I suggest using the profiler integrated with your IDE.
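For reference, here is a sketch using the Profile standard library directly, on a hypothetical allocation-heavy function (workload is a made-up name; substitute your own single-threaded code). In VS Code, the Julia extension's @profview macro shows the same kind of samples as a flame graph.

```julia
using Profile
using Random

# Hypothetical workload with many temporary arrays, typical of vectorized code.
function workload(rng, n)
    total = 0.0
    for _ in 1:200
        x = randn(rng, n)
        y = x .^ 2 .+ 1.0       # allocates a temporary array
        total += sum(log.(y))   # allocates another one
    end
    return total
end

rng = Xoshiro(1)
workload(rng, 1000)   # run once so compilation is not part of the profile

Profile.clear()
@profile for _ in 1:50
    workload(rng, 1000)
end

# Flat listing, sorted by the number of samples that landed in each line.
Profile.print(format = :flat, sortedby = :count, mincount = 10)
```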

You can also check how much memory individual functions allocate by putting @time in suitable places.
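A small illustration with two hypothetical sum-of-squares functions: @time prints elapsed time and allocations, while @allocated returns the allocated bytes as a number, which is handy for comparisons. The first call of a function includes compilation, so each one is called once before measuring.

```julia
sumsq_vec(x) = sum(x .^ 2)          # builds a temporary array x .^ 2, then sums it
sumsq_nonalloc(x) = sum(abs2, x)    # applies abs2 elementwise, no temporary array

x = randn(1000)

# Warm up: the first call includes compilation time and allocations.
sumsq_vec(x); sumsq_nonalloc(x)

@time sumsq_vec(x)
@time sumsq_nonalloc(x)

# Wrapping @allocated in a function avoids counting global-variable overhead.
bytes(f, x) = @allocated f(x)
bytes(sumsq_vec, x); bytes(sumsq_nonalloc, x)   # warm up
println(bytes(sumsq_vec, x), " bytes vs ", bytes(sumsq_nonalloc, x), " bytes")
```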

It looks like the code has a lot of vector operations, each of which allocates a new array. In general, unlike in MATLAB or R, it is better for performance in Julia to use loops instead of vector operations.
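To illustrate, here is the conditional log-likelihood of a Gaussian AR(1) process written both ways. The AR(1) model is an assumption chosen for illustration, not necessarily the process in the gist, and the function names are hypothetical.

```julia
# Conditional log-likelihood of x[t] = ϕ x[t-1] + σ ε[t], ε[t] ~ N(0, 1):
#   ℓ(ϕ, σ) = Σ_{t=2}^{n} log N(x[t]; ϕ x[t-1], σ²)

# Vectorized, MATLAB/R style: x[2:end], x[1:end-1], the residual vector r
# and r .^ 2 are all freshly allocated arrays.
function loglik_vec(x, ϕ, σ)
    r = x[2:end] .- ϕ .* x[1:end-1]
    n = length(r)
    return -n * log(σ) - n / 2 * log(2π) - sum(r .^ 2) / (2σ^2)
end

# Loop version: same arithmetic, no temporary arrays.
function loglik_loop(x, ϕ, σ)
    ss = 0.0
    for t in 2:length(x)
        r = x[t] - ϕ * x[t-1]
        ss += r * r
    end
    n = length(x) - 1
    return -n * log(σ) - n / 2 * log(2π) - ss / (2σ^2)
end

x = randn(1000)
println(loglik_vec(x, 0.5, 1.0), "  ", loglik_loop(x, 0.5, 1.0))
```

If you prefer to keep the vectorized style, wrapping the expression in @views turns the slices x[2:end] and x[1:end-1] into views instead of copies, although the residual vector itself is still allocated.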


Your gist is a little too large for casual performance tuning. I would try to boil it down to something you can post in a comment; it will get a better response.


Unfortunately, the optimizer (Ipopt) always returns an error (segmentation fault) whenever I try to parallelize it (even with just 2 threads). I’ll follow your advice and start with a single-threaded version. Thank you!

If you get segfaults with @threads, you might want to try Distributed instead.
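A sketch of the Distributed route with pmap, again using a hypothetical AR(1) simulation and closed-form estimator as placeholders for the real simulation and Ipopt model. Workers are separate Julia processes, so they don't share memory with each other or with the main process; every package and function they call has to be loaded on each of them with @everywhere.

```julia
using Distributed
addprocs(2)   # on the 32-core machine this would be e.g. addprocs(32)

# Load packages and define functions on every worker process.
@everywhere using Random
@everywhere function simulate_and_estimate(ϕ, σ, n, seed)
    rng = Xoshiro(seed)
    x = zeros(n)
    for t in 2:n
        x[t] = ϕ * x[t-1] + σ * randn(rng)
    end
    num = 0.0
    den = 0.0
    for t in 2:n
        num += x[t-1] * x[t]
        den += x[t-1]^2
    end
    return num / den   # placeholder for the JuMP/Ipopt estimate
end

# pmap hands one seed at a time to whichever worker is free.
estimates = pmap(seed -> simulate_and_estimate(0.5, 1.0, 1000, seed), 1:500)
println(sum(estimates) / length(estimates))
```

If the packages live in a project environment, `addprocs(32; exeflags="--project=$(Base.active_project())")` starts the workers in the same environment as the main process.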

Could you file an issue at Ipopt.jl with a reproducer for the segfault? It would be nice to figure out what the issue is.

-viral


I’m not having much luck running this code. For example, if I comment out everything in the function run after the line df = hcat(...) and then call run(), I get an error saying that the two things being hcat-ed have incompatible shapes. It’s possible that some part of your code is failing in a more basic way that, for some reason, isn't giving you a helpful error message. Either way, whatever is going on is plausibly not caused by threading or Ipopt at all.
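As a reminder of what that error means: hcat needs every argument to have the same number of rows, and a vector that is one element short (for example, from an off-by-one range) is enough to trigger it. checked_hcat is a hypothetical helper, not part of Base, that reports the row counts before concatenating.

```julia
a = rand(1000)   # 1000 rows
b = rand(999)    # one row short, e.g. from an off-by-one loop range

# hcat requires every argument to have the same number of rows.
try
    hcat(a, b)
catch err
    println(typeof(err))   # prints DimensionMismatch
end

# Hypothetical helper: check and report the row counts before concatenating.
function checked_hcat(cols...)
    nrows = map(c -> size(c, 1), cols)
    all(==(first(nrows)), nrows) ||
        throw(DimensionMismatch("row counts differ: $(nrows)"))
    return hcat(cols...)
end

m = checked_hcat(a, 2 .* a)   # fine: a 1000×2 matrix
```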

As has already been suggested, though, the gist is a little hard to read with a specific eye to parallelism issues. I only see two Threads.@threads annotations, and both seem to be on loops over 1000 elements. What would be more helpful for us to look at, and would probably also run more efficiently, is a self-contained function that does one simulation and estimation. Then just pass that function to ThreadsX.map or pmap or something similar, and check whether the speedup is as good as you expect.
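A sketch of that structure, assuming a hypothetical one_replicate function (an AR(1) simulation plus a closed-form estimate, standing in for the JuMP/Ipopt model). All parameter combinations and seeds are flattened into one job list and each job runs as a task; ThreadsX.map(f, jobs), or pmap(f, jobs) from Distributed, could replace the @spawn/fetch pair.

```julia
using Random

# Hypothetical compartmentalized unit of work: one simulation plus one
# estimation for one parameter combination and one seed.
function one_replicate(ϕ, σ, seed; n = 1000)
    rng = Xoshiro(seed)
    x = zeros(n)
    for t in 2:n
        x[t] = ϕ * x[t-1] + σ * randn(rng)
    end
    num = 0.0
    den = 0.0
    for t in 2:n
        num += x[t-1] * x[t]
        den += x[t-1]^2
    end
    return num / den   # placeholder for the JuMP/Ipopt estimate
end

# One flat job list: every parameter combination × 500 seeds, so the
# scheduler can balance all of the work across the available threads.
params = [(ϕ, σ) for ϕ in (0.2, 0.5, 0.8) for σ in (0.5, 1.0)]
jobs = [(p, seed) for p in params for seed in 1:500]

# One task per job; fetch collects the results in the order of `jobs`.
tasks = [Threads.@spawn one_replicate(p[1], p[2], seed) for (p, seed) in jobs]
results = fetch.(tasks)

# Jobs are grouped by parameter combination, 500 consecutive seeds each.
for (i, p) in enumerate(params)
    chunk = results[(i - 1) * 500 + 1 : i * 500]
    println(p, " => mean estimate ", sum(chunk) / 500)
end
```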

EDIT: oh, geez, I didn’t see that you had already posted an example with a different MWE on the Ipopt page. Whoops.