Newbie Question: Julia optimization multiprocessing

Hi, I’m trying to figure out whether Julia is a good language for a project I’m undertaking. One thing I’ll need to do is optimize a complex statistical function (Hidden Markov / CRF related) involving potentially 12K parameters. I’m thinking that L-BFGS would be a good optimization method.

I’m wondering whether Julia’s implementation of this is set up to do optimization with multiple cores and perhaps distributed over multiple computers? I imagine that it shouldn’t be hard to run several separate optimizations simultaneously (tasks?)? Any idea if optimization speeds in Julia approach that of C? (Then again, I hear setting up multi-processing in C is quite involved. Am trying to decide between Java/Scala, C, and Julia; am somewhat familiar with Java).

Jason

The L-BFGS implementation in Optim.jl is not parallelized, but that doesn’t seem like it would be important for your use case, since it’s usually the objective function and its derivatives that need to be sped up, not the L-BFGS internal steps. So I’d guess the performance questions depend on whether you feel confident in your ability to implement those functions as efficiently as you would in C.
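To make that concrete, here is a minimal sketch of an L-BFGS run in Optim.jl with a hand-written gradient. The Rosenbrock function is only a stand-in for the real negative log-likelihood, and the `g!(G, x)` argument order assumes a recent Optim.jl release (older releases used a different order).

```julia
using Optim

# Toy objective standing in for the real negative log-likelihood (illustrative only).
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# In-place gradient. The objective and this function are where the run time goes,
# so they are the parts worth optimizing or parallelizing.
function rosenbrock_grad!(g, x)
    g[1] = -2.0 * (1.0 - x[1]) - 400.0 * x[1] * (x[2] - x[1]^2)
    g[2] = 200.0 * (x[2] - x[1]^2)
    return g
end

result = optimize(rosenbrock, rosenbrock_grad!, [-1.2, 1.0], LBFGS())
xmin = Optim.minimizer(result)
```

If no gradient function is passed, `optimize(f, x0, LBFGS())` falls back to a numerical approximation, which is usually much slower for thousands of parameters.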

Thanks for the reply!

I’m currently running L-BFGS in Stan, which compiles to C++ and also runs unparallelized. The HMM code in Stan uses fairly standard efficient algorithms. If there is a closed-form derivative for this, I imagine it would be quite ugly.

Given what I’m seeing in Stan on artificial data with relatively few parameters, I’m guessing it could take days to optimize the parameters with a sufficient quantity of real data and a realistic number of hidden states and emissions. Thus my interest in parallelization.

Such analysis would, however, require several random starts to help reduce the chance that I’m at a local optimum (or to test alternative values for the number of hidden states), so I guess I can save time by simply running separate optimizations in parallel (which is possible?)–if this doesn’t require too much memory. There would be advantages to being able to run one analysis in parallel, but probably not critical ones.
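For reference, a minimal sketch of running several random starts at once with threads. The objective, the start range, and the `multistart` helper are all hypothetical, and Julia has to be started with more than one thread (for example `julia -t 4`) for the runs to actually overlap.

```julia
using Optim
using Random

# Hypothetical multimodal objective, standing in for the HMM negative log-likelihood.
objective(x) = sum(abs2, x) + 2.0 * sum(sin.(3.0 .* x))

# Run `nstarts` independent L-BFGS optimizations from random starting points,
# one per loop iteration, and return the best result along with all results.
function multistart(f, dim, nstarts; seed = 1)
    rng = MersenneTwister(seed)
    starts = [4.0 .* rand(rng, dim) .- 2.0 for _ in 1:nstarts]
    results = Vector{Any}(undef, nstarts)
    Threads.@threads for i in 1:nstarts
        results[i] = optimize(f, starts[i], LBFGS())
    end
    best = argmin([Optim.minimum(r) for r in results])
    return results[best], results
end

best, all_results = multistart(objective, 3, 8)
```

Each run keeps its own copy of the parameters and L-BFGS history, so memory grows roughly linearly with the number of simultaneous starts. To spread the starts over several processes or machines instead, the same loop could be written with `pmap` from the Distributed standard library.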

Incidentally, is it straightforward to get an estimated Hessian for the L-BFGS optimized parameters in Julia? I didn’t notice an option for this.

Thanks again, Jason

You can usually get exact Hessians using the ForwardDiff library and calling the hessian function. That’s what I do to estimate observed Fisher information in my work.
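A minimal sketch of that, assuming a toy normal model in place of the HMM (the data, the `negloglik` function, and the parameterization are made up for illustration):

```julia
using ForwardDiff
using LinearAlgebra
using Statistics

# Hypothetical data and negative log-likelihood (constants dropped):
# i.i.d. normal observations with parameters (mean, log standard deviation).
data = [1.2, 0.7, 1.9, 1.1, 0.4]

function negloglik(theta)
    mu, logsigma = theta[1], theta[2]
    sigma = exp(logsigma)
    return sum(0.5 .* ((data .- mu) ./ sigma) .^ 2) + length(data) * logsigma
end

# For this toy model the MLE is available in closed form; in practice this
# would be the minimizer returned by the optimizer.
mu_hat = mean(data)
sigma_hat = sqrt(mean((data .- mu_hat) .^ 2))
theta_hat = [mu_hat, log(sigma_hat)]

# Exact Hessian of the negative log-likelihood at the optimum
# (the observed Fisher information) and the implied covariance estimate.
H = ForwardDiff.hessian(negloglik, theta_hat)
cov_est = inv(Symmetric(H))
```

The objective has to be written generically (no hard-coded `Float64` containers) so that ForwardDiff can push its dual numbers through it.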

Incidentally, I found a few potential solutions for obtaining nicely
parallelized L-BFGS code from within Julia.

One might be to use Julia’s ability to call Python (which looks really
good) and then have Python run Apache Spark, which has well parallelized
L-BFGS.

Another possibility is to utilize the TAO component of the C++ package
PETSc, which can utilize GPUs, multiple cores, and multiple machines.
Would likely have to write a C wrapper though and hope there are no hiccups
with data transfer.

And, of course, there’s the possibility of adding some multi-core
parallelization to Julia itself.

Julia already has it. See the docs.

See PETSc.jl

Most likely, as @johnmyleswhite noted, the speedups will come not from parallelizing L-BFGS itself, but from parallelizing your objective function calculation or giving it faster (parallel) Jacobian or Hessian functions. Such derivatives can be computed very easily using ForwardDiff, and ForwardDiff has multithreading built in (though not multiprocessing: you’d have to set that up using Julia’s parallelism features). Use ProfileView.jl first to find out where the actual bottleneck is; otherwise it’s hard to know what the actual problem is.
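As a rough illustration of parallelizing the objective itself: when the likelihood is a sum over independent sequences, the terms can be split across threads. The per-sequence function below is a hypothetical placeholder (for an HMM it would be the forward-algorithm log-likelihood of one sequence), and the sketch assumes at least one sequence and Julia 1.3 or later for `Threads.@spawn`.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical per-sequence term of the negative log-likelihood.
seq_negloglik(theta, seq) =
    sum(abs2, seq .- theta[1]) / (2 * exp(2 * theta[2])) + length(seq) * theta[2]

# Split the sequences into roughly equal chunks, sum each chunk in its own task,
# then add up the partial sums. No shared state is mutated, so no locking is needed.
function threaded_negloglik(theta, seqs; ntasks = nthreads())
    chunks = Iterators.partition(seqs, cld(length(seqs), ntasks))
    tasks = [@spawn sum(s -> seq_negloglik(theta, s), chunk) for chunk in chunks]
    return sum(fetch, tasks)
end

seqs = [collect(range(0.0, 1.0; length = 5)) .+ i for i in 1:10]
theta = [2.0, 0.3]
total = threaded_negloglik(theta, seqs)
```

Because nothing here is tied to `Float64`, the same function can also be handed to an automatic differentiation package. Whether this pays off depends on how expensive each sequence is relative to the task overhead, which is what profiling should tell you.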

And apparently there’s a Spark.jl as well–perfect.

Thanks Chris, this is all very helpful!

Hello,

I was searching for a thread like this and thought I might ask this here, as it seems most relevant to the above discussion. Is it possible to pass a parallel function, i.e. a function being evaluated using pmap or @distributed, to ForwardDiff.hessian()? For example, would it make sense to implement

ForwardDiff.hessian(x -> my_parallel_function(x), evaluate_hessian_at, cfg, Val{false}())

where cfg = ForwardDiff.HessianConfig(x -> my_parallel_function(x), evaluate_hessian_at, ForwardDiff.Chunk{n}()), with n <= length(x)?
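To make the question concrete, here is a minimal sketch of the thread-based version of this pattern. It uses `Threads.@spawn` rather than pmap or @distributed, so it says nothing about whether dual numbers survive being shipped to worker processes. The body of `my_parallel_function` is a made-up quadratic, and the chunk size of 2 is arbitrary.

```julia
using ForwardDiff
using Base.Threads: @spawn

# Hypothetical parallel objective: f(x) = sum_i i * x[i]^2, with the index range
# split in two and each half summed in its own task. It is generic in the element
# type, so ForwardDiff's dual numbers can pass through it.
function my_parallel_function(x)
    chunks = Iterators.partition(eachindex(x), cld(length(x), 2))
    tasks = [@spawn sum(i -> i * x[i]^2, idx) for idx in chunks]
    return sum(fetch, tasks)
end

evaluate_hessian_at = [1.0, 2.0, 3.0, 4.0]

# Passing the same function object to both HessianConfig and hessian keeps the
# default tag check happy, so Val{false}() is not needed in this sketch.
cfg = ForwardDiff.HessianConfig(my_parallel_function, evaluate_hessian_at,
                                ForwardDiff.Chunk{2}())
H = ForwardDiff.hessian(my_parallel_function, evaluate_hessian_at, cfg)
```

With two separate `x -> my_parallel_function(x)` closures, as in the call above, the config and the call see different function types, which is presumably why the check has to be disabled there.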