Hi Julia Speed Gurus,
I’m trying to demonstrate Julia’s cool capabilities for embedding shallow neural networks to select key parameters for an ensemble of differential equations. I can use the ensemble problem fine, and when I solve with threads I get a speedup vs. serial evaluation. But it seems like I could get much MORE speedup.
Let me describe the problem. Unfortunately I don’t have a minimal working example, so I understand if this gets no support; really I’m asking whether I’m approaching the concept correctly, or whether there’s a better way hidden deep down in the documentation.
This is what I have:
- A DiffEq-based simulation with states and parameters
- Several of the parameters (5-7 or so) need to be optimized based on the initial conditions of the states
- A Flux network (because I haven’t tried Lux yet and need to find a “how-to-convert-to-Lux” document) takes key initial conditions and chooses the best 5-7 parameters as its output; a cost function evaluates the end state of the differential equation, accumulated over the entire ensemble
- I create a random problem set to feed an ensemble of Monte Carlo runs so that the network sees a variety of start conditions
- I calculate a loss/cost for each instance of the ensemble and sum it over all training vectors
- This is all wrapped in an Optimization.jl problem solve. I use AutoForwardDiff for the gradients, as I haven’t been able to figure out all of the changes to my diffeq needed to make Zygote work.
- This all works (a rough toy sketch of the structure follows this list).
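To make the structure concrete without my real model, here is a minimal toy sketch of the shape of the setup. The ODE, network sizes, target value, and names (`rhs!`, `u0s`, `nn`, etc.) are all placeholders, not my actual code:

```julia
using OrdinaryDiffEq, Flux, Optimization, OptimizationOptimisers
using Random

# Toy 2-state ODE; p holds the handful of parameters the network must pick.
function rhs!(du, u, p, t)
    du[1] =  p[1] * u[1] - p[2] * u[1] * u[2]
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2]
end

# Shallow network: 2 initial conditions in, 4 parameters out (placeholder sizes).
nn = f64(Chain(Dense(2 => 16, tanh), Dense(16 => 4, softplus)))
θ0, re = Flux.destructure(nn)          # flat parameter vector + rebuilder

# Random bank of initial conditions so the network sees varied start conditions.
Random.seed!(1)
u0s = [rand(2) .+ 0.5 for _ in 1:50]

base_prob = ODEProblem(rhs!, u0s[1], (0.0, 10.0), ones(4))

function loss(θ, _)
    net = re(θ)
    prob_func = (prob, i, repeat) -> remake(prob; u0 = u0s[i], p = net(u0s[i]))
    ens = EnsembleProblem(base_prob; prob_func = prob_func)
    sols = solve(ens, Tsit5(), EnsembleThreads();
                 trajectories = length(u0s), save_everystep = false)
    # Cost on the end state, summed over the whole ensemble (toy target of 1.0).
    sum(sum(abs2, sol.u[end] .- 1.0) for sol in sols)
end

optf    = OptimizationFunction(loss, Optimization.AutoForwardDiff())
optprob = OptimizationProblem(optf, θ0)
res     = solve(optprob, OptimizationOptimisers.Adam(1e-2); maxiters = 100)
```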
Problem: the threading only seems to happen across the ensemble of diffeq solves. So if I have, say, 24 cores and 50 ensemble seeds, each call out to the thread pool isn’t very efficient. But my network has on the order of 600 parameters, so depending on the ForwardDiff chunk size, I run those 50 training vectors (threaded) about 600/chunksize times. Submitting 50 seeds at a time to 24 cores is much less efficient than submitting all 50*600/chunksize evaluations at once.
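A back-of-envelope on those numbers (the chunk-size cap of 12 is my assumption about ForwardDiff’s default; my actual chunk size may differ):

```julia
nparams, chunk, nseeds, ncores = 600, 12, 50, 24

nsweeps = cld(nparams, chunk)    # serial ForwardDiff passes per gradient ≈ 50
waves   = cld(nseeds, ncores)    # thread "waves" per pass = 3 (last one only 2/24 busy)
current_waves = nsweeps * waves               # ≈ 150 partially filled waves per gradient
pooled_waves  = cld(nsweeps * nseeds, ncores) # ≈ 105 nearly full waves if all solves were pooled
@show nsweeps waves current_waves pooled_waves
```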
How do I get the thread submission to include the gradient calculations as well? Is that even mathematically possible (he lazily asks)? Because if I could go distributed, I actually have 960 cores at my disposal right now on an evaluation cluster, so I could really get this moving with the distributed ensemble… at least in my dreams. At the very least, the single-node 24-core job would be more thread-overhead efficient.
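For reference, the distributed variant I have in mind would (I assume) just swap the ensemble algorithm, something like this untested sketch reusing the names from the toy example above; everything the `prob_func` closes over would need `@everywhere` definitions on the workers:

```julia
using Distributed
addprocs(24)                       # or a ClusterManager for the 960-core machine

@everywhere using OrdinaryDiffEq   # plus @everywhere versions of rhs!, u0s, the network, ...

sols = solve(ens, Tsit5(), EnsembleDistributed();
             trajectories = length(u0s), save_everystep = false)
```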
Thoughts on how I could enable more throughput?
Are there good examples out there?
Thanks for the great support. Love the Julia community!
Best Regards,
Allan