DiffEqGPU Trajectory Failure Handling and Heterogeneous Trajectories

Hi,

I have a general question about the suitability of using GPUs for the kind of problem I’m working on.

So far, I’ve had good success using DiffEqGPU for parameter sweeps, specifically in cases where the parameters don’t lead to errors and all trajectories require approximately the same number of steps to solve. In these scenarios, the GPU version (using GPURodas5P with EnsembleGPUKernel) is significantly faster than the CPU version (Rodas5P with EnsembleThreads). All runs were performed using Float32.
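For reference, a minimal sketch of the kind of setup being compared (the toy ODE, parameter sampling, and trajectory count are all illustrative, not the actual problem):

```julia
using OrdinaryDiffEq, DiffEqGPU, StaticArrays, CUDA

# Toy out-of-place ODE in Float32; `p` is swept across trajectories.
function f(u, p, t)
    du1 = p[1] * (u[2] - u[1])
    du2 = u[1] * (p[2] - u[2])
    return SVector{2}(du1, du2)
end

u0 = @SVector Float32[1.0, 0.0]
tspan = (0.0f0, 10.0f0)
p = @SVector Float32[10.0, 28.0]
prob = ODEProblem{false}(f, u0, tspan, p)

# Each trajectory gets its own (here random) parameter set.
prob_func = (prob, i, repeat) -> remake(prob; p = @SVector rand(Float32, 2))
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)

# GPU: one trajectory per GPU thread.
sol_gpu = solve(eprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories = 10_000, dt = 0.01f0)

# CPU baseline: Rodas5P across CPU threads.
sol_cpu = solve(eprob, Rodas5P(), EnsembleThreads(); trajectories = 10_000)
```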

Handling Trajectory Failures on the GPU
I’m now working on a different problem where some parameter sets cause the solver to fail. When using the GPU, if even a single trajectory fails, the entire job terminates. On the CPU, however, failed trajectories are aborted, and the rest continue running.

I’m not sure if unstable_check can help in the GPU case, but even if it could, it would require me to know in advance what conditions cause failures. The CPU version handles this automatically by aborting failed trajectories.
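For what it's worth, on the CPU side unstable_check is passed as a keyword to solve with the signature (dt, u, p, t); the default divergence check only catches non-finite values, but a custom predicate can abort earlier. The threshold below is made up for illustration:

```julia
using OrdinaryDiffEq

# Simple exponential-growth problem that will blow up quickly.
prob = ODEProblem((u, p, t) -> p * u, 1.0, (0.0, 10.0), 2.0)

# Abort the trajectory as soon as the state exceeds 1e6 in magnitude
# (illustrative threshold; the default check only catches NaN/Inf).
sol = solve(prob, Rodas5P();
            unstable_check = (dt, u, p, t) -> abs(u) > 1e6)
```

This still requires knowing a plausibility bound in advance, which is exactly the limitation mentioned above.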

Is there a way to get similar failure handling on the GPU, so that failed trajectories can be skipped without crashing the whole batch?

Heterogeneous Trajectories and GPU Performance
Since the CPU handles failures more gracefully, I decided to continue using it. However, I also wanted to test how fast the GPU version would be if I ran it only on parameter sets for which no errors are thrown. So I ran the simulations on the CPU and kept only the parameter sets of the successful solves, using SciMLBase.successful_retcode(sol.retcode).
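The filtering step, roughly (assuming ps is the vector of parameter sets and sols the CPU ensemble solution, solved in the same order):

```julia
using SciMLBase

# Keep only the parameter sets whose CPU solve succeeded.
ok = [SciMLBase.successful_retcode(sol.retcode) for sol in sols.u]
good_ps = ps[ok]
```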

I then ran simulations on these filtered parameter sets on both the CPU and the GPU. In this case, the CPU turned out to be faster than the GPU. I suspect this is due to the heterogeneity of the trajectories: some take significantly longer than others. On a GPU, due to its SIMT (Single Instruction, Multiple Threads) architecture, the slowest trajectory in a batch can bottleneck the entire group. On the CPU, threads continue independently, so slow trajectories do not delay the others.

One idea I can think of is to estimate the runtime of each trajectory (as a heuristic) using a shorter tspan. I could then group trajectories with similar runtimes together, so that no single long-running simulation holds back an entire GPU batch. However, I am unsure whether it’s worth doing.
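The grouping heuristic could look something like this (an untested sketch; make_prob(p), the batch count, and the short tspan are placeholders, and the first timing includes compilation, so a warm-up solve would be needed in practice):

```julia
using OrdinaryDiffEq

# Estimate the cost of each parameter set with a short tspan, then sort
# so that similarly expensive trajectories land in the same GPU batch.
function batch_by_cost(ps, make_prob; nbatches = 4, short_t = 0.1f0)
    cost(p) = begin
        prob = remake(make_prob(p); tspan = (0.0f0, short_t))
        @elapsed solve(prob, Rodas5P())
    end
    order = sortperm(cost.(ps))
    # Split the sorted indices into contiguous batches of similar cost.
    chunks = Iterators.partition(order, cld(length(ps), nbatches))
    return [ps[collect(c)] for c in chunks]
end
```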

Is this a problem that may be more suited to CPU execution than GPU?

It’s not possible to get similar failure handling on the GPU because of the way GPUs work. What it’s supposed to do, though, is give NaN results on diverged trajectories, but I guess that isn’t working, so open up an issue and we can get that fixed. That would be the skipping approach, and yes, it’s supposed to be doing that.

They all have to execute the same number of instructions as the longest solve, so trajectories that finish early effectively “fake solve” to keep going, because that’s required by the GPU architecture. This is fine in most cases where there is only minor variation (the overall speedup can overcome it), but if you have an outlier trajectory that runs much longer, then that will likely not do well with this kind of GPU parallelism.

Thanks for responding Chris!

I can’t open an issue for this yet because I can’t share the data. If I find the time, I’ll create a reproducible example with different data to report the problem.

I came across this post (Won’t let me post a link): /discourse.julialang.org/t/is-it-possible-to-unsynch-the-ensemblegpuarrays/77879

And it got me thinking: could there be a way to unsync a trajectory and potentially offload it to the CPU if it throttles the rest of the trajectories in the warp? I’m not exactly sure how to implement it, but perhaps by setting a maxiters cap.

I think I have a potentially hacky workaround if there isn’t a built-in way. I’d just need to somehow identify the trajectories that need further processing (on the CPU I could use the MaxIters retcode), but I don’t know what the GPU would return for a trajectory whose maxiters isn’t sufficient. Maybe even set a maximum wall-clock time.

Yes, that is something we could automate. And run it at higher precision, because if it’s running long it’s likely stiffness-bound, and precision could be an issue.

Yes, basically just set a lower maxiters, let it cut off, filter the retcodes, and kick off the next batch of trajectories while asynchronously kicking off the failed ones on the CPU.
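A rough sketch of that flow, under the assumption that the GPU pass reports per-trajectory retcodes as discussed above (eprob, ps, and the maxiters value are illustrative):

```julia
using OrdinaryDiffEq, DiffEqGPU, SciMLBase, CUDA

# First pass on the GPU with a deliberately low maxiters cap.
gpu_sols = solve(eprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                 trajectories = length(ps), dt = 0.01f0, maxiters = 1_000)

# Trajectories that hit the cap (or otherwise failed) get retried on the
# CPU, in Float64, while the next GPU batch could already be launched.
failed = findall(s -> !SciMLBase.successful_retcode(s.retcode), gpu_sols.u)
cpu_task = Threads.@spawn map(failed) do i
    prob64 = remake(eprob.prob;
                    p = Float64.(ps[i]),
                    u0 = Float64.(eprob.prob.u0),
                    tspan = Float64.(eprob.prob.tspan))
    solve(prob64, Rodas5P())
end
cpu_sols = fetch(cpu_task)
```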