DiffEqGPU Trajectory Failure Handling and Heterogeneous Trajectories

Hi,

I have a general question about the suitability of using GPUs for the kind of problem I’m working on.

So far, I’ve had good success using DiffEqGPU for parameter sweeps, specifically in cases where the parameters don’t lead to errors and all trajectories require approximately the same number of steps to solve. In these scenarios, the GPU version (using GPURodas5P with EnsembleGPUKernel) is significantly faster than the CPU version (Rodas5P with EnsembleThreads). All runs were performed using Float32.
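For reference, a minimal sketch of the kind of setup being compared (the toy ODE, parameter sampling, and trajectory count are all illustrative, not the actual problem):

```julia
using OrdinaryDiffEq, DiffEqGPU, StaticArrays, CUDA

# Toy out-of-place ODE in Float32; `p` is swept across trajectories.
function f(u, p, t)
    du1 = p[1] * (u[2] - u[1])
    du2 = u[1] * (p[2] - u[2])
    return SVector{2}(du1, du2)
end

u0 = @SVector Float32[1.0, 0.0]
tspan = (0.0f0, 10.0f0)
p = @SVector Float32[10.0, 28.0]
prob = ODEProblem{false}(f, u0, tspan, p)

# Each trajectory gets its own (here random) parameter set.
prob_func = (prob, i, repeat) -> remake(prob; p = @SVector rand(Float32, 2))
eprob = EnsembleProblem(prob; prob_func, safetycopy = false)

# GPU: one trajectory per GPU thread.
sol_gpu = solve(eprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories = 10_000, dt = 0.01f0)

# CPU baseline: Rodas5P across CPU threads.
sol_cpu = solve(eprob, Rodas5P(), EnsembleThreads(); trajectories = 10_000)
```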

Handling Trajectory Failures on the GPU
I’m now working on a different problem where some parameter sets cause the solver to fail. When using the GPU, if even a single trajectory fails, the entire job terminates. On the CPU, however, failed trajectories are aborted, and the rest continue running.

I’m not sure if unstable_check can help in the GPU case, but even if it could, it would require me to know in advance what conditions cause failures. The CPU version handles this automatically by aborting failed trajectories.
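For what it's worth, on the CPU side unstable_check is passed as a keyword to solve with the signature (dt, u, p, t); the default divergence check only catches non-finite values, but a custom predicate can abort earlier. The threshold below is made up for illustration:

```julia
using OrdinaryDiffEq

# Simple exponential-growth problem that will blow up quickly.
prob = ODEProblem((u, p, t) -> p * u, 1.0, (0.0, 10.0), 2.0)

# Abort the trajectory as soon as the state exceeds 1e6 in magnitude
# (illustrative threshold; the default check only catches NaN/Inf).
sol = solve(prob, Rodas5P();
            unstable_check = (dt, u, p, t) -> abs(u) > 1e6)
```

This still requires knowing a plausibility bound in advance, which is exactly the limitation mentioned above.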

Is there a way to get similar failure handling on the GPU, so that failed trajectories can be skipped without crashing the whole batch?

Heterogeneous Trajectories and GPU Performance
Since the CPU handles failures more gracefully, I decided to continue using it. However, I also wanted to test how fast the GPU version would be if I ran it only on parameter sets for which no errors are thrown. So I ran the simulations on the CPU and kept only the parameter sets of the successful solves, using SciMLBase.successful_retcode(sol.retcode).
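The filtering step, roughly (assuming ps is the vector of parameter sets and sols the CPU ensemble solution, solved in the same order):

```julia
using SciMLBase

# Keep only the parameter sets whose CPU solve succeeded.
ok = [SciMLBase.successful_retcode(sol.retcode) for sol in sols.u]
good_ps = ps[ok]
```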

I then ran simulations on these filtered parameter sets on both the CPU and the GPU. In this case, the CPU turned out to be faster than the GPU. I suspect this is due to the heterogeneity of the trajectories: some take significantly longer than others. On a GPU, due to its SIMT (Single Instruction, Multiple Threads) architecture, the slowest trajectory in a batch can bottleneck the entire group. On the CPU, threads continue independently, so slow trajectories do not delay the others.

One idea I can think of is to estimate the runtime of each trajectory (as a heuristic) using a shorter tspan. I could then group trajectories with similar runtimes together, so that no single long-running simulation holds back an entire GPU batch. However, I am unsure whether it’s worth doing.
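The grouping heuristic could look something like this (an untested sketch; make_prob(p), the batch count, and the short tspan are placeholders, and the first timing includes compilation, so a warm-up solve would be needed in practice):

```julia
using OrdinaryDiffEq

# Estimate the cost of each parameter set with a short tspan, then sort
# so that similarly expensive trajectories land in the same GPU batch.
function batch_by_cost(ps, make_prob; nbatches = 4, short_t = 0.1f0)
    cost(p) = begin
        prob = remake(make_prob(p); tspan = (0.0f0, short_t))
        @elapsed solve(prob, Rodas5P())
    end
    order = sortperm(cost.(ps))
    # Split the sorted indices into contiguous batches of similar cost.
    chunks = Iterators.partition(order, cld(length(ps), nbatches))
    return [ps[collect(c)] for c in chunks]
end
```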

Is this a problem that may be more suited to CPU execution than GPU?

It’s not possible to get similar failure handling on the GPU because of the way GPUs work. What it’s supposed to do, though, is give NaN results on diverged trajectories, but I guess that isn’t working, so open up an issue and we can get that fixed. That would be the skipping approach, and yes, it’s supposed to be doing that.

They all have to execute the same number of instructions as the longest solve, so trajectories that finish early effectively “fake solve” to keep going, because that’s required by the GPU architecture. This is fine in most cases where there is only minor variation (the overall speedup can overcome it), but if you have an outlier trajectory that runs much longer, then that will likely not do well with this kind of GPU parallelism.

Thanks for responding Chris!

I can’t open an issue for this yet because I can’t share the data. If I find the time, I’ll create a reproducible example with different data to report the problem.

I came across this post (Won’t let me post a link): /discourse.julialang.org/t/is-it-possible-to-unsynch-the-ensemblegpuarrays/77879

And it got me thinking: could there be a way to unsync a trajectory and potentially offload it to the CPU if it throttles the rest of the trajectories in the warp? I’m not exactly sure how to implement it, but perhaps by setting a maxiters cap.

I think I have a potentially hacky workaround if there isn’t a built-in way. I’d just need to somehow identify the trajectories that need further processing (on the CPU I could use the MaxIters retcode), but I don’t know what the GPU would return for a trajectory whose maxiters isn’t sufficient. Maybe even set a maximum wall-clock time.

Yes, that is something we could automate. And run it at higher precision, because if it’s running long it’s likely stiffness-bound, and precision could be an issue.

Yes, basically just set a lower maxiters, let it cut off, filter the retcodes, and kick off the next batch of trajectories while asynchronously kicking off the failed ones on the CPU.
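A rough sketch of that flow, under the assumption that the GPU pass reports per-trajectory retcodes as discussed above (eprob, ps, and the maxiters value are illustrative):

```julia
using OrdinaryDiffEq, DiffEqGPU, SciMLBase, CUDA

# First pass on the GPU with a deliberately low maxiters cap.
gpu_sols = solve(eprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                 trajectories = length(ps), dt = 0.01f0, maxiters = 1_000)

# Trajectories that hit the cap (or otherwise failed) get retried on the
# CPU, in Float64, while the next GPU batch could already be launched.
failed = findall(s -> !SciMLBase.successful_retcode(s.retcode), gpu_sols.u)
cpu_task = Threads.@spawn map(failed) do i
    prob64 = remake(eprob.prob;
                    p = Float64.(ps[i]),
                    u0 = Float64.(eprob.prob.u0),
                    tspan = Float64.(eprob.prob.tspan))
    solve(prob64, Rodas5P())
end
cpu_sols = fetch(cpu_task)
```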