DiffEqGPU Trajectory Failure Handling and Heterogeneous Trajectories

An update on this.

Regarding the fallback to CPU for batches of trajectories that take too long to run on the GPU, I found it’s more practical to handle this at the Bash level by setting a maximum wall-clock time. If the time limit is exceeded, I switch to running the batch on the CPU instead.
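For anyone who wants to try this, here is a minimal sketch of what such a wrapper could look like, not my exact script. It assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit. The function name `run_with_fallback` and the Julia script names in the usage comment are hypothetical placeholders. This sketch also falls back to the CPU when the GPU run exits with an error, which is an assumption you may or may not want.

```shell
#!/usr/bin/env bash
# Sketch: run a command under a wall-clock limit, and run a fallback command
# if it times out or fails.
# Usage: run_with_fallback LIMIT GPU_CMD CPU_CMD
#   LIMIT   - duration understood by GNU timeout, e.g. 600 or 10m
#   GPU_CMD - command string for the GPU run (hypothetical, supplied by you)
#   CPU_CMD - command string for the CPU run (hypothetical, supplied by you)
run_with_fallback() {
    local limit="$1" gpu_cmd="$2" cpu_cmd="$3"
    local status=0

    # GNU timeout returns 124 if the command was still running at the limit.
    timeout "$limit" bash -c "$gpu_cmd" || status=$?

    if [ "$status" -eq 0 ]; then
        return 0
    fi

    if [ "$status" -eq 124 ]; then
        echo "GPU run exceeded ${limit}; retrying on CPU" >&2
    else
        # Assumption: a crashed GPU run is also retried on the CPU.
        echo "GPU run exited with status ${status}; retrying on CPU" >&2
    fi

    bash -c "$cpu_cmd"
}

# Example call with hypothetical script names:
# run_with_fallback 10m "julia solve_batch_gpu.jl 7" "julia solve_batch_cpu.jl 7"
```

The limit itself has to be tuned per problem: too short and healthy batches get thrown away, too long and the GPU sits on a stuck batch.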

As for the failure, I initially suspected a divide-by-zero error, but I added checks to rule that out. It now seems more likely to be an overflow issue, especially since I get the same error when I multiply dx by a very large value in the equation. I’m not entirely sure what causes the error, but using the CPU fallback has been an effective workaround.

Just thought I’d share this as a practical solution when working with EnsembleGPUKernel. There are definitely cases where the GPU is significantly faster than the CPU, but also cases where the CPU performs better.

That said, this fallback strategy can actually outperform relying solely on either the GPU or the CPU, especially when the probability of encountering problematic trajectories is low. The GPU bursts through the easy-to-solve trajectories, while the problematic ones are offloaded to the CPU.
