Hi,
I have a general question about the suitability of using GPUs for the kind of problem I’m working on.
So far, I’ve had good success using `DiffEqGPU` for parameter sweeps, specifically in cases where the parameters don’t lead to errors and all trajectories require approximately the same number of steps to solve. In these scenarios, the GPU version (using `GPURodas5P` with `EnsembleGPUKernel`) is significantly faster than the CPU version (`Rodas5P` with `EnsembleThreads`). All runs were performed using Float32.
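For concreteness, here is a minimal sketch of the kind of setup I mean. The Lorenz system is just a stand-in right-hand side (my actual system and parameter sweep are different), and the solver/backend arguments follow the usual DiffEqGPU pattern as I understand it:

```julia
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays

# Stand-in ODE; my real system is different.
function lorenz(u, p, t)
    σ, ρ, β = p
    return SVector{3}(σ * (u[2] - u[1]),
                      u[1] * (ρ - u[3]) - u[2],
                      u[1] * u[2] - β * u[3])
end

u0    = @SVector Float32[1.0, 0.0, 0.0]
tspan = (0.0f0, 10.0f0)
p0    = @SVector Float32[10.0, 28.0, 8 / 3]
prob  = ODEProblem{false}(lorenz, u0, tspan, p0)

# One parameter set per trajectory (here just varying ρ for illustration).
prob_func = (prob, i, repeat) -> remake(prob, p = @SVector Float32[10.0, 20.0 + 0.01f0 * i, 8 / 3])
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

# GPU: kernel-per-trajectory ensemble with a GPU Rosenbrock solver.
sol_gpu = solve(monteprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories = 10_000, adaptive = true, dt = 0.01f0)

# CPU: multithreaded ensemble with the standard Rodas5P.
sol_cpu = solve(monteprob, Rodas5P(), EnsembleThreads(); trajectories = 10_000)
```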
Handling Trajectory Failures on the GPU
I’m now working on a different problem where some parameter sets cause the solver to fail. When using the GPU, if even a single trajectory fails, the entire job terminates. On the CPU, however, failed trajectories are aborted, and the rest continue running.
I’m not sure if `unstable_check` can help in the GPU case, but even if it could, it would require me to know in advance what conditions cause failures. The CPU version handles this automatically by aborting failed trajectories.
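What I imagine is something along these lines (purely a sketch: the NaN/magnitude condition is a guess on my part, and I don’t know whether `EnsembleGPUKernel` forwards this keyword at all):

```julia
# Hypothetical: flag a trajectory as unstable when the state goes NaN or blows up.
# The (dt, u, p, t) signature matches the OrdinaryDiffEq `unstable_check` keyword;
# whether the GPU kernel solvers honor it is exactly what I'm unsure about.
my_unstable_check(dt, u, p, t) = any(x -> isnan(x) || abs(x) > 1f6, u)

sol_gpu = solve(monteprob, GPURodas5P(), EnsembleGPUKernel(CUDA.CUDABackend());
                trajectories = 10_000, unstable_check = my_unstable_check)
```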
Is there a way to get similar failure handling on the GPU, so that failed trajectories can be skipped without crashing the whole batch?
Heterogeneous Trajectories and GPU Performance
Since the CPU handles failures more gracefully, I decided to continue using it. However, I also wanted to test how fast the GPU version would be if I only ran it on parameter sets for which no errors are thrown. So I ran the simulations on the CPU and kept only the parameter sets whose solves succeeded, checked with `SciMLBase.successful_retcode(sol.retcode)`.
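The filtering step looks roughly like this (`param_sets` is a placeholder for my own vector of parameter values):

```julia
import SciMLBase

# CPU pass over all parameter sets; keep only those whose solve succeeded.
sol_cpu = solve(EnsembleProblem(prob,
                    prob_func = (prob, i, repeat) -> remake(prob, p = param_sets[i])),
                Rodas5P(), EnsembleThreads(); trajectories = length(param_sets))

good_idx    = findall(s -> SciMLBase.successful_retcode(s.retcode), sol_cpu.u)
good_params = param_sets[good_idx]
```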
I then ran simulations on these filtered parameter sets on both the CPU and the GPU. In this case, the CPU turned out to be faster than the GPU. I suspect this is due to the heterogeneity of the trajectories: some take significantly longer than others. On a GPU, due to its SIMT (Single Instruction, Multiple Threads) architecture, the slowest trajectory in a batch can bottleneck the entire group. On the CPU, threads can continue independently, so slow trajectories do not delay the others.
One idea I can think of is to estimate the runtime of each trajectory (as a heuristic) using a shorter `tspan`. I could then group trajectories with similar runtimes together, so that no single long-running simulation holds back an entire GPU batch. However, I am unsure whether it’s worth doing.
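Roughly what I have in mind (all names, the probe horizon, and the batch size are made up; the timing pass would run on the CPU just to get a cost estimate):

```julia
# Time each surviving parameter set over a short "probe" horizon on the CPU,
# then sort by cost so each GPU batch holds trajectories of similar runtime.
short_prob = remake(prob, tspan = (0.0f0, 0.5f0))   # made-up probe horizon
costs = map(good_params) do p
    @elapsed solve(remake(short_prob, p = p), Rodas5P(); save_everystep = false)
end

order      = sortperm(costs)
batch_size = 1024                                    # arbitrary
batches    = [good_params[order[i:min(i + batch_size - 1, end)]]
              for i in 1:batch_size:length(order)]

# Each batch would then be launched as its own EnsembleGPUKernel solve.
```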
Is this a problem that may be more suited to CPU execution than GPU?