
Some First Things

First things first. Most of the data I’m about to mention comes from

https://docs.sciml.ai/SciMLBenchmarksOutput/stable/

and we're about to do another tag on it, so the current dev version is much more up to date. This is because, as mentioned in CVODE_BDF outperforms Julia solvers for stiff system biology model? - #5 by ChrisRackauckas, we recently made a pretty major improvement to the nonlinear solver heuristics which improved a good chunk of the benchmarks.

And with this, when I talk performance I mean work-precision diagrams, so that accuracy is aligned: we know that CVODE tends to give errors much higher than its requested tolerance, so that always needs to be accounted for.
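For reference, here's roughly how such a work-precision diagram is set up with DiffEqDevTools.jl, mirroring the SciMLBenchmarks scripts. The toy problem, tolerance ranges, and run count here are illustrative, not one of the actual benchmarks:

```julia
# Sketch of a work-precision comparison (toy problem, illustrative settings).
using OrdinaryDiffEq, DiffEqDevTools, Sundials

# Toy linear test problem standing in for a real benchmark system.
prob = ODEProblem((du, u, p, t) -> (du .= -u), ones(5), (0.0, 1.0))

# High-accuracy reference solution; errors are measured against this, so
# CVODE exceeding its tolerance shows up as worse *achieved* accuracy.
test_sol = solve(prob, Vern9(); abstol = 1e-14, reltol = 1e-14)

abstols = 1.0 ./ 10.0 .^ (5:8)
reltols = 1.0 ./ 10.0 .^ (2:5)
setups = [Dict(:alg => Rodas5P()),      # Julia Rosenbrock method
          Dict(:alg => CVODE_BDF())]    # Sundials wrapper

wp = WorkPrecisionSet(prob, abstols, reltols, setups;
                      appxsol = test_sol, numruns = 10)
```

Plotting `wp` then gives the cost-vs-achieved-error curves shown in the images below.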

Back to the main show: Where we do well

So okay with that out of the way, performance. That depends on many factors. First of all, for smaller systems, around 20 equations or fewer, the Rosenbrock methods clearly outperform CVODE, so use those. In roughly the 10-200 range, a well-tuned parallel implicit extrapolation usually does best, so use that. For DAEs, the mass-matrix form almost always outperforms IDA, so use FBDF in mass-matrix form or a Rosenbrock method.
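As a rough sketch of those recommendations in OrdinaryDiffEq.jl (the demo system is illustrative, not a benchmark; the size thresholds are the ones from the text):

```julia
# Sketch: solver choice by system size, per the guidance above.
using OrdinaryDiffEq

# Small stiff-ish demo system (3 ODEs): Lorenz.
function lorenz!(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

prob = ODEProblem(lorenz!, [1.0, 0.0, 0.0], (0.0, 10.0))

# ~20 equations or fewer: Rosenbrock methods such as Rodas5P tend to win.
sol_small = solve(prob, Rodas5P())

# Larger stiff systems: FBDF is the native BDF analogue to CVODE_BDF.
sol_large = solve(prob, FBDF())

# For DAEs, the mass-matrix form is written as (M is your mass matrix):
#   f = ODEFunction(rhs!; mass_matrix = M)
#   solve(ODEProblem(f, u0, tspan), FBDF())   # or a Rosenbrock method
```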

Remaining cases

What that means is that there are two regimes where CVODE is identified as still doing well: high f call cost and "sufficiently large" systems.

When I say sufficiently large, I mean "ODEs larger than 200", but there's a bit of a catch to this. Until sometime this year we did have some hiccups with the dense linear solvers, due to the OpenBLAS defaults. Long story short: Julia v1.12 improves the BLAS threading defaults, and we also have some nice improvements to the default BLAS chooser by incorporating MKL into the mix. To show this, here's a slide from a workshop I recently gave:

You can see that the standard BLAS LUFactorization() is pretty bad in almost all cases, but how bad depends on the CPU you are using, and GenericLUFactorization() could beat it. Our Sundials wrapper always defaults to the equivalent of GenericLUFactorization() (you can set LapackDense, but that's not the default), which ironically means it's better in the cases where the BLAS defaults are just bad (such as right after the threading threshold). The changes we made improved the heuristics for these choices. I think for all CPUs we now at least always do better than the naive choice, though you can see, for example, that on AMD CPUs there is a spot around N=250 where the multithreading threshold kicks in too early, but it still beats OpenBLAS and no BLAS pretty handily, so that's fine. With those improvements, the "large enough and dense" region should now be handled well on our end.
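If you want to override the linear solver choice yourself rather than rely on the default chooser, the `linsolve` option from LinearSolve.jl does that. A hedged sketch (the toy problem and its size are made up):

```julia
# Sketch: explicitly selecting the dense linear solver via LinearSolve.jl.
using OrdinaryDiffEq, LinearSolve

# Toy 300-equation linear problem standing in for a "large dense" system.
f!(du, u, p, t) = (du .= -u)
prob = ODEProblem(f!, ones(300), (0.0, 1.0))

# Default: the heuristic chooser picks a BLAS/generic LU based on size/CPU.
sol_default = solve(prob, FBDF())

# Force BLAS LU (the LapackDense-style choice), or the pure-Julia generic LU
# that the Sundials wrapper's default is equivalent to:
sol_blas    = solve(prob, FBDF(linsolve = LUFactorization()))
sol_generic = solve(prob, FBDF(linsolve = GenericLUFactorization()))
```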

But the “sufficiently large” with sparse matrix is the one that still has something missing. You can see it here on Bruss, before the nonlinear solver change:

[work-precision diagram: Bruss, before the nonlinear solver change]

After the nonlinear solver change:

[work-precision diagram: Bruss, after the nonlinear solver change]

The difference isn’t all that large and it’s only when the tolerances are sufficiently small, but it does exist. My guess is there’s an issue with the symbolic factorization reuse since that’s almost precisely the cost there, so I may try to track that down sometime in the near future.
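For anyone who wants to poke at the sparse case themselves, here's a minimal sketch of how the sparse path is engaged in the SciML API: supplying a sparse `jac_prototype` makes the solver use a sparse LU (e.g. KLU), where symbolic factorization reuse is exactly the kind of cost in question. The toy diffusion problem and its size are made up for illustration:

```julia
# Sketch: engaging sparse LU via a sparse Jacobian prototype.
using OrdinaryDiffEq, SparseArrays, LinearSolve

const N = 200
# 1D discrete Laplacian: the Jacobian is tridiagonal, hence sparse.
function heat!(du, u, p, t)
    du[1] = u[2] - 2u[1]
    for i in 2:N-1
        du[i] = u[i-1] - 2u[i] + u[i+1]
    end
    du[N] = u[N-1] - 2u[N]
end

# Sparsity pattern of the Jacobian (values are placeholders; only the
# structure matters, and the symbolic factorization of it gets reused).
jac_proto = spdiagm(-1 => ones(N - 1), 0 => ones(N), 1 => ones(N - 1))
f = ODEFunction(heat!; jac_prototype = jac_proto)
prob = ODEProblem(f, rand(N), (0.0, 1.0))

sol = solve(prob, FBDF(linsolve = KLUFactorization()))
```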

For the high f call cost, that was something only recently isolated in Create astrochem benchmark by ChrisRackauckas · Pull Request #1046 · SciML/SciMLBenchmarks.jl · GitHub so we’ll get those merged and up soon. Since these two benchmarks are only in the CI right now, let me share the results:

[work-precision diagram: first astrochem benchmark]

[work-precision diagram: second astrochem benchmark]

The first one isn't bad, but the latter one is curiously bad for FBDF. Since CVODE and FBDF are very similar algorithms, that shouldn't happen. This seems to be something arising in chemical reaction networks with very high f call costs, where the ^ in the Hill functions turns into the dominating factor. You can see this in the profile here for the last one:

where >90% of the time is in ^ calls in the right-hand side. In that case, the benchmark is "abnormal" from an implicit ODE solver context: it's normally assumed that the solve is somewhat LU-bound, but that's clearly not the case here. I suspect this is the case with @sebapersson's examples too, and I would like help turning them into reproducible examples on SciMLBenchmarks if anyone has a second. That would be the most useful thing here.
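To make the ^ point concrete, here's a toy sketch (all rate constants are made up) of the kind of Hill-kinetics term that dominates these profiles:

```julia
# Standard Hill function: x^n / (K^n + x^n).
hill(x, K, n) = x^n / (K^n + x^n)

# A single made-up term of a reaction-network RHS:
# Hill-activated production plus linear decay.
rhs_term(u) = 1.5 * hill(u, 0.8, 4.3) - 0.1 * u

# With a non-integer exponent n, x^n lowers to exp(n * log(x)) under the
# hood, which is far more expensive than the surrounding multiply/adds.
# With dozens of such terms per RHS call, the powers end up taking the
# bulk of the time, so f evaluation, not LU, dominates the solve.
```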

What’s next?

So then, back to the original question.

There's another way to interpret this question as well, which is: why have we not focused on building a solver that covers the CVODE cases? Because indeed, we haven't really cared all that much about handling those cases, for a good reason: CVODE is a really good method, we have really good (I think you can argue the best) open-source wrappers of Sundials in a high-level language, and we have no plans to stop maintaining those wrappers. So why would we build more methods in the domain where CVODE is good? If CVODE is good, then call CVODE, and we can go work on other things. That's especially true for DFBDF vs IDA; there just isn't a strong incentive to work on it.

That said, years of poking has clearly got us pretty close. I’m sure we’ll find something with the high f cost investigations over the next few months, and @Oscar_Smith is already doing some major changes to the linear solver wrappers which could close the remaining gap in the large end there. But in general, it’s not a high priority to work on “things that have already been done” when we can just have a good wrapper.

Instead, there’s many other things to do, so the coming soon is:

  • Improved nonlinear solvers (i.e. replacing the internal nonlinear solvers of OrdinaryDiffEq with NonlinearSolve.jl; this is partially completed)
  • Some symbolic-numeric stuff… which will give massive performance improvements. More to come on this in the near future.
  • New parallel methods
  • Improved compilation and load times

which all take precedence. That said, the poke is getting close and I’m sure we’ll poke it closer over the next year as well.
