Thanks @ccoffrin and @odow for diving deep with me into my concerns!
I think this is the first time I've seen us split a single benchmark into two very different implementations that will both continue to be maintained, but in this case it really seems required. It's an interesting but instructive case, and something we may want to consider more down the line.
To fully elaborate on what @ccoffrin is saying, rosetta-opf is really focused on a "naive real user," so to speak: someone who just takes a model definition from a file and slaps it into a loss function. What I was setting up with the SciMLBenchmarks is a comparative solver benchmark, really trying to focus on how the different algorithms perform and scale, in order to prioritize the issues to solve to improve general performance.
Understanding the Goals of the Two Benchmarks
At face value, it might look like those are the same thing, but there are some nuances. One major nuance is the data read. In the rosetta-opf benchmarks, the data is read into dictionaries, and those dictionaries are used to construct the loss function. In the SciMLBenchmarks, those dictionaries are instead converted into a type-inferred `DataRepresentation` template so that the data can be used in a type-stable way.
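To make that concrete, here is a minimal sketch of what "lowering the dictionaries into a type-inferred template" means; the names `BusData`, `DataRepresentation`, and `to_typed` are illustrative, not the actual benchmark code:

```julia
# Hypothetical sketch: the raw parse gives untyped dictionaries, and we lower
# them into a concrete, fully parameterized struct so that every field access
# inside the loss function is type-stable.

struct BusData{T}
    vmin::T
    vmax::T
end

struct DataRepresentation{T,B<:AbstractVector{BusData{T}}}
    baseMVA::T
    buses::B
end

function to_typed(data::Dict{String,Any})
    buses = [BusData(Float64(b["vmin"]), Float64(b["vmax"])) for (_, b) in data["bus"]]
    return DataRepresentation(Float64(data["baseMVA"]), buses)
end
```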
The key here is that this difference only changes the timing of a subset of the methods, and when it does, it changes it in a very nonlinear way. It does not change the timing of JuMP or direct ModelingToolkit, because symbolic interfaces do not directly use the cost function that the user writes down: they are declarative, use that definition to obtain a mathematical description, and build their own type-stable function that embeds the numerical data gathered through tracing as literal constants. Meanwhile, the numerical front-ends (Optimization.jl, NLPModels.jl, Nonconvex.jl, Optim.jl) all directly use the function as defined, which references global, non-constant `Dict{Any,Any}` dictionaries, and thus any decent Julia programmer can look at that and see "wow, the performance is going to be pretty bad there!"
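A toy illustration (not the benchmark code) of what the numerical front-ends see:

```julia
# `data` is a non-constant global `Dict{Any,Any}`, so every lookup inside the
# objective returns `Any`, forcing boxing and dynamic dispatch on each call.
data = Dict{Any,Any}("cost" => 2.0, "load" => [1.0, 2.0, 3.0])

objective_untyped(x) = data["cost"] * sum(abs2, x .- data["load"])   # type-unstable

# The same objective closing over a concrete, constant struct is fully inferable.
struct TypedData
    cost::Float64
    load::Vector{Float64}
end
const typed_data = TypedData(2.0, [1.0, 2.0, 3.0])

objective_typed(x) = typed_data.cost * sum(abs2, x .- typed_data.load)  # type-stable

# `@code_warntype objective_untyped(rand(3))` shows the `Any`s that the symbolic
# front-ends never see, since they never call this function in the hot loop.
```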
What the SciMLBenchmarks versions thus did was move that data into a type-inferred `DataRepresentation` to put the numerical front-ends on more equal footing. But they also ran all of the benchmarks in a single session. This means that the code for the `DataRepresentation`, which is a large datatype with lots of type parameters to infer, is compiled once during the validation phase, and its compilation is not seen in the benchmark phase.
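A toy demonstration of that single-session amortization:

```julia
# The first call to a freshly defined, heavily parameterized constructor pays
# the compilation cost; every later call in the same session does not. Running
# validation and benchmarking in one session therefore keeps that cost out of
# the benchmark numbers.

struct Wrapped{A,B,C}
    a::A
    b::B
    c::C
end

build(x) = Wrapped(x, (x, 2x), [x])

@time build(1.0)   # first call: dominated by compilation/inference
@time build(2.0)   # same session: compilation already paid, near-zero time
```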
Figuring out the Differences in the Details
The question then becomes: what are you trying to measure? While the `Dict{Any,Any}` is going to pop out in any benchmark of Optimization.jl, it is completely valid to want to benchmark the experience such a user would get from a symbolic front-end vs a purely numerical front-end. It truly is the case that JuMP, ModelingToolkit, and all other symbolic interfaces serve the user better in this kind of scenario, because they ignore the exact function the user provided and build their own better function first, which means the user does not have to worry too much about the exact numerical performance details. This is a very valid concern, and it is what the rosetta-opf benchmark chooses to measure: if you have this user and they are trying to solve hard problems, what do they see on single runs?
However, that is not as useful as a comparison between solvers if you are trying to match solver details. What I mean is: if what you're trying to benchmark is how JuMP vs Optimization.jl solve differently, then you want to make the cost and constraint function definitions as equal as possible to isolate any effect of the function building, and then run the same solver setups and see where the differences show up. This is what the SciMLBenchmarks are geared towards: by forcing the type-stable representation, the two have effectively the same performance on the objective function, and any performance difference can be attributed to differences in the solver setup (elaborated on below), since both cases use the same solver (Ipopt). This is thus helpful for the further development of the algorithms.
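A hedged sketch of the "same solver, different front-end" setup; a trivial unconstrained problem is used here just to keep the example short, so it is only illustrative of the structure, not of the OPF benchmarks themselves:

```julia
# Both paths below end at Ipopt, so once the objective functions are equally
# type-stable, remaining timing differences come from the front-end's function
# building and AD setup.

using JuMP, Ipopt
using Optimization, OptimizationMOI

# JuMP: symbolic front-end builds its own function and derivatives.
model = Model(Ipopt.Optimizer)
set_silent(model)
@variable(model, x[1:2])
@objective(model, Min, (x[1] - 1)^2 + (x[2] - 2)^2)
optimize!(model)

# Optimization.jl: numerical front-end uses the user's function directly,
# with an AD backend chosen by the user.
f(x, p) = (x[1] - 1)^2 + (x[2] - 2)^2
optf = OptimizationFunction(f, Optimization.AutoForwardDiff())
prob = OptimizationProblem(optf, zeros(2))
sol = solve(prob, Ipopt.Optimizer())
```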
Now you might think you can take the SciMLBenchmarks version and stick it into rosetta-opf, but as you see in the `Optimization-SciMLBM` column, it's effectively the worst of both worlds. The reason is that if you are doing individual runs, reading into the `DataRepresentation` requires inference to work a lot harder, and so it takes about 3-5 seconds for the data reader to compile. SciMLBenchmarks again does not have this factor because all runs are done in the same session, so the specialized data reader's compilation is factored out. In theory this could be eliminated by changing PowerModels.jl to use a typed template in its read, i.e. always generating the `DataRepresentation`, and precompiling a lot of functions with respect to it. But in practice, this is not any better for JuMP users, or any symbolic front-end for that matter, so many users won't do it. In fact, it might even be strictly worse for JuMP users, because it would spend more time compiling and inferring objects that JuMP then ignores: the function used to construct the cost does not need to be fast given that it's then transformed symbolically. So what you see in the `Optimization-SciMLBM` column is that most of the time is effectively spent compiling that `DataRepresentation` on each run, which, given the goals of rosetta-opf, matches what the user would see.
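For reference, a hypothetical sketch of the "typed template + precompilation" idea mentioned above (this is not something PowerModels.jl actually does, and as noted it wouldn't help the symbolic front-ends):

```julia
# Fix the typed representation and run a tiny representative parse at
# precompile time, so the heavy inference is paid once at package
# precompilation instead of on every individual run.

module HypotheticalTypedReader

using PrecompileTools

struct DataRepresentation{T}
    baseMVA::T
    bus_vmin::Vector{T}
end

parse_case(d::Dict{String,Any}) =
    DataRepresentation(Float64(d["baseMVA"]),
                       Float64[b["vmin"] for (_, b) in d["bus"]])

@compile_workload begin
    parse_case(Dict{String,Any}(
        "baseMVA" => 100.0,
        "bus" => Dict{String,Any}("1" => Dict{String,Any}("vmin" => 0.9)),
    ))
end

end # module
```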
What’s next for the Benchmarks?
So where do we go from here? This is where the discussion landed on simply splitting the two along the lines of the two different goals being pursued. The SciML crew is building out more of a pure ModelingToolkit symbolic front-end, analogous to JuMP, and we'll be happy to see a benchmark really focused on symbolic front-ends, so we'll contribute that over to rosetta-opf. It will be interesting to see at what point ModelingToolkit outperforms the naive Optimization.jl, and then what the differences from JuMP are. Meanwhile, we want to keep the SciMLBenchmarks focused on the Optimization.jl vs JuMP numerical differences.
And what have we learned about the packages?
I think this is the most important piece. The current main contributors to the performance differences are very easy to identify:
- In the `Optimization-SciMLBM` column, it's the compilation of the `DataRepresentation`.
- For Optimization.jl on the smaller problems, it gets very close to JuMP in terms of performance, but there seems to be a performance difference between tape-compiled ReverseDiff.jl and the specialized reverse mode in JuMP. I think this difference gets alleviated by some Enzyme improvements, so hopefully in the near future the SciMLBenchmarks will show Optimization.jl vs JuMP being pretty close to 1:1 on the smaller problems once that is cleared up.
- On the larger problems, the scaling of Optimization.jl, NLPModels, and Nonconvex is pretty much dominated by the cost of sparsity detection. Detecting sparsity from Julia code all goes through the Symbolics.jl version, which has worse performance than we'd like right now, allocating a bit too much and becoming the dominant factor. Using sparsity is such a huge improvement in these benchmarks that turning it off definitely hurts the scaling, but at a certain point the auto-sparsity itself is 99% of the time. @odow got us a good benchmark case which highlights the `hessian_sparsity` performance (Add OptimizationFrameworks/clnlbeam.jmd by odow · Pull Request #836 · SciML/SciMLBenchmarks.jl), and we'll dive into that; see the sketch after this list. There are probably also some differences between the ReverseDiff compiled tape and Enzyme here too, but that won't matter until the sparsity detection is handled.
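A minimal sketch of the step in question: the sparsity pattern is detected by tracing the Julia code through Symbolics.jl. The pattern below is tiny; on the large OPF cases this detection step itself becomes the dominant cost.

```julia
using Symbolics

@variables x[1:4]
xs = collect(x)

# A toy objective with only local coupling between neighboring variables.
obj = sum(xs[i]^2 * xs[i + 1] for i in 1:3)

# Boolean sparsity pattern of the Hessian (a sparse Bool matrix).
H_pattern = Symbolics.hessian_sparsity(obj, xs)
```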
So, setting aside compile time and type-stability of the function, the main issue for getting other frameworks up to JuMP really comes down to the fact that JuMP has its own specialized AD that does really well on these functions plus sparsity, while the general AD tooling is catching up but needs a better sparsity detection algorithm to really get there. We'll keep banging away at that, and it can be tracked in the SciMLBenchmarks, but it won't help the rosetta-opf benchmarks all that much: other than improving the scaling, you'll still see a large difference due to the type instability. So on rosetta-opf we don't expect to see improvements from the Optimization.jl work until that part has a ModelingToolkit.jl front-end to get type-stable function builds like JuMP.
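On the Optimization.jl side, the knob these improvements flow through is simply the AD backend passed to `OptimizationFunction`; a sketch of the swap (whether Enzyme fully closes the remaining gap is the open question, not a claim here, and actually solving would additionally need the ReverseDiff/Enzyme packages loaded):

```julia
using Optimization

rosenbrock(x, p) = (p[1] - x[1])^2 + p[2] * (x[2] - x[1]^2)^2

# Current setup: tape-compiled ReverseDiff.
optf_rd = OptimizationFunction(rosenbrock, Optimization.AutoReverseDiff(compile = true))

# Prospective setup: Enzyme, a drop-in swap at the API level.
optf_ez = OptimizationFunction(rosenbrock, Optimization.AutoEnzyme())

prob = OptimizationProblem(optf_rd, zeros(2), [1.0, 100.0])
```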
Again, @ccoffrin and @odow, thanks for taking the time to dive deep into this.
tl;dr: the numbers are very different, but they also mean completely different things, and we'll make sure to adequately document what rosetta-opf and SciMLBenchmarks are doing differently.