I still don’t see how they clash, so let me clarify. I am not saying that we should follow plain gradient descent. What we can do here is make the flow follow the Newton flow as closely as possible, or exactly. Then step = 1 is not important, since ODE solvers can take large steps on linear problems. However, there are also nonlinear problems where you traditionally follow the Newton flow only locally and usually cannot look further ahead or take large steps; it may help in those cases too. When I made up the example in the OP, I was not saying that du has to be the gradient-descent direction, or that the problem has to be linear or quadratic; it was just an example. We can still incorporate ideas from the state-of-the-art existing optimizers here for what du can be.
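Just to make the idea concrete, here is a minimal sketch of what I mean by "following the Newton flow with an ODE solver instead of unit steps." Everything here (the quadratic test problem, SciPy's `solve_ivp`, the tolerances) is just a placeholder assumption for illustration, not the actual setup from the OP:

```python
# Rough sketch: integrate the Newton flow du/dt = -H(u)^{-1} grad f(u)
# with an adaptive ODE solver. For a quadratic objective this ODE is
# linear, so the solver can take large steps toward the minimizer
# without worrying about a "step size = 1" convention.
import numpy as np
from scipy.integrate import solve_ivp

# Quadratic test problem: f(u) = 0.5 * u^T A u - b^T u  (placeholder)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(u):
    return A @ u - b

def hess(u):
    return A  # constant Hessian for a quadratic

def newton_flow(t, u):
    # du here is the Newton direction, but it could just as well come
    # from any other optimizer's update rule.
    return -np.linalg.solve(hess(u), grad(u))

u0 = np.array([5.0, -3.0])
sol = solve_ivp(newton_flow, (0.0, 10.0), u0, rtol=1e-8, atol=1e-10)

print("solver steps taken:", len(sol.t) - 1)
print("final iterate:     ", sol.y[:, -1])
print("exact minimizer:   ", np.linalg.solve(A, b))
```

The point is only that `newton_flow` could return whatever du we like (Newton, gradient, or something borrowed from a modern optimizer), and the ODE solver decides how far to go along that flow.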