The heuristic is that repurposing X to do Y, when people who work on Y have been using Z for decades, is a bad idea unless you have a good reason to believe the contrary (here X is differential equations, Y is optimization, and Z is standard optimization methods). Basically, solving the gradient flow accurately is strictly harder than finding a local minimum, so there is no reason to expect it to beat methods that target minimization directly. More concretely, the gradient does not point very accurately towards the minimum; looking for a direction that does point towards the minimum (for a quadratic function) leads you to the Newton flow, but at that point you want to take a single full step of size 1, so the ODE viewpoint stops being useful.
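To make the quadratic case concrete, here is a minimal numpy sketch (the matrix A and the starting point are arbitrary, chosen purely for illustration): for f(x) = ½ xᵀAx with an ill-conditioned A, the negative gradient points noticeably away from the minimizer at the origin, while a single Newton step with step size 1 lands exactly on it.

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x, minimized at the origin.
# A is deliberately ill-conditioned so the gradient direction is poor.
A = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])

grad = A @ x                       # gradient flow direction (up to sign)
newton = np.linalg.solve(A, grad)  # Newton direction A^{-1} grad = x itself

# Cosine of the angle between the gradient and the true direction
# to the minimum (the ray from the origin through x):
cos_grad = grad @ x / (np.linalg.norm(grad) * np.linalg.norm(x))
print(cos_grad)        # well below 1: the gradient misses the minimizer

# One Newton step with step size 1 lands exactly on the minimizer:
print(x - newton)      # the zero vector
```

With condition number 100 the cosine is about 0.71, i.e. the gradient is roughly 45 degrees off target, and it only gets worse as the conditioning degrades; this is why following the gradient flow exactly buys you little over just taking the Newton step.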
Regarding papers, I would be surprised if there are not a thousand papers describing this for machine learning, but the only concrete examples I know of are in computational quantum physics, where it is known under the poetic name of "imaginary time". As I said, it is definitely a useful point of view for gaining insight (an example is the quantum Monte Carlo method, where you reformulate a minimization problem as a PDE and then exploit the relationship with stochastic differential equations), but not for designing concrete methods for general-purpose minimization.
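For readers unfamiliar with the term: substituting t → −iτ in the Schrödinger equation turns oscillation into decay, so evolving in "imaginary time" damps every eigencomponent except the lowest-energy one, i.e. it minimizes the Rayleigh quotient. A toy sketch (the 3×3 Hamiltonian is made up for illustration, and the integrator is just a first-order Euler step):

```python
import numpy as np

# Imaginary-time evolution d psi/d tau = -H psi: repeatedly applying
# (I - dtau * H) and renormalizing damps all but the lowest-energy
# eigencomponent, so psi converges to the ground state.
H = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])   # toy symmetric "Hamiltonian"

rng = np.random.default_rng(0)
psi = rng.standard_normal(3)      # random initial state
dtau = 0.1
for _ in range(2000):
    psi = psi - dtau * (H @ psi)  # first-order Euler step in imaginary time
    psi /= np.linalg.norm(psi)    # renormalize to keep the state unit length

energy = psi @ H @ psi            # Rayleigh quotient at convergence
print(energy, np.linalg.eigvalsh(H)[0])  # should agree
```

Note that this is literally normalized power iteration on (I − dtau·H), which is the point of the answer: once you discretize, you have rediscovered a standard minimization/eigenvalue method, and the continuous-time picture contributed the insight rather than the algorithm.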