InfiniteOpt.jl might be a convenient choice for you. I used it for a parameter estimation problem of about the size you are mentioning. Case study 3 (section 5.3) in this paper describes the problem and the implementation.
Learn more about InfiniteOpt.jl here. If you need any help implementing your model, feel free to reach out on the InfiniteOpt.jl discussions forum.
@mzhenirovskyy describes a “10–20” dimensional optimization problem, which is not only finite but relatively small. Why would you frame this as an infinite-dimensional optimization problem for InfiniteOpt?
The first thing I would do is try to calculate sensitivities (derivatives) of your loss function with respect to your 10–20 parameters by solving the adjoint DAE. DifferentialEquations.jl has tools to help with this. Once you have that, there is a huge variety of optimization algorithms you can use to minimize your loss function, but generically I would try something like L-BFGS first (assuming you have a smooth function of the parameters and simple bound/box constraints). There are a variety of implementations of this available in Julia. If you are fitting to so much data that you can only sample it in batches, then you might need a stochastic optimization algorithm like Adam.
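To make the "loss plus sensitivities" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: a toy scalar ODE rather than a DAE, an explicit Euler discretization, and made-up names (`forward`, `loss_and_gradient`, `p_true`). It uses a discrete adjoint (a reverse sweep over the stored Euler steps) and is not what DifferentialEquations.jl does internally.

```python
# Toy model (assumed for illustration): du/dt = -p[0]*u + p[1], u(0) = u0,
# discretized with explicit Euler: u[n+1] = u[n] + h*(-p[0]*u[n] + p[1]).
# Loss: sum over n = 1..N of (u[n] - data[n])**2.

def forward(p, u0, h, nsteps):
    """Forward solve: return the list of states u[0..nsteps]."""
    u = [u0]
    for _ in range(nsteps):
        u.append(u[-1] + h * (-p[0] * u[-1] + p[1]))
    return u

def loss(p, data, u0, h):
    """Sum-of-squares misfit between the trajectory and the data."""
    u = forward(p, u0, h, len(data) - 1)
    return sum((un - dn) ** 2 for un, dn in zip(u[1:], data[1:]))

def loss_and_gradient(p, data, u0, h):
    """Loss and its gradient w.r.t. p via one forward and one reverse sweep."""
    nsteps = len(data) - 1
    u = forward(p, u0, h, nsteps)
    value = sum((un - dn) ** 2 for un, dn in zip(u[1:], data[1:]))
    lam = 0.0            # adjoint variable: total derivative of the loss w.r.t. u[n]
    grad = [0.0, 0.0]
    for n in range(nsteps, 0, -1):
        lam += 2.0 * (u[n] - data[n])     # direct dependence of the loss on u[n]
        grad[0] += lam * (-h * u[n - 1])  # partial of u[n] w.r.t. p[0]
        grad[1] += lam * h                # partial of u[n] w.r.t. p[1]
        lam *= 1.0 - h * p[0]             # chain rule back to u[n-1]
    return value, grad

# Synthetic data from hypothetical "true" parameters, then a gradient at a guess.
u0, h, nsteps = 2.0, 0.05, 100
p_true = [1.5, 0.7]
data = forward(p_true, u0, h, nsteps)
p_guess = [1.0, 0.2]
value, grad = loss_and_gradient(p_guess, data, u0, h)
```

A function returning the pair `(value, grad)` is the shape most gradient-based optimizers want. For example, SciPy's `scipy.optimize.minimize` accepts it with `jac=True` and `method="L-BFGS-B"`, which also handles box constraints through `bounds`.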
10–20 parameters is small enough that you could even use derivative-free optimization if your objective evaluation is quick enough (you probably lose a factor of 10–100 in speed compared to a gradient-based optimization method).
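For a sense of what derivative-free optimization looks like, here is a minimal sketch of compass (coordinate pattern) search in plain Python. It is just one simple member of that family, chosen because it fits in a few lines, not a recommendation over established implementations. The quadratic `toy_loss` is a hypothetical stand-in for a real objective, which would involve a DAE solve per evaluation.

```python
def compass_search(f, x0, step=0.5, tol=1e-8, max_evals=100_000):
    """Minimize f using function values only (no derivatives).

    Try +/- step along each coordinate, accept any improvement, and
    halve the step whenever a full sweep fails to improve.
    """
    x = list(x0)
    fx = f(x)
    evals = 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                ft = f(trial)
                evals += 1
                if ft < fx:
                    x, fx = trial, ft
                    improved = True
                    break  # keep this move and go on to the next coordinate
        if not improved:
            step *= 0.5
    return x, fx, evals

# Hypothetical stand-in objective with 10 parameters (assumed for illustration).
target = [0.1 * (i + 1) for i in range(10)]

def toy_loss(x):
    return sum((i + 1) * (xi - ti) ** 2 for i, (xi, ti) in enumerate(zip(x, target)))

x_best, f_best, n_evals = compass_search(toy_loss, [0.0] * 10)
```

The value `n_evals` counts objective evaluations, which is the number to watch: each one would be a full forward solve in the real problem.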
If you want to scale well with the number of parameters, then you need to compute gradients (sensitivities) by an adjoint method (called “backpropagation” by the neural-network folks). That way, you can (locally) optimize over essentially any number of parameters, and the cost per optimization step is essentially that of solving your forward problem (your DAE) twice. It’s not unusual to do (local) optimization over millions of parameters in this way.
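A minimal sketch of why the adjoint cost does not grow with the number of parameters, in plain Python. The setup is an assumption for illustration: a toy ODE du/dt = -a*u + p(t) stepped with explicit Euler, with one forcing parameter per time step (1000 in total) and a loss on the final state only. All names are made up. One forward sweep plus one reverse sweep yields all 1000 gradient components, whereas finite differences would need on the order of 1000 extra forward solves.

```python
import math

def solve(p, a=1.0, u0=0.0, h=0.01):
    """Forward solve: u[n+1] = u[n] + h*(-a*u[n] + p[n]), one parameter per step."""
    u = [u0]
    for pn in p:
        u.append(u[-1] + h * (-a * u[-1] + pn))
    return u

def loss(p, target=1.0):
    """Squared misfit of the final state against a target value."""
    return (solve(p)[-1] - target) ** 2

def adjoint_gradient(p, target=1.0, a=1.0, h=0.01):
    """All len(p) gradient components from one forward and one reverse sweep."""
    u = solve(p, a=a, h=h)
    lam = 2.0 * (u[-1] - target)   # derivative of the loss w.r.t. the final state
    grad = [0.0] * len(p)
    for n in range(len(p) - 1, -1, -1):
        grad[n] = lam * h          # u[n+1] depends on p[n] with partial derivative h
        lam *= 1.0 - a * h         # propagate the adjoint from u[n+1] back to u[n]
    return grad

# 1000 parameters; the reverse sweep costs about as much as the forward solve.
p = [math.sin(0.01 * n) for n in range(1000)]
grad = adjoint_gradient(p)
```

The reverse sweep here is the discrete analogue of integrating the adjoint equation backward in time, which is where the "forward problem twice" cost estimate comes from.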
If you want to scale well with your DAE system size, that’s usually about exploiting nice properties of your system, e.g. sparsity of some sort. If you can make your “forward” solve fast, then the sensitivity analysis and the optimization will be fast as well. But others here have more expertise in fast DAE solvers than me.
(Neither of these things is specific to Julia, of course.)
It uses the adjoints that @stevengj describes by default. Exquisite amounts of detail can be found on this page, though most folks shouldn’t need it.
If you want the details on how to scale to very large systems, this tutorial covers how to choose solvers and preconditioners for handling large PDE solves.
The end of that tutorial shows how to solve a 2000 ODE system in 87 ms.
DiffEqParamEstim is just a simplified parameter estimation system. DiffEqFlux uses GalacticOptim internally, and soon it will do so more explicitly. The core is really just that differentiating DifferentialEquations.jl solves uses adjoints, and you can use that with any optimization package.