(This is a vague open-ended question for which I cannot provide an MWE.)
I have a nonlinear least squares problem
$$\min_x \| f(x) \|_2^2$$
where $x \in \mathbb{R}^N$, $N \approx 20 \dots 40$, and $f(x)$ has about 100–200 elements. f should be treated as a black box by the solver. There are no constraints on x.
I have f implemented in Julia, but it still needs optimization, specifically preallocation of buffers (but it has multithreading, oh my aching head). Currently it takes about 40s on a desktop.
I am using an implementation of a trust region method I wrote for a smaller problem, which calculates the full Jacobian using ForwardDiff.jl as a backend. I wonder whether an algorithm that only needs Jacobian-vector products would make more sense, but I am not sure which packages have a robust implementation (I could write one given a reference).
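To make the Jacobian-vector product idea concrete, here is a minimal sketch of how J(x) v can be obtained from ForwardDiff.jl in a single forward pass, as the directional derivative of t -> f(x + t v) at t = 0. The names `f_demo` and `jvp` are hypothetical stand-ins, not part of my actual code or of any package.

```julia
using ForwardDiff

# Hypothetical stand-in for the real residual function, here R^3 -> R^4.
f_demo(x) = [x[1]^2 - x[2], sin(x[2]) * x[3], x[1] * x[3], x[2]^2 + x[3]^2 - 1]

# J(x) * v computed as d/dt f(x + t*v) at t = 0, without forming J.
jvp(f, x, v) = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)

x = [1.0, 2.0, 3.0]
v = [0.5, -1.0, 2.0]
Jv = jvp(f_demo, x, v)
```

This costs roughly one evaluation of f on dual numbers per product, versus about N / chunk size evaluations for the full Jacobian. Note that iterative least squares solvers typically also need products with the transposed Jacobian, which forward mode does not provide as cheaply.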
f allocates like crazy. I could preallocate the buffers, but I find this challenging because I do not know in advance the element types ForwardDiff calls f with. What are the best practices for preallocation combined with ForwardDiff?
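For illustration, one pattern I could imagine is a buffer store keyed on the element type, allocated lazily the first time each type shows up, so the same f works with both Float64 and ForwardDiff's dual numbers. This is only a sketch under that assumption: `LazyBuffers`, `getbuffer`, and `f_buffered` are hypothetical names, and as far as I know PreallocationTools.jl packages a similar idea (`DiffCache` and `get_tmp`).

```julia
using ForwardDiff

# Hypothetical helper: one scratch vector per element type, created on first use.
struct LazyBuffers
    n::Int
    store::Dict{DataType,Any}
end
LazyBuffers(n::Integer) = LazyBuffers(n, Dict{DataType,Any}())

# Return a Vector{T} of length n, reusing it across calls with the same T.
function getbuffer(b::LazyBuffers, ::Type{T}) where {T}
    return get!(() -> Vector{T}(undef, b.n), b.store, T)::Vector{T}
end

# Hypothetical residual that needs a scratch vector of the same length as x.
function f_buffered(x::AbstractVector{T}, bufs::LazyBuffers) where {T}
    tmp = getbuffer(bufs, T)   # Float64 buffer or Dual buffer, depending on the caller
    @. tmp = x^2 - 1
    return [sum(tmp), tmp[1] * tmp[end], tmp[2] - x[1]]
end

bufs = LazyBuffers(3)
x = [1.0, 2.0, 3.0]
J = ForwardDiff.jacobian(z -> f_buffered(z, bufs), x)  # allocates the Dual buffer once
y = f_buffered(x, bufs)                                # allocates the Float64 buffer once
```

The `Dict` is not thread-safe, so with a multithreaded f each task would need its own `LazyBuffers` (or the lookup would have to happen before the threaded region). The output vector is still allocated on every call here; only the scratch space is reused.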
Any other AD backend I should consider?