Solver and AD advice for a nonlinear least squares problem

(This is a vague open-ended question for which I cannot provide an MWE.)

I have a nonlinear least squares problem

\min_x \| f(x) \|_2^2

where x \in \mathbb{R}^N with N \approx 20–40, and f(x) has about 100–200 elements. f should be treated as a black box by the solver. There are no constraints on x.

I have f implemented in Julia, but it still needs performance work, specifically preallocation of buffers (complicated by the fact that it is multithreaded, oh my aching head). Currently it takes about 40 s on a desktop.

I am using an implementation of a trust region method that I wrote for a smaller problem, which calculates the full Jacobian using ForwardDiff.jl as a backend. I wonder whether an algorithm that only requires Jacobian-vector products would make sense, but I am not sure which packages have a robust implementation (I could write one given a reference).

f allocates like crazy. I could pre-allocate the buffers, but since I do not know which types ForwardDiff will call it with, I find this challenging. What are the best practices for combining preallocation with ForwardDiff?

Any other AD backend I should consider?

OhMyThreads.jl has nice functionality and documentation for this, if you want to give it a try.
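For the buffers-plus-threads part, one common pattern is a task-local buffer that is created once per task and reused. A minimal sketch, assuming OhMyThreads re-exports TaskLocalValue (from TaskLocalValues.jl); the buffer shape and the workload here are made up:

using OhMyThreads: tmapreduce, TaskLocalValue

# one buffer per task, created lazily and reused across iterations
buffers = TaskLocalValue{Matrix{Float64}}(() -> zeros(4, 5))

result = tmapreduce(+, 1:100) do i
    buf = buffers[]   # task-local buffer, no allocation after the first use
    buf .= i          # stand-in for the per-recipe work
    sum(buf)
end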

Is there a reason not to use existing packages like NonlinearSolve.jl?

Also, is the Jacobian sparse? If so, you may achieve significant savings by exploiting it.

Do you mean a robust implementation of the JVP? You can always try them all with DifferentiationInterface.jl; the relevant operator is pushforward.
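For example (a minimal sketch with a toy f; the exact pushforward signature may differ between DifferentiationInterface versions):

using DifferentiationInterface, ForwardDiff

f(x) = [sum(abs2, x), prod(x)]   # toy residual function
x = rand(3)
v = ones(3)                      # seed (tangent) vector

# JVP without materializing the Jacobian; tangents are passed as a tuple
ty = pushforward(f, AutoForwardDiff(), x, (v,))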

Probably PreallocationTools.jl?
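A minimal sketch of the DiffCache pattern from that package (toy function, arbitrary buffer shape):

using PreallocationTools, ForwardDiff

cache = DiffCache(zeros(4, 5))   # stores both Float64 and Dual-sized buffers

function g(x)
    buf = get_tmp(cache, x)      # buffer with eltype matching x (Float64 or Dual)
    buf .= exp(x[1])
    sum(buf) * x[2]
end

ForwardDiff.gradient(g, [0.1, 0.2])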

Have you tried Enzyme.jl?

Yup, all of this. NonlinearSolve’s methods are what would be most robust and efficient here, and it ties into DifferentiationInterface.jl, so you should just be able to define the ADType and let it fly. The example recreating some of the PETSc examples should be a good starting point:

For nonlinear least squares, you simply need to tell it what the output sizing is. A tutorial on that is found here:
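Roughly like this (a sketch with a made-up residual; resid_prototype is how you communicate the output size):

using NonlinearSolve

# made-up residual: 3 parameters, 5 residual components
resid(x, p) = [x[1] * exp(-x[2] * t) + x[3] - 1.0 for t in 1:5]

x0 = [1.0, 0.5, 0.0]
prob = NonlinearLeastSquaresProblem(
    NonlinearFunction(resid; resid_prototype = zeros(5)), x0)
sol = solve(prob, LevenbergMarquardt())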

You can make your system completely non-allocating; see the full discussion in:

That doesn’t discuss PreallocationTools.jl, but the docs there should be relatively straightforward.

If you really need performance, you can mix in some of the ModelingToolkit structural simplification tricks, see:

Whether that is applicable depends on the type of equations you have; it will increase compilation time, but in most cases it should give a straight multiple-orders-of-magnitude performance gain if you can split into SCCs and all of the fancy stuff.

Finally, the list of solvers is in:

LevenbergMarquardt() is not done by the book; there are some changes that make it more stable than the standard form for ill-conditioned cases, so I’d probably recommend that. But of course there is also a TrustRegion method there.
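Continuing the sketch above, switching solvers is a one-argument change:

sol_lm = solve(prob, LevenbergMarquardt())   # modified, more stable LM
sol_tr = solve(prob, TrustRegion())          # trust region alternative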

These schemes are benchmarked in quite a lot of detail to make sure we have full performance and robustness; for the proof of that, see:

Probably not at this size. Making sure you fill the matrix with sparse AD is probably a big deal, but using a sparse matrix in the LU factorization or in an iterative solver is likely not a good idea for a problem this small.
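If you do want sparse AD just for filling the Jacobian, a sketch with DifferentiationInterface (assuming a recent version; the toy f is a placeholder):

using DifferentiationInterface, ForwardDiff
using SparseConnectivityTracer, SparseMatrixColorings

backend = AutoSparse(AutoForwardDiff();
    sparsity_detector = TracerSparsityDetector(),
    coloring_algorithm = GreedyColoringAlgorithm())

f(x) = [x[1]^2, x[2]^2, x[1] * x[2]]   # toy function with a sparse Jacobian
J = jacobian(f, backend, rand(2))      # detects sparsity, colors, and fills J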

I am already using that package for the threaded computation, but I have not implemented the preallocation part because I am still not sure how to do it.

Conceptually, this is how it works: the computation consists of about 100–150 “recipes” (the recipes themselves are always the same), each of which needs a number of temporary matrices (the number varies between recipes, but is constant for a given recipe) for a two-pass algorithm.

Probably the best option for me would be a Channel, but I do not know in advance how many matrices I will need.

To clarify with an MWE, consider the following:

using DifferentiationInterface, ForwardDiff
using StatsBase: sample                  # for the random recipes below
using InteractiveUtils: @code_warntype   # needed when not running in the REPL

struct MyComputation
    N::Int
    M::Int
    recipes::Vector{Int}
end

function compute_recipe(α::T, N, M, recipe) where T
    matrices = Vector{Matrix{T}}(undef, recipe)
    # “backward pass” (for MWE)
    for i in recipe:(-1):1
        if i == recipe
            # this line should grab an empty matrix from a Channel
            matrices[i] = fill(exp(α), N, M)
        else
            # ditto
            matrices[i] = matrices[i + 1] .+ α
        end
    end
    # “forward pass” (for MWE)
    mapfoldl(sum, +, matrices)
end

function (mc::MyComputation)(α)
    (; N, M, recipes) = mc
    mapreduce(r -> compute_recipe(α, N, M, r), +, recipes)
end

# runtime code

mc = MyComputation(4, 5, sample(1:9, 20, replace = true))
@code_warntype mc(0.5)

DifferentiationInterface.value_and_derivative(mc, AutoForwardDiff(; chunksize = 1), 0.5)

Now the recipes could come in any order depending on the scheduler, and I may need a variable number of matrices depending on the number of threads. Or can I add to the Channel “on demand”?
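Regarding “on demand”: nothing stops you from putting freshly allocated matrices into a Channel whenever it runs dry. A hypothetical pool sketch (acquire!/release! are made-up names; real code would need one pool per matrix size, and under ForwardDiff the element type would be a Dual rather than Float64):

# unbounded Channel used as a pool of reusable matrices
const pool = Channel{Matrix{Float64}}(Inf)

function acquire!(N, M)
    # reuse a pooled matrix if one is available, otherwise allocate on demand
    isready(pool) ? take!(pool) : Matrix{Float64}(undef, N, M)
end

release!(A) = put!(pool, A)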

If it is possible to get a trial license, Artelys KNITRO has a least squares mode that uses Gauss–Newton and has always worked very well for my problems (solving the score equations of a GP parameter estimation problem, which is very ugly but low-dimensional).

I don’t think the interface is exposed in JuMP, but it is implemented in KNITRO.jl. I have my own little wrapper demonstrating the boilerplate here.

Maybe worth a shot!

Quick edit: Maybe Bumper.jl could be useful for reducing allocations without too much pre-allocation suffering?
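A minimal sketch of that idea (toy function; the key constraint is that nothing allocated with @alloc may escape the @no_escape block):

using Bumper

function scaled_sum(x)
    @no_escape begin
        buf = @alloc(eltype(x), length(x))   # bump-allocated, freed at block end
        buf .= 2 .* x
        sum(buf)                             # scalar result is safe to return
    end
end

scaled_sum(rand(5))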