Bug in NLSolve?

I had a very ugly bug for more than a year. Now, thanks to AI, I found the reason and could fix it.

This bug was very ugly at first, because it does not always appear, only for certain numerical values. And secondly, not on every hardware, because subnormal values are handled differently, depending on the CPU.

Root cause: NLSolversBase 7.10.0 switched from direct FiniteDiff to DifferentiationInterface for Jacobian computation. The DI backend produces subnormal values (~5.6e-305) in near-zero Jacobian columns instead of exact zeros. NLsolve’s autoscale computes column norms d[j], and when d[j] is subnormal (not zero), its guard if d[j] == 0; d[j] = 1; end doesn’t trigger. The subsequent g ./ d.^2 underflows d[j]^2 to 0.0, producing NaN that propagates through the solver.

Fix: Both KPS4.jl and KPS3.jl now provide an explicit Jacobian function (jac!) to nlsolve using manual central finite differences (matching FiniteDiff’s defaults: step = cbrt(eps)). After computing the Jacobian, entries below sqrt(floatmin) (~1.5e-154) are flushed to zero—this prevents any value whose square would underflow from reaching NLsolve’s autoscale.

To reproduce:

git clone https://github.com/OpenSourceAWE/KiteModels.jl.git
cd KiteModels.jl
git checkout bug
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

This PR demonstrates the bug: Branch to demonstrate a bug in NLSolve by ufechner7 · Pull Request #258 · OpenSourceAWE/KiteModels.jl · GitHub
And this PR provides the fix: Update NLSolve, fix NLsolve issue by ufechner7 · Pull Request #259 · OpenSourceAWE/KiteModels.jl · GitHub

The fix:

"""
    make_jac(f!, n_vars)

Create an explicit Jacobian function using central finite differences.

Works around NLSolversBase >= 7.10 producing subnormal Jacobian entries via
DifferentiationInterface, which cause NaN in NLsolve autoscale (subnormal
column norms squared underflow to zero).
"""
function make_jac(f!, n_vars)
    _F1 = zeros(SimFloat, n_vars)
    _F2 = zeros(SimFloat, n_vars)
    _xp = zeros(SimFloat, n_vars)
    _xm = zeros(SimFloat, n_vars)
    threshold = sqrt(floatmin(SimFloat))
    function jac!(J, x)
        h_factor = cbrt(eps(SimFloat))
        for j in 1:n_vars
            copyto!(_xp, x)
            copyto!(_xm, x)
            h = max(abs(x[j]), one(SimFloat)) * h_factor
            _xp[j] += h
            _xm[j] -= h
            f!(_F1, _xp)
            f!(_F2, _xm)
            @views J[:, j] .= (_F1 .- _F2) ./ (2h)
        end
        # Flush near-zero entries whose squares would underflow, preventing NaN in autoscale
        @. J = ifelse(abs(J) < threshold, zero(SimFloat), J)
    end
    return jac!
end

Where and how should I report this bug?

1 Like

This seems like a bug in NLsolve — such a test is not very robust. At the very least they could change the test to d[j]^2 == 0 to check for underflow: handle underflow of norm(J[:,i])^2 by stevengj · Pull Request #296 · JuliaNLSolvers/NLsolve.jl · GitHub

1 Like

Can you provide an MWE for this behavior? I might have missed a setting in the FiniteDiff bindings for DI if they don’t behave identically

That is difficult. Most of the time, it doesn’t happen.