Bug in NLSolve

I had been chasing a very ugly bug for more than a year. Now, with the help of AI, I found the root cause and was able to fix it.

This bug was particularly nasty for two reasons: first, it does not always appear, only for certain numerical values; second, it does not appear on every machine, because subnormal values are handled differently depending on the CPU.
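For reference, whether a value is subnormal can be checked with Base's issubnormal; a minimal sketch with illustrative values only:

```julia
# Subnormals lie below floatmin(Float64) ≈ 2.2e-308; some CPUs (or builds with
# flush-to-zero / denormals-are-zero modes) treat them as zero, which is why
# bugs involving them can be hardware-dependent.
x = 5.0e-324                             # smallest positive subnormal Float64
println(issubnormal(x))                  # true
println(issubnormal(floatmin(Float64)))  # false: smallest *normal* Float64
println(x / 2)                           # 0.0: underflows past the subnormal range
```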

Root cause: NLSolversBase 7.10.0 switched from calling FiniteDiff directly to DifferentiationInterface (DI) for Jacobian computation. The DI backend produces tiny values (~5.6e-305) in near-zero Jacobian columns instead of exact zeros. NLsolve’s autoscale computes column norms d[j]; when d[j] is tiny but not exactly zero, the guard if d[j] == 0; d[j] = 1; end does not trigger. The subsequent g ./ d.^2 then underflows d[j]^2 to 0.0, producing NaN that propagates through the solver.
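The failure mode can be reproduced in a few lines at the REPL (a sketch using the magnitude quoted above; the exact value produced by DI will vary):

```julia
# A tiny but nonzero column norm passes NLsolve's `d[j] == 0` guard,
# yet its square underflows to exactly 0.0, so g ./ d.^2 blows up.
d = 5.6e-305
println(d == 0.0)    # false → the guard does not reset d[j] to 1
println(d^2)         # 0.0  → underflows (d < sqrt(floatmin) ≈ 1.5e-154)
println(1.0 / d^2)   # Inf  → propagates as Inf/NaN through the dogleg step
```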

Fix: Both KPS4.jl and KPS3.jl now provide an explicit Jacobian function (jac!) to nlsolve, using manual central finite differences that match FiniteDiff’s defaults (step = cbrt(eps)). After the Jacobian is computed, entries below sqrt(floatmin) (~1.5e-154) are flushed to zero; this prevents any value whose square would underflow from reaching NLsolve’s autoscale.

To reproduce:

git clone https://github.com/OpenSourceAWE/KiteModels.jl.git
cd KiteModels.jl
git checkout bug
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.test()'

This PR demonstrates the bug: Branch to demonstrate a bug in NLSolve by ufechner7 · Pull Request #258 · OpenSourceAWE/KiteModels.jl
And this PR provides the fix: Update NLSolve, fix NLsolve issue by ufechner7 · Pull Request #259 · OpenSourceAWE/KiteModels.jl

The fix:

"""
    make_jac(f!, n_vars)

Create an explicit Jacobian function using central finite differences.

Works around NLSolversBase >= 7.10 producing subnormal Jacobian entries via
DifferentiationInterface, which cause NaN in NLsolve autoscale (subnormal
column norms squared underflow to zero).
"""
function make_jac(f!, n_vars)
    _F1 = zeros(SimFloat, n_vars)
    _F2 = zeros(SimFloat, n_vars)
    _xp = zeros(SimFloat, n_vars)
    _xm = zeros(SimFloat, n_vars)
    threshold = sqrt(floatmin(SimFloat))
    h_factor = cbrt(eps(SimFloat))  # FiniteDiff's default step factor for central differences
    function jac!(J, x)
        for j in 1:n_vars
            copyto!(_xp, x)
            copyto!(_xm, x)
            h = max(abs(x[j]), one(SimFloat)) * h_factor
            _xp[j] += h
            _xm[j] -= h
            f!(_F1, _xp)
            f!(_F2, _xm)
            @views J[:, j] .= (_F1 .- _F2) ./ (2h)
        end
        # Flush near-zero entries whose squares would underflow, preventing NaN in autoscale
        @. J = ifelse(abs(J) < threshold, zero(SimFloat), J)
        return J
    end
    return jac!
end

Where and how should I report this bug?


This seems like a bug in NLsolve — such a test is not very robust. At the very least they could change the test to d[j]^2 == 0 to check for underflow: handle underflow of norm(J[:,i])^2 by stevengj · Pull Request #296 · JuliaNLSolvers/NLsolve.jl · GitHub
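The suggested change can be sketched in isolation (stand-in Jacobian values, not NLsolve’s actual code):

```julia
using LinearAlgebra

# Stand-in Jacobian with a tiny-but-nonzero second column, as in the bug report.
J = [1.0  1e-200;
     0.5  1e-200]
d = [norm(@view J[:, j]) for j in 1:2]

# Guard as suggested in PR #296: check d[j]^2 instead of d[j], so column
# norms whose square underflows to 0.0 are also reset to 1.
for j in eachindex(d)
    if d[j]^2 == 0
        d[j] = 1.0
    end
end

g = [1.0, 1.0] ./ d .^ 2
println(all(isfinite, g))   # true: no Inf/NaN from the underflowed column
```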


Can you provide an MWE for this behavior? I might have missed a setting in the FiniteDiff bindings for DI if they don’t behave identically

That is difficult. Most of the time, it doesn’t happen.

To be clear, the test case I shared is 100% reproducible. But the problem appears only after 40 iterations, and the model is complex, with ~60 states.

That looks like it might be reading uninitialized memory.


Perhaps this MWE will do? Note that it is AI-generated:

# SPDX-FileCopyrightText: 2025 Uwe Fechner
# SPDX-License-Identifier: MIT

# MWE: NLsolve trust-region autoscale NaN bug
#
# Bug location:
#   NLsolve/src/solvers/trust_region.jl, initial autoscale block (~line 150):
#
#       for j = 1:nn
#           cache.d[j] = norm(view(jacobian(df), :, j))
#           if cache.d[j] == zero(cache.d[j])    # ← BUG: only catches exact 0.0
#               cache.d[j] = one(cache.d[j])
#           end
#       end
#
#   Later in the dogleg step (~line 78):
#
#       g .= g ./ d.^2                            # ← NaN when d[j]^2 underflows to 0.0
#
# Root cause:
#   When a Jacobian column has very tiny entries (e.g. ~1e-305), the column norm
#   d[j] is nonzero, so the guard `d[j] == 0` does NOT trigger. But d[j]^2
#   underflows to 0.0, and the subsequent division g ./ d.^2 produces NaN.
#
#   Any value below sqrt(floatmin(Float64)) ≈ 1.49e-154 will have its square
#   underflow to a subnormal or zero. The current guard only catches exact 0.0.
#
# Context:
#   NLSolversBase 7.10.0 switched from direct FiniteDiff to DifferentiationInterface.
#   For certain real-world systems the DI backend produces tiny near-zero values
#   in Jacobian columns that should be zero, instead of exact zeros. This triggers
#   the faulty guard above.
#
# Suggested fix:
#   Replace `d[j] == zero(d[j])` with a check that also catches values whose
#   square would underflow, e.g.:
#       if d[j] <= sqrt(floatmin(eltype(cache.d)))
#           cache.d[j] = one(cache.d[j])
#       end
#
# Tested with NLsolve 4.5.1, but the bug is in all versions with this autoscale guard.

using NLsolve, LinearAlgebra

# ----- Step 1: Show the arithmetic issue -----
println("=== Floating-point underflow in d.^2 ===")
println("floatmin(Float64) = ", floatmin(Float64), "  (smallest normal Float64)")
println("sqrt(floatmin)    = ", sqrt(floatmin(Float64)))
println()
for v in [1e-100, 1e-154, 1e-155, 1e-200, 1e-305]
    sq = v^2
    println("  d = $v  →  d^2 = $sq", sq == 0.0 ? "  ← UNDERFLOWS TO ZERO" : "")
end
println()

# ----- Step 2: Trigger NaN in NLsolve -----
# A simple 2-equation system. We provide an explicit Jacobian where column 2
# has tiny entries, simulating what DifferentiationInterface produces for
# near-zero partial derivatives in real applications.

function f!(F, x)
    F[1] = x[1] - 1.0
    F[2] = x[1]^2 - 4.0
    return F
end

function j!(J, x)
    J[1, 1] = 1.0
    J[2, 1] = 2.0 * x[1]
    # Column 2: tiny values that are NOT exactly zero.
    # These are above floatmin (so they are normal Float64), but their square
    # underflows to 0.0, causing division by zero in the dogleg step.
    J[1, 2] = 1e-200
    J[2, 2] = 1e-200
    return J
end

x0 = [0.5, 0.5]
println("=== NLsolve trust_region with autoscale=true ===")
println("Starting point: ", x0)

# Verify the Jacobian column norm issue
d2 = norm([1e-200, 1e-200])
println("Column 2 norm:    d = ", d2)
println("  d == 0.0:       ", d2 == 0.0, "  (guard does NOT trigger)")
println("  d^2:            ", d2^2, "  (underflows to zero!)")
println("  1.0 / d^2:      ", 1.0 / d2^2, "  → NaN propagation")
println()

result = nlsolve(f!, j!, x0, method=:trust_region, autoscale=true)
println("Converged: ", converged(result))
println("Solution:  ", result.zero)
println("NaN in solution: ", any(isnan, result.zero))
println()

# ----- Step 3: Show exact zeros work (guard triggers correctly) -----
println("=== Same system, but with EXACT zeros in Jacobian column 2: ===")
function j_zero!(J, x)
    J[1, 1] = 1.0;  J[2, 1] = 2.0 * x[1]
    J[1, 2] = 0.0;  J[2, 2] = 0.0          # exact zeros → d[j]==0 guard triggers
    return J
end
result2 = nlsolve(f!, j_zero!, x0, method=:trust_region, autoscale=true)
println("Converged: ", converged(result2))
println("Solution:  ", result2.zero)
println("NaN: ", any(isnan, result2.zero))
println()
println("The ONLY difference is 1e-200 vs 0.0 in Jacobian column 2.")
println("With 1e-200: d[j] ≠ 0 → guard skipped → d[j]^2 underflows → NaN")
println("With 0.0:    d[j] == 0 → guard triggers → d[j] = 1 → OK")

Did you run it to check whether it produces the expected behavior on your computer?

Yes, it produces the expected result:

julia> include("mwes/mwe_nlsolve_bug.jl")
=== Floating-point underflow in d.^2 ===
floatmin(Float64) = 2.2250738585072014e-308  (smallest normal Float64)
sqrt(floatmin)    = 1.4916681462400413e-154

  d = 1.0e-100  →  d^2 = 1.0e-200
  d = 1.0e-154  →  d^2 = 1.0e-308
  d = 1.0e-155  →  d^2 = 1.0e-310
  d = 1.0e-200  →  d^2 = 0.0  ← UNDERFLOWS TO ZERO
  d = 1.0e-305  →  d^2 = 0.0  ← UNDERFLOWS TO ZERO

=== NLsolve trust_region with autoscale=true ===
Starting point: [0.5, 0.5]
Column 2 norm:    d = 1.414213562373095e-200
  d == 0.0:       false  (guard does NOT trigger)
  d^2:            0.0  (underflows to zero!)
  1.0 / d^2:      Inf  → NaN propagation

Converged: false
Solution:  [NaN, NaN]
NaN in solution: true

=== Same system, but with EXACT zeros in Jacobian column 2: ===
Converged: false
Solution:  [1.938537191316301, 0.5]
NaN: false

The ONLY difference is 1e-200 vs 0.0 in Jacobian column 2.
With 1e-200: d[j] ≠ 0 → guard skipped → d[j]^2 underflows → NaN
With 0.0:    d[j] == 0 → guard triggers → d[j] = 1 → OK

Could someone approve the workflow run of the PR?