I am trying to follow the Lux tutorial on fitting an ML model using Optimization.jl, but for a hybrid model: the Lux model predicts some of the parameters of a process-based model, the Lux model call is just one step inside a larger model, and the Lux parameters are only a subset of the overall parameter vector.
Early in training, while the parameters are still far from the solution, the gradient contains NaNs for some of the minibatches.
When I write the training loop for a pure Lux model myself (as e.g. in this tutorial), I can simply skip the update for the minibatches whose gradient contains any NaN:
if any(isnan.(grads))
    # skip the parameter update for this minibatch
    println("Skipped NaN : Batch $i")
else
    # apply the update; update! returns the new optimiser state and parameters
    opt_st_new, ps = Optimisers.update!(opt_st_new, ps, grads)
end
How do I tell Optimization.jl to skip the parameter update for these minibatches when using the solve method instead of a hand-written training loop?
The following MWE demonstrates the problem without any Lux model: after the minibatch containing NaNs is encountered, all subsequent updates yield a loss of NaN and there is no convergence to the optimum.
using Optimization
using OptimizationOptimisers
using MLUtils
import Zygote

# synthetic data: entries 42:43 put NaNs into one of the minibatches
d = fill(1.0, 100)
d[42:43] .= NaN
dl = DataLoader(d, batchsize=10)

# callback that prints the loss every `moditer`-th iteration
callback_loss = (moditer) -> let iter = 1, moditer = moditer
    function (state, l)
        if iter % moditer == 1
            println("$iter, $l")
        end
        iter = iter + 1
        return false
    end
end

# simple quadratic objective; the DataLoader is passed as the data argument
optf = Optimization.OptimizationFunction((x, d) -> sum(d .* abs2.(x)),
    Optimization.AutoZygote())
optprob = OptimizationProblem(optf, [2.0], dl)
alg = AdaMax(0.9)
#alg = Adam(0.9)
res = Optimization.solve(optprob, alg, epochs=6,
    callback = callback_loss(2),
)
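For comparison, a hand-written Optimisers.jl loop on the same DataLoader dl behaves the way I would like: the NaN minibatch is simply skipped and the parameters stay finite. This is just a sketch of the workaround I would like to reproduce via solve (the function name manual_skip_nan is made up):

using Optimisers
import Zygote

function manual_skip_nan(x0, dl; epochs = 6)
    x = copy(x0)
    opt_st = Optimisers.setup(Optimisers.AdaMax(0.9), x)
    for epoch in 1:epochs, (i, batch) in enumerate(dl)
        # same quadratic objective as in the MWE above
        l, gs = Zygote.withgradient(p -> sum(batch .* abs2.(p)), x)
        g = gs[1]
        if any(isnan, g)
            println("Skipped NaN : Batch $i (epoch $epoch)")
        else
            opt_st, x = Optimisers.update!(opt_st, x, g)
        end
    end
    return x
end

manual_skip_nan([2.0], dl)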