Small step size in LBFGS for neural network with complex values

Knud_Sorensen · July 5, 2026, 8:28am

I wrote my own LBFGS code for a neural network with complex values. It worked until the step size became too small to change the network because of floating-point precision. Has anyone had similar problems and knows how to solve them? I do have an Adam optimizer that happily optimizes to half the loss of the LBFGS code, but it is kinda slow, and I was hoping for faster convergence.

stevengj · July 5, 2026, 11:12am

Have you tried using an existing L-BFGS implementation? To apply an L-BFGS algorithm designed for real values to complex parameters, just map the real and imaginary parts of the parameters (and their gradients… presumably CR-calculus gradients since loss functions are non-holomorphic) to/from the complex values, i.e. mapping $\mathbb{C}^n$ to/from $\mathbb{R}^{2n}$.
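As a rough illustration of the mapping described above, here is a minimal Python sketch using only the standard library. The names `pack`, `unpack`, `toy_loss_and_grad`, and `real_objective` are hypothetical, and the quadratic toy loss merely stands in for a real network loss; it is not the original poster's code.

```python
from typing import List, Sequence, Tuple


def pack(z: Sequence[complex]) -> List[float]:
    """Map n complex parameters to 2n reals: [Re z_1..Re z_n, Im z_1..Im z_n]."""
    return [c.real for c in z] + [c.imag for c in z]


def unpack(x: Sequence[float]) -> List[complex]:
    """Inverse of pack: rebuild n complex parameters from 2n reals."""
    n = len(x) // 2
    return [complex(x[k], x[n + k]) for k in range(n)]


def toy_loss_and_grad(
    z: Sequence[complex], target: Sequence[complex]
) -> Tuple[float, List[complex]]:
    """Hypothetical stand-in for a network loss: sum_k |z_k - t_k|^2.

    Returns the loss and, for each parameter z_k = x_k + i*y_k, the complex
    number dL/dx_k + i*dL/dy_k (twice the CR derivative dL/d(conj z_k)).
    """
    residual = [a - b for a, b in zip(z, target)]
    loss = sum(abs(r) ** 2 for r in residual)
    grad = [2 * r for r in residual]
    return loss, grad


def real_objective(
    x: Sequence[float], target: Sequence[complex]
) -> Tuple[float, List[float]]:
    """What a real-valued optimizer sees: R^{2n} in, (loss, R^{2n} gradient) out."""
    loss, grad = toy_loss_and_grad(unpack(x), target)
    return loss, pack(grad)


# Example: two complex parameters become a real vector of length four.
target = [1 + 2j, -0.5j]
x0 = pack([0j, 1 + 1j])
loss0, g0 = real_objective(x0, target)
```

A function shaped like `real_objective` can be handed to any real-valued L-BFGS routine that accepts a loss and gradient over a real vector; the optimizer's result is then passed through `unpack` to recover the complex parameters.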

L-BFGS is tricky to implement properly IIRC; there are a lot of corner cases to get right. There should be no need to write a custom “complex L-BFGS” just to work with complex parameters.

The step sizes getting that small is often a symptom of inaccurate gradients causing the algorithm to backtrack to arbitrarily small values (because it can’t satisfy the Wolfe conditions).

L-BFGS is much more sensitive to accurate gradients than Adam. Have you checked your gradients? (And Adam will often happily ignore occasional discontinuities in your function or its derivative, e.g. ReLU activation functions, whereas L-BFGS really wants your function to be twice-differentiable.)
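One common way to check gradients is to compare them against central finite differences on the real and imaginary parts. The sketch below assumes the loss and gradient are already expressed over a real vector; `max_gradient_error`, `toy_loss`, `good_grad`, and `bad_grad` are hypothetical names, and the toy loss $|z|^4$ is only for illustration.

```python
from typing import Callable, List, Sequence


def max_gradient_error(
    f: Callable[[Sequence[float]], float],
    grad: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    h: float = 1e-6,
) -> float:
    """Largest scaled difference between grad(x) and central finite differences."""
    g = grad(x)
    worst = 0.0
    for k in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[k] += h
        xm[k] -= h
        fd = (f(xp) - f(xm)) / (2 * h)  # central difference along coordinate k
        scale = max(1.0, abs(fd), abs(g[k]))
        worst = max(worst, abs(fd - g[k]) / scale)
    return worst


def toy_loss(x: Sequence[float]) -> float:
    """|z|^4 for a single complex parameter z = x[0] + i*x[1]."""
    r2 = x[0] ** 2 + x[1] ** 2
    return r2 * r2


def good_grad(x: Sequence[float]) -> List[float]:
    """Correct gradient of toy_loss with respect to (Re z, Im z)."""
    r2 = x[0] ** 2 + x[1] ** 2
    return [4 * r2 * x[0], 4 * r2 * x[1]]


def bad_grad(x: Sequence[float]) -> List[float]:
    """Deliberately wrong: sign flipped on the imaginary part (a conjugation slip)."""
    g = good_grad(x)
    return [g[0], -g[1]]


point = [0.7, -1.3]
err_good = max_gradient_error(toy_loss, good_grad, point)
err_bad = max_gradient_error(toy_loss, bad_grad, point)
```

With a correct gradient the reported error should sit near floating-point noise; an error of order one, as the sign-flipped version produces, is the kind of mistake that can make a line search backtrack indefinitely.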

I’m also assuming that you are solving a deterministic optimization problem, not a stochastic one like what Adam is designed for (and L-BFGS is not).

Knud_Sorensen · July 5, 2026, 6:26pm

No, I haven’t tried existing implementations; I couldn’t really find a good example covering neural networks with complex values. I found an Optim example with a simple function with complex values, but I wondered how to expand it to a full neural net.
I found some Flux examples with full neural networks, but with no mention of complex values.

Thanks, I think that points to the root of the problem: since I use L1 regularization, there will be some discontinuities.

I will look into extending my L-BFGS code to OWL-QN.

I have tried to run the code on minibatches, but mostly I train on all the data.

Thanks for your reply.

stevengj · July 5, 2026, 6:46pm

Yes, L1 won’t work with L-BFGS, probably. But there are lots of more specialized methods to combine L1 penalties with higher-order optimization algorithms.
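As one illustration of how the non-smooth L1 term can be handled separately from the smooth part of the loss, here is a sketch of the complex soft-thresholding (proximal) operator and a single proximal-gradient step. Note that this is a plain first-order, ISTA-style update, not OWL-QN or any particular higher-order method; the function names and the assumption that the penalty is `lam * sum(|z_k|)` are hypothetical.

```python
from typing import List, Sequence


def soft_threshold(z: complex, lam: float) -> complex:
    """Proximal operator of lam*|z| for complex z.

    Shrinks the modulus by lam while keeping the phase, and returns exactly
    zero when |z| <= lam.
    """
    mag = abs(z)
    if mag <= lam:
        return 0j
    return z * (1.0 - lam / mag)


def prox_gradient_step(
    z: Sequence[complex],
    smooth_grad: Sequence[complex],
    step: float,
    lam: float,
) -> List[complex]:
    """One ISTA-style update for smooth_loss(z) + lam * sum_k |z_k|.

    smooth_grad[k] is assumed to be dL/dx_k + i*dL/dy_k of the smooth part
    only; the L1 penalty is handled by the soft-threshold, not by a gradient.
    """
    return [
        soft_threshold(zk - step * gk, step * lam)
        for zk, gk in zip(z, smooth_grad)
    ]
```

Because the penalty never enters through a gradient, the kink at zero does not have to be differentiated, and small parameters are set exactly to zero instead of oscillating around it.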

The point is that the optimization algorithm doesn’t need to care that your values are complex — just pass each complex number as the real and imaginary parts to/from the optimizers. There’s really no reason to develop a specialized optimization algorithm for complex parameters.

(In the complex-analysis sense, any nonconstant real-valued loss function is a non-holomorphic function of complex parameters, so it is only differentiable in the sense of CR calculus, which is effectively equivalent to differentiating with respect to the real and imaginary parts taken as separate real parameters. So there is no algebraic reason why an optimization algorithm should care about the complex-number structure, as far as I know.)
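
To spell out that equivalence in the usual Wirtinger (CR) notation, here is a short sketch for a single parameter $z = x + iy$ and a real-valued loss $f$. The symbols are chosen for illustration and do not come from the posts above.

$$
\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\,\frac{\partial f}{\partial y}\right),
\qquad
\frac{\partial f}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\,\frac{\partial f}{\partial y}\right)
$$

Since $f$ is real, $\partial f/\partial \bar{z}$ is the complex conjugate of $\partial f/\partial z$, so a single complex number carries both real partial derivatives:

$$
\frac{\partial f}{\partial x} = 2\,\mathrm{Re}\,\frac{\partial f}{\partial \bar{z}},
\qquad
\frac{\partial f}{\partial y} = 2\,\mathrm{Im}\,\frac{\partial f}{\partial \bar{z}}
$$

For example, $f = |z|^2 = z\bar{z}$ gives $\partial f/\partial \bar{z} = z$, which corresponds to the ordinary real gradient $(2x, 2y)$.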
