I wrote my own L-BFGS code for a neural network with complex values. It worked until the step size became too small to change the network parameters because of floating-point precision. Has anyone had similar problems and know how to solve them? I do have an Adam optimizer that happily optimizes to half the loss reached by the L-BFGS code, but it is kinda slow, and I was hoping for faster convergence.
Have you tried using an existing L-BFGS implementation? To apply an L-BFGS algorithm designed for real values to complex parameters, just map the real and imaginary parts of the parameters (and their gradients… presumably CR-calculus gradients, since loss functions are non-holomorphic) to/from the complex values, i.e. map \mathbb{C}^n to/from \mathbb{R}^{2n}.
L-BFGS is tricky to implement properly IIRC; there are a lot of corner cases to get right. There should be no need to write a custom “complex L-BFGS” just to work with complex parameters.
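Here is a minimal sketch of that \mathbb{C}^n to \mathbb{R}^{2n} mapping in plain Python. The names `pack`, `unpack`, `TARGET`, `loss`, `conj_wirtinger_grad`, and `real_objective` are all made up for illustration, and the toy loss \sum_k |z_k - t_k|^2 stands in for your network's loss. It assumes your gradient is the conjugate Wirtinger derivative \partial f / \partial \bar{z}; AD tools differ in factor and conjugation conventions, so check yours.

```python
# Hypothetical target values for the toy loss below.
TARGET = [1 + 1j, -2 + 0.5j]

def pack(z):
    """Map n complex numbers to 2n reals, laid out as [Re..., Im...]."""
    return [c.real for c in z] + [c.imag for c in z]

def unpack(x):
    """Inverse of pack: 2n reals back to n complex numbers."""
    n = len(x) // 2
    return [complex(re, im) for re, im in zip(x[:n], x[n:])]

def loss(z):
    """Toy real-valued, non-holomorphic loss: sum_k |z_k - t_k|^2."""
    return sum(abs(zk - tk) ** 2 for zk, tk in zip(z, TARGET))

def conj_wirtinger_grad(z):
    """df/d(conj z_k) for the toy loss, which is z_k - t_k."""
    return [zk - tk for zk, tk in zip(z, TARGET)]

def real_objective(x):
    """Loss and gradient with respect to the 2n real variables.

    For real-valued f: df/dRe(z) = 2 Re(df/dconj z), df/dIm(z) = 2 Im(df/dconj z).
    """
    z = unpack(x)
    g = conj_wirtinger_grad(z)
    return loss(z), pack([2 * gk for gk in g])
```

A function shaped like `real_objective` (returning the value and the real gradient together) is what a real-valued L-BFGS routine expects; with SciPy, for instance, it should be usable as `scipy.optimize.minimize(real_objective, pack(z0), jac=True, method="L-BFGS-B")`, after which `unpack(result.x)` recovers the complex parameters.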
The step sizes getting that small is often a symptom of inaccurate gradients causing the algorithm to backtrack to arbitrarily small values (because it can’t satisfy the Wolfe conditions).
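To illustrate that failure mode, here is a small sketch (hypothetical helper names, with the conventional constants c1 = 1e-4 and c2 = 0.9 assumed) that checks the two Wolfe conditions for a trial step. If the gradient is wrong, the "descent" direction may actually go uphill, and then no positive step satisfies the sufficient-decrease condition, so backtracking keeps shrinking the step.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def wolfe_ok(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for the step x + alpha * p."""
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    slope0 = dot(grad(x), p)  # directional derivative as reported by grad
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = dot(grad(x_new), p) >= c2 * slope0
    return sufficient_decrease, curvature

# Toy problem: f(x) = sum x_i^2, with a correct and a sign-flipped gradient.
def f(x):
    return sum(xi * xi for xi in x)

def good_grad(x):
    return [2 * xi for xi in x]

def bad_grad(x):
    return [-2 * xi for xi in x]  # deliberately wrong sign

x0 = [1.0, 1.0]
p_good = [-g for g in good_grad(x0)]  # true descent direction
p_bad = [-g for g in bad_grad(x0)]    # looks like descent, but goes uphill

good_result = wolfe_ok(f, good_grad, x0, p_good, 0.5)
# With the wrong gradient, sufficient decrease fails for every halved step.
bad_results = [wolfe_ok(f, bad_grad, x0, p_bad, 0.5 ** k)[0] for k in range(1, 30)]
```

In the good case both conditions hold at `alpha = 0.5`; in the bad case every entry of `bad_results` is `False`, which is the "backtrack until the step is below floating-point resolution" behavior described in the question.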
L-BFGS is much more sensitive to accurate gradients than Adam. Have you checked your gradients? (And Adam will often happily ignore occasional discontinuities in your function or its derivative, e.g. ReLU activation functions, whereas L-BFGS really wants your function to be twice-differentiable.)
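One simple way to check gradients is to compare them against central finite differences on the real (\mathbb{R}^{2n}) representation. The sketch below uses made-up names (`fd_gradient`, `relative_error`) and a toy function, with a step `h = 1e-6` as an assumed reasonable default for double precision; in practice you would pass your own loss and a handful of random parameter vectors.

```python
import math

def fd_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def relative_error(g_analytic, g_fd):
    """Norm of the difference, relative to the finite-difference gradient."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(g_analytic, g_fd)))
    den = math.sqrt(sum(b * b for b in g_fd)) or 1.0
    return num / den

# Toy function f(x) = sum x_i * sin(x_i), with a correct and a buggy gradient.
def f(x):
    return sum(xi * math.sin(xi) for xi in x)

def grad_correct(x):
    return [math.sin(xi) + xi * math.cos(xi) for xi in x]

def grad_buggy(x):
    return [xi * math.cos(xi) for xi in x]  # forgot the sin(x_i) term

x = [0.3, -1.2, 2.0]
g_fd = fd_gradient(f, x)
err_correct = relative_error(grad_correct(x), g_fd)
err_buggy = relative_error(grad_buggy(x), g_fd)
```

A correct gradient should agree with the finite-difference estimate to many digits (here `err_correct` is tiny), while a bug like the missing term shows up as an error of order one.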
I’m also assuming that you are solving a deterministic optimization problem, not a stochastic one like what Adam is designed for (and L-BFGS is not).