StackOverflowError when training a neural network

I am training a neural network using the Flux library. The network is defined as follows:

NN = Flux.Chain(
        Flux.Dense(13, 32, sigmoid),
        Flux.Dense(32, 32, sigmoid),
        Flux.Dense(32, 5),
        y->abs.(y))

This used to work: I was able to run the training for 2000 iterations. Starting today, however, I get a StackOverflowError, even though I did not change anything in the code. I have tried restarting the PC and running the code on another PC, with the same result. The error does not even come with a stack trace.

ERROR: StackOverflowError:

The package status is as follows:

Project.toml`
  [fbb218c0] BSON v0.3.5
  [4ec6fef6] Bezier v0.1.7
  [41bf760c] DiffEqSensitivity v6.79.0   
  [0c46a032] DifferentialEquations v7.1.0
  [6a86dc24] FiniteDiff v2.13.0
  [587475ba] Flux v0.13.3
  [f6369f11] ForwardDiff v0.10.30        
  [91a5bcdd] Plots v1.30.2
  [e88e6eb3] Zygote v0.6.40

Version info:

Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

As the error has no further description, it is difficult to identify which variable or operation is causing the stack overflow. Any advice on possible causes or troubleshooting methods?

Can you post the stack trace from the error? And perhaps versions from using Pkg; Pkg.status() and from versioninfo()?


I have updated the original post with this information.

Does the error stack trace tell you the line numbers or at least the function where this is failing? We would need to see the code that causes the error.
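If the REPL prints nothing beyond the error name, one thing that may help is catching the exception yourself and printing the first few frames of the backtrace. The snippet below is only a sketch: `run_with_trace` is a hypothetical helper name (not part of Flux or Base), and it relies on the standard `catch_backtrace`, `stacktrace` and `showerror` functions. For a stack overflow the backtrace can be very long, so only the first `max_frames` frames are printed.

```julia
# Hypothetical helper (not part of Flux or Base): call `f` and, if it throws,
# print the exception and the first `max_frames` stack frames, then return the exception.
function run_with_trace(f; max_frames = 20, io = stdout)
    try
        return f()
    catch err
        frames = stacktrace(catch_backtrace())
        showerror(io, err)
        println(io)
        for frame in Iterators.take(frames, max_frames)
            println(io, "  ", frame)
        end
        return err
    end
end

# Example with a function that always fails.
failing() = error("something went wrong")
result = run_with_trace(failing)
```

With the code from this thread, the call would presumably be wrapped as `run_with_trace(() -> Flux.train!(loss4, Flux.params(NN), data, opt, cb=cb))`; whether the printed frames are informative for this particular error is not guaranteed.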

Here’s a reduced version of the code. The X1 and X2 matrices are populated with values from the training data, with n_pts = 585 and n_cases = 1276.

X1 = zeros(13, n_pts * n_cases)
X2 = zeros(5, n_pts * n_cases)
i = 0
for ics in 1:n_cases, ipt in 1:n_pts
    i += 1
    X1[:, i] = vcat(xyz[ipt, ics, :], uparams[ics, :], wa[ics])
    X2[:, i] = pp[ipt, ics, :]
end

loss_new(X1, X2, NN) = sum(abs2, NN(X1) - X2)
function loss4()
    return loss_new(X1, X2, NN)
end

# Training
data = Iterators.repeated((), 1000)
opt = Flux.ADAM(0.01, (0.9, 0.99))
Flux.train!(loss4, Flux.params(NN), data, opt, cb=cb)
ERROR: StackOverflowError:

The stack trace does not even give the line number where the error occurs. However, if I run the code line by line, the error occurs at the last line, during the Flux.train! call. Unfortunately, there are no further details.


I think the issue is likely to be the loss function. Try using one of the built-in Flux loss functions:
Flux Losses

You can try running your current loss directly in the REPL to see if it works as expected, but I imagine it should look like:

sum(abs2.(NN(X1) .- X2))
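To make the REPL check concrete, here is a small sketch that evaluates both spellings of the loss on a tiny batch. It deliberately avoids Flux: `model`, `W`, `X1_small` and `X2_small` are made-up stand-ins for the network and the training data, chosen only so the snippet runs on its own. It suggests that the two forms should agree in the forward pass, so any difference would have to come from elsewhere (for example, from differentiation).

```julia
# Hypothetical stand-in for NN: a fixed 5×13 matrix followed by abs,
# mapping a 13×N input to a 5×N output like the network in the question.
W = fill(0.1, 5, 13)
model(x) = abs.(W * x)

# Tiny made-up batch with the same row counts as X1 (13) and X2 (5).
X1_small = rand(13, 8)
X2_small = rand(5, 8)

# Form used in loss_new from the question.
loss_original = sum(abs2, model(X1_small) - X2_small)

# Broadcast form suggested in this reply.
loss_broadcast = sum(abs2.(model(X1_small) .- X2_small))

println("original:  ", loss_original)
println("broadcast: ", loss_broadcast)
println("agree:     ", loss_original ≈ loss_broadcast)
```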

I would suggest changing the train line to something like:

data = Iterators.repeated((X1, X2), 1000)
loss(x, y) = Flux.Losses.mse(NN(x), y)
Flux.train!(loss, Flux.params(NN), data, opt, cb=cb)

I am not able to run the code now, so you may need to tweak the code above to get it to work, but see if that helps.
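For reference, a self-contained version of that suggestion might look like the sketch below. It assumes the Flux v0.13 implicit-parameter API from the package list in the question (`Flux.params`, `Flux.ADAM`, and `Flux.train!(loss, params, data, opt)`); later Flux releases changed this interface. The data is random and much smaller than the real X1 and X2, the iteration count is reduced to 50, and the callback is left out, so this only checks that the training call itself runs.

```julia
using Flux  # assumes Flux v0.13.x, as listed in the question

# Small random stand-ins for the real training data (13 × N inputs, 5 × N targets).
n_samples = 64
X1 = rand(Float32, 13, n_samples)
X2 = rand(Float32, 5, n_samples)

# Same architecture as in the question.
NN = Flux.Chain(
    Flux.Dense(13, 32, sigmoid),
    Flux.Dense(32, 32, sigmoid),
    Flux.Dense(32, 5),
    y -> abs.(y))

# The loss receives the batch as arguments instead of reading global variables.
loss(x, y) = Flux.Losses.mse(NN(x), y)

# Each element of `data` is one (x, y) tuple, which train! splats into loss(x, y).
data = Iterators.repeated((X1, X2), 50)
opt = Flux.ADAM(0.01, (0.9, 0.99))

loss_before = loss(X1, X2)
Flux.train!(loss, Flux.params(NN), data, opt)
loss_after = loss(X1, X2)

println("loss before training: ", loss_before)
println("loss after training:  ", loss_after)
```

If this runs cleanly but the full script still overflows, the difference between the two (data construction, the callback, or the environment) is a reasonable place to look next.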