Flux.train! hangs after updating packages

Hello people,
I am currently trying to train a neural ODE using a chemical reaction network from Catalyst.jl.
My code was working fine until I updated my packages. Unfortunately I just ran Pkg.update() and didn't really pay attention to what was updated, so I can't easily go back.

What is happening now is that Flux.train! seems to get stuck right away (no error, it just keeps running forever). Before the update, 100 training iterations took me about 90 seconds (on a very weak machine).

I am writing my code in Atom/Juno, and this is what shows up:
inf
Maybe someone with Atom/Juno experience knows what this means in my case?

I am using Julia 1.6.1, DiffEqFlux 1.41, Flux 0.12.4, DiffEqSensitivity 6.54.0, and DifferentialEquations 6.17.1.
It's hard to figure out what's wrong without error messages … my code was working as-is before the updates.
Any ideas what's wrong/happening?

EDIT: Maybe I should mention that I pieced my code together from the documentation and this paper.

Not sure if this is the issue, but did you restart Julia after updating the packages?

Try obtaining gradients on just one batch of data and not using train!. That should give you a fast way to troubleshoot any issues.
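
As a rough sketch (assuming a loss function that takes the flat parameter vector, as in the usual DiffEqFlux setup; loss and p are placeholders for whatever you currently pass to train!), a single gradient call looks like this:

using Zygote

# one loss evaluation and one gradient call, no train! loop involved
@time l = loss(p)
@time grad = Zygote.gradient(loss, p)[1]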


Yes, show the gradient calculation example as an MWE.


Okay, so I updated my code like this to look at the gradients and to use update!, and now the training somehow works fine. I am happy with that, but it doesn't really help me understand why train! is not working for me. update! also gives much better results than train! did (back when train! was still working).
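
Roughly, the loop now looks like this (a simplified sketch: loss, p, and the optimiser settings stand in for my actual loss function, flat parameter vector, and optimiser):

using Flux, ForwardDiff

opt = ADAM(0.01)                          # placeholder optimiser settings
for i in 1:100
    grad = ForwardDiff.gradient(loss, p)  # gradient of the loss w.r.t. the flat parameters
    Flux.Optimise.update!(opt, p, grad)   # in-place parameter update
    @show loss(p)
end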

I guess this must be because of my model, since the example code from the documentation that uses train! works just fine. It is still strange that it only started after updating my packages, though.
So I don't know how to make an MWE out of my code/model, or whether that would even make sense.

You’re restructuring outside of the function:

https://github.com/aspecht1/neuralODE/blob/main/test.jl#L34

Also, ForwardDiff.gradient is different from Zygote.gradient, which Flux uses under the hood. You may be able to repro the issue by using the latter instead.
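
For example, something along these lines (loss and p being your loss function and flat parameter vector) should show whether the two backends behave differently:

using ForwardDiff, Zygote

grad_fd = ForwardDiff.gradient(loss, p)   # forward-mode AD, what you currently use
grad_zy = Zygote.gradient(loss, p)[1]     # reverse-mode AD, what Flux.train! uses internally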


Okay, sorry for the late response. So I replaced grad = ForwardDiff.gradient() with @time grad = Zygote.gradient()[1] and got this output:

1.0966585823313262e7
177.862870 seconds (287.27 M allocations: 20.921 GiB, 5.49% gc time, 32.76% compilation time)
┌ Warning: First function call produced NaNs. Exiting.
└ @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2AoGt/src/initdt.jl:81
┌ Warning: Automatic dt set the starting dt as NaN, causing instability.
└ @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2AoGt/src/solve.jl:510
┌ Warning: NaN dt detected. Likely a NaN value in the state, parameters, or derivative value caused this outcome.
└ @ SciMLBase ~/.julia/packages/SciMLBase/kCcpg/src/integrator_interface.jl:325
445024.28042782197
┌ Warning: First function call produced NaNs. Exiting.
└ @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2AoGt/src/initdt.jl:81
┌ Warning: Automatic dt set the starting dt as NaN, causing instability.
└ @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/2AoGt/src/solve.jl:510
┌ Warning: NaN dt detected. Likely a NaN value in the state, parameters, or derivative value caused this outcome.
└ @ SciMLBase ~/.julia/packages/SciMLBase/kCcpg/src/integrator_interface.jl:325

Then the code hangs again. So, as you suspected, the problem is calculating the gradient with Zygote. Zygote.gradient() needs 177 s to compute the gradient, while ForwardDiff.gradient() only needs 0.02 s.

Is that a problem, and which function are you talking about? I'm very new to this, so I'm not quite sure what you are pointing to, but I am very happy to learn and get more insight into how to do these things properly.
All the example codes I looked at do it in a similar way.
With p, re = Flux.destructure(NN) I obtain the initial network parameters p and the function re that reconstructs the network, which I name re_NN in the line you're referencing.
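
Concretely, this is the pattern I follow (the chain below is just a stand-in, not my actual network):

using Flux

NN = Chain(Dense(3, 16, tanh), Dense(16, 3))  # placeholder architecture
p, re = Flux.destructure(NN)                  # p: flat parameter vector, re: rebuilds a network from such a vector
re_NN = re(p)                                 # reconstructed network, callable just like NN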

I will look deeper into it.

If you don’t restructure inside the ODE function, the neural network is not a function of the parameters p.

re_NN = re(p)   # restructuring once, outside the ODE function, fixes the network to this particular p

function dudt!(du, u, p, t)
    # restructure inside, so the network depends on the current parameters p,
    # then scale the output back to the data scale
    du .= re(p)(u) .* yscale / t_end
end