How do I debug this in DiffEqFlux?

This isn't exactly my latest code, but the interesting part was trying to get the last time of the solution. When I used it in the loss function, it changed how the solution was reported and the "Array" command failed.

function loss_adjoint(θ)
    s = predict_adjoint(θ)
    # if isa(s, RecursiveArrayTools.DiffEqArray)  # bug fix: handle the converted solution type
        temp = Array(s)
        x = temp[:, end]     # final state (was x = s.u[end])
        t = size(temp, 2)    # can't seem to get the actual final time out, so use the number of saved points
    # else
    #     x = s[:, end]
    #     t = s.t[end]
    # end

    miss = tgt_miss_distance(x)
    miss = miss < maxMiss ? miss : maxMiss
    loss = miss + t*2
    return loss
end

The training call is:

res = DiffEqFlux.sciml_train(x->loss_adjoint(x,10), θ, ADAM(0.001), cb = cb_plot, maxiters = 50)

Also, are differential equations with neural networks unable to train on GPUs? When I tried to add the |>gpu to the chain and then to the u0, it failed with a lot of red. I can send you the code if you can't see the problem from that snippet.

Can @enum's be passed as parameters into the train functions? I switched to TrackerAdjoint() hoping it would be easier to debug and perhaps give me GPU capabilities, and I also tried to include an @enum as a parameter that gets switched by callbacks. But TrackerAdjoint had problems converting it to a Float. I overloaded AbstractFloat, but now I get a stack overflow.

It must be something else. I changed the enum to just integers and it still overflowed.

The issue isn’t DiffEq (an example of this is at https://diffeqflux.sciml.ai/dev/examples/mnist_neural_ode/), the issue is that ReverseDiff isn’t GPU-compatible. Your function doesn’t look very GPU-parallelizable though, in the sense that it doesn’t expose enough parallelism for GPUs to actually accelerate it.

I don’t think so? I don’t think that could be differentiable.

Tracker builds big call stacks and can hit Julia’s stackoverflow even when it’s working as intended. It’s an issue and one of the main reasons we don’t use Tracker much anymore.

The biggest problem I have is very cryptic error messages and figuring out how to track down what is really wrong. I try to run in the VS Code debugger with @enter, but the debugger crashes and terminates Julia before I have a chance to read the output. I figured the debugger would let me see which line of the loss function is blowing up and let me walk through my own stack instead of the internals of the sensitivity functions. I'm not sure how to correlate the cryptic messages back to lines of code.
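
For now, the best idea I've come up with is to wrap the loss in a try/catch so the error and backtrace get printed before the session dies (a sketch; loss_debug is just a hypothetical wrapper name):

function loss_debug(θ)
    try
        return loss_adjoint(θ)
    catch e
        # print the error and the backtrace to stderr before Julia or the debugger goes away
        showerror(stderr, e, catch_backtrace())
        println(stderr)
        rethrow()
    end
end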

I think this is what is causing the stack overflow, but I don't understand why.

make_p(nnetIn,adjustable_param) = [flying, adjustable_param, nnetIn...]

flying in this case is 1.0 and represents a state that may change during the diffeq solve due to callbacks.

It is called like this:

prob = ODEProblem(simpleFly!,u0,tspan, make_p(θ,30.0), callback=cb_easy, saveat=saveDataAt)

or:

function predict_adjoint(θ, adjustable)
   s=solve(prob,Tsit5(),p=make_p(θ,adjustable),sensealg=sensitivity,abstol=accuracy,reltol=accuracy,saveat=saveDataAt)
end

θ is created like this:

ann_chain = Chain(Dense(annInputLen,64,tanh), 
                    Dense(64, 20, tanh),
                    Dense(20, 40, tanh),
                    Dense(40, 1, tanh)) 

θ, ann = Flux.destructure(ann_chain)      

I'm not sure what is happening, other than that θ may be a Float32 while the other values are Float64s, since I don't specify. It may be getting into some kind of promotion war?
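
If that turns out to be it, I suppose I could just promote everything to Float64 up front (untested):

θ = Float64.(θ)      # the Flux chain probably gives Float32 parameters since I don't specify
u0 = Float64.(u0)    # make sure the initial condition matches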

Splatting can build huge expressions, so you should avoid it. Here you probably just want vcat(flying, adjustable_param, nnetIn).
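
i.e. something like:

# build the parameter vector with one vcat call instead of splatting nnetIn
# into an array literal, which generates an enormous expression
make_p(nnetIn, adjustable_param) = vcat(flying, adjustable_param, nnetIn)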

Thank you so much for the assistance! That unstuck it. I had a 20,000-line history buffer on my Julia terminal and the stack-overflow warning would consume all of it, so I couldn't figure out where the problem might be, and the debug editor would stack overflow as well.

This is very good to know about splatting. There should be a warning about it somewhere in the documentation, for building nnet inputs and other such things. I just thought it was something else I was doing wrong. I think I had it as a vcat at one point, but thought splatting was better for some noob reason.

I was going to say you should look at the Julia performance tips page because it mentions that there are always performance issues with splatting big arrays.

https://docs.julialang.org/en/v1/manual/performance-tips/

And then… I realized this isn’t mentioned on that page, so we should make sure to add it :wink:

Also, tracked arrays have difficulty when I have types that I use to hold a parameter adjusting the derivative calculation that can be different. Now I just need to find out where my NaN crept in while I was doing all my what-if-this-is-it changes. For posterity, I went with this for the callback; it might have been overkill.

function ground_affect!(integrator)
    # convert `ground` to the same type as the first (tracked) parameter so the
    # callback doesn't mix element types mid-solve
    p = vcat(convert(typeof(integrator.p[1]), ground), integrator.p[2:end])
    integrator.p = p
end

where ground holds a number describing the integration state.

Yeah, I’m hoping we can completely eliminate them in the future.

I'm switching between the TrackerAdjoint and the ReverseDiffAdjoint to see which helps me find the bug, since the VS Code debugger kills Julia and I lose the screen with the errors. Stepping through the sciml_train call in the debugger, it gets to a point where it can't process the LinearAlgebra normalize function once the input gets converted into a tracked array. I switched to a custom normalize, but now that is failing too. I think it's probably how the divide mutates the vector; it doesn't like it. Very confusing, but I think it works in non-debug mode. Seems like a bug.

Note where the AD tools are going: DifferentialEquations - Derivatives in ODE function/ nesting AD - #2 by ChrisRackauckas . I think most of these issues should be handled by what we’re moving things towards.

That sounds very cool. Exciting stuff. Right now I'm trying to figure out how to do an !isfinite test on an array to find my NaNs, but it doesn't have a ReverseDiff.TrackedArray equivalent, which makes debugging where and when the NaN is coming from difficult.

To fix my normalize problem I had to change from this:

normalize(v) = (mag=norm(v); mag>0.0 ? v/mag : v)

to this:

normalize(v) = (mag=norm(v); mag>0.0 ? [v[1]/mag, v[2]/mag, v[3]/mag] : v)

I’ll call in @mohamed82008 for the ReverseDiff issue.

You can define:

Base.isfinite(x::TrackedArray) = isfinite(value(x)) && isfinite(deriv(x))

in ReverseDiff and open a PR to ReverseDiff.jl :wink:
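
For a quick local check without the PR, you could also just look at the primal values directly, something like:

# x is a ReverseDiff.TrackedArray; ReverseDiff.value(x) gives the underlying Array
any(!isfinite, ReverseDiff.value(x)) && @warn "NaN or Inf in the primal values"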

Hmmm… My first PR.

I think I solved my problem with strategic println's and a fast Ctrl-C, since the sciml_train solver tells me when it first gets NaNs. It would be beyond great to have a way to make the debugger stop on NaN generation. I'm not sure how to do that, and in my case I had trouble getting the debugger to run properly at all.

In my case, I failed to notice an unprotected potential divide by zero on a condition that I set to zero once that part of the differential equation solution is no longer needed. I didn't notice it after implementing it because it took me so long to get the stack overflow solved. I'm sure I can think of a better way to do this.
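
Something like this guard would probably have saved me (a quick sketch; num and den are just placeholders for the quantities in my derivative function):

# hypothetical helper: return zero instead of dividing by the condition once it
# has been zeroed out, so no NaNs leak into the rest of the solve
safe_div(num, den) = iszero(den) ? zero(num) : num / den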

Thanks Chris for all of the help. I’m sure you are very busy.

One more question. I can't seem to find or guess the syntax hinted at by the documentation. It seems like I can give the differential equation a custom return code from a callback.

I don't see an example anywhere on Google. Am I reading the help for terminate! correctly?

This was my latest try:

groundhit_condition(u, t, integrator) = u[6]

ground_terminate!(integrator) = terminate!(integrator, retcode=:Ground)
cb_ground = ContinuousCallback(groundhit_condition, ground_terminate!)

You can. I don’t think we’ve used it anywhere… and I don’t know if it’ll stick around after that moves to enums.