How do I debug this in DiffEqFlux?

This isn't exactly my latest code, but the interesting part was trying to get the last time of the solution. When I used it in the loss function, it changed how the solution was reported and the "Array" command failed.

function loss_adjoint(θ)
    s = predict_adjoint(θ)
    # if isa(s, RecursiveArrayTools.DiffEqArray)  # bug fix: handle the converted solution type
        temp = Array(s)
        x = temp[:, end]     # final state (was x = s.u[end])
        t = size(temp, 2)    # can't seem to get the actual final time out, so use the number of saved points
    # else
    #     x = s[:, end]
    #     t = s.t[end]
    # end

    miss = tgt_miss_distance(x)
    miss = miss < maxMiss ? miss : maxMiss
    loss = miss + t*2
    return loss
end

The training call is:

res = DiffEqFlux.sciml_train(x->loss_adjoint(x,10), θ, ADAM(0.001), cb = cb_plot, maxiters = 50)

Also, are differential equations with neural networks unable to train on GPUs? When I tried to add the |>gpu to the chain and then to the u0, it failed with a lot of red. I can send you the code if you can't see the problem from that snippet.

Can @enum's be passed as parameters into the train functions? I switched to TrackerAdjoint() hoping it would be easier to debug and perhaps give me GPU capabilities, and I also tried to include an @enum as a parameter that gets switched by callbacks. But TrackerAdjoint had problems converting it to a Float. I overloaded AbstractFloat, but now I get a stack overflow.

It must be something else. I changed the enum to just integers and it still overflowed.

The issue isn’t DiffEq (an example of this is at https://diffeqflux.sciml.ai/dev/examples/mnist_neural_ode/), the issue is that ReverseDiff isn’t GPU-compatible. Your function doesn’t look very GPU-parallelizable though, in the sense that it doesn’t expose enough parallelism for GPUs to actually accelerate it.

I don’t think so? I don’t think that could be differentiable.

Tracker builds big call stacks and can hit Julia’s stackoverflow even when it’s working as intended. It’s an issue and one of the main reasons we don’t use Tracker much anymore.

The biggest problem I have is very cryptic error messages and figuring out how to track down what is really wrong. I try to run in the VS Code debugger with @enter, but the debugger crashes and terminates Julia before I have a chance to read the output. I figured the debugger would let me see which line of the loss function is blowing up and let me walk through my own stack instead of the internals of the sensitivity functions. I'm not sure how to correlate the cryptic messages back to lines of code.
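
For now, the best idea I've come up with is to wrap the loss in a try/catch so the error and backtrace get printed before the session dies (a sketch; loss_debug is just a hypothetical wrapper name):

function loss_debug(θ)
    try
        return loss_adjoint(θ)
    catch e
        # print the error and the backtrace to stderr before Julia or the debugger goes away
        showerror(stderr, e, catch_backtrace())
        println(stderr)
        rethrow()
    end
end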

I think this is what is causing the stack overflow, but I don't understand why.

make_p(nnetIn,adjustable_param) = [flying, adjustable_param, nnetIn...]

flying in this case is 1.0 and represents a state that may change during the diffeq solve due to callbacks.

It is called like this:

prob = ODEProblem(simpleFly!,u0,tspan, make_p(θ,30.0), callback=cb_easy, saveat=saveDataAt)

or:

function predict_adjoint(θ, adjustable)
   s=solve(prob,Tsit5(),p=make_p(θ,adjustable),sensealg=sensitivity,abstol=accuracy,reltol=accuracy,saveat=saveDataAt)
end

θ is created like this:

ann_chain = Chain(Dense(annInputLen,64,tanh), 
                    Dense(64, 20, tanh),
                    Dense(20, 40, tanh),
                    Dense(40, 1, tanh)) 

θ, ann = Flux.destructure(ann_chain)      

I'm not sure what is happening, other than that θ may be a Float32 while the other values are Float64s, since I don't specify. It may be getting into some kind of promotion war?
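
If that turns out to be it, I suppose I could just promote everything to Float64 up front (untested):

θ = Float64.(θ)      # the Flux chain probably gives Float32 parameters since I don't specify
u0 = Float64.(u0)    # make sure the initial condition matches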

Splatting can build huge expressions, so you should avoid it. Here you probably just want vcat(flying, adjustable_param, nnetIn).
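
i.e. something like:

# build the parameter vector with one vcat call instead of splatting nnetIn
# into an array literal, which generates an enormous expression
make_p(nnetIn, adjustable_param) = vcat(flying, adjustable_param, nnetIn)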

Thank you so much for the assistance! That unstuck it. I had a 20,000-line history buffer on my Julia terminal and the stack-overflow warning would consume all of it, so I couldn't figure out where the problem might be, and the debug editor would stack overflow as well.

This is very good to know about splatting. There should be a warning about it somewhere in the documentation, for building nnet inputs and other such things. I just thought it was something else I was doing wrong. I think I had it as a vcat at one point, but thought splatting was better for some noob reason.

I was going to say you should look at the Julia performance tips page because it mentions that there are always performance issues with splatting big arrays.

https://docs.julialang.org/en/v1/manual/performance-tips/

And then… I realized this isn’t mentioned on that page, so we should make sure to add it :wink:

Also, tracked arrays have difficulty when I have types that I use to hold a parameter adjusting the derivative calculation that can be different. Now I just need to find out where my NaN crept in while I was doing all my what-if-this-is-it changes. For posterity, I went with this for the callback; it might have been overkill.

function ground_affect!(integrator)
    # convert `ground` to the same type as the first (tracked) parameter so the
    # callback doesn't mix element types mid-solve
    p = vcat(convert(typeof(integrator.p[1]), ground), integrator.p[2:end])
    integrator.p = p
end

where ground holds a number describing the integration state.

Yeah, I’m hoping we can completely eliminate them in the future.

I'm switching between the TrackerAdjoint and the ReverseDiffAdjoint to see which helps me find the bug, since the VS Code debugger kills Julia and I lose the screen with the errors. Stepping through the sciml_train call in the debugger, it gets to a point where it can't process the LinearAlgebra normalize function once the input gets converted into a tracked array. I switched to a custom normalize, but now that is failing too. I think it's probably how the divide mutates the vector; it doesn't like it. Very confusing, but I think it works in non-debug mode. Seems like a bug.

Note where the AD tools are going: DifferentialEquations - Derivatives in ODE function/ nesting AD - #2 by ChrisRackauckas . I think most of these issues should be handled by what we’re moving things towards.

That sounds very cool. Exciting stuff. Right now I'm trying to figure out how to do an !isfinite test on an array to find my NaNs, but it doesn't have a ReverseDiff.TrackedArray equivalent, which makes debugging where and when the NaN is coming from difficult.

To fix my normalize problem I had to change from this:

normalize(v) = (mag=norm(v); mag>0.0 ? v/mag : v)

to this:

normalize(v) = (mag=norm(v); mag>0.0 ? [v[1]/mag, v[2]/mag, v[3]/mag] : v)

I’ll call in @mohamed82008 for the ReverseDiff issue.

You can define:

Base.isfinite(x::TrackedArray) = isfinite(value(x)) && isfinite(deriv(x))

in ReverseDiff and open a PR to ReverseDiff.jl :wink:
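
For a quick local check without the PR, you could also just look at the primal values directly, something like:

# x is a ReverseDiff.TrackedArray; ReverseDiff.value(x) gives the underlying Array
any(!isfinite, ReverseDiff.value(x)) && @warn "NaN or Inf in the primal values"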

Hmmm… My first PR.

I think I solved my problem with strategic println's and a fast Ctrl-C, since the sciml_train solver tells me when it first gets NaNs. It would be beyond great to have a way to make the debugger stop on NaN generation. I'm not sure how to do that, and in my case I had trouble getting the debugger to run properly at all.

In my case, I failed to notice an unprotected potential divide by zero on a condition that I set to zero once that part of the differential equation solution is no longer needed. I didn't notice it after implementing it because it took me so long to get the stack overflow solved. I'm sure I can think of a better way to do this.
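
Something like this guard would probably have saved me (a quick sketch; num and den are just placeholders for the quantities in my derivative function):

# hypothetical helper: return zero instead of dividing by the condition once it
# has been zeroed out, so no NaNs leak into the rest of the solve
safe_div(num, den) = iszero(den) ? zero(num) : num / den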

Thanks Chris for all of the help. I’m sure you are very busy.

One more question. I can't seem to find or guess the syntax hinted at by the documentation. It seems like I can give the differential equation a custom return code from a callback.

I don't see an example anywhere on Google. Am I reading the help for terminate! correctly?

This was my latest try:

groundhit_condition(u, t, integrator) = u[6]

ground_terminate!(integrator) = terminate!(integrator, retcode=:Ground)
cb_ground = ContinuousCallback(groundhit_condition, ground_terminate!)

You can. I don’t think we’ve used it anywhere… and I don’t know if it’ll stick around after that moves to enums.