DifferentialEquations.jl: Exception thrown in callback of solve command that is handled by a try-catch block inside another solve command causes the outer solve command to fail

I observed this with the latest version of julia, Running on a Windows 11 PC.

tl dr: I have an exception being thrown in a callback for a solve command that’s inside another solve command. I have a try-catch block to handle the exception. However, the outer solve command fails due to the exception. This doesn’t happen when the exception is thrown outside of the callback. This happens regardless of whether I make a custom exception or cause a built-in exception.


Minimal Working Example:

module julia_try_catch_ode

export startup

this line is commented out in the MWE # struct SpatialError <: Exception end

function startup()
@eval using DifferentialEquations
dfdt_0 = [0.0]
tbounds = [0.0, 5.0]
dfdt_params =
dfdt_prob = invokelatest(ODEProblem,dfdt, dfdt_0, tbounds, dfdt_params)
dfdt_sol = invokelatest(solve,dfdt_prob)
end

function dfdt(df, f, p, t)
df = [-1.0]
internal_IC = [0.0]
internal_bounds = [0.0, 5.0]
internal_params =
test_callback = DiscreteCallback(example_callback, sqrt(-1)) # throw(SpatialError)
internal_prob = invokelatest(ODEProblem,dfdt_2, internal_IC, internal_bounds, internal_params)
try
doomed_solve = invokelatest(solve,internal_prob, callback = x_first_callback)
catch e
print(“doomed solver included. No custom exception. \n”)
print(t)
print(“\n”)
print(“e\n”)
print(e)
print(“\n”)
end
return df
end

function dfdt_2(df, f, p, t)
df = [-1.0]
return df
end

function example_callback(var, t, integrator)
return true
end

end


Why does it matter?

I am solving a system that varies in space and time. My solution has the following general structure:

Outer call to solve for change over time

Inner call to solve for spatial change that’s computed at every time step of the outer solver

The inner call has a callback that is meant to prevent a square root domain error under certain conditions. Particularly sometimes an error in the time-dependent solver causes the inner solver to enter a state where square roots of negative numbers will be requested. This indicates instability, and means the spatial operation needs to be terminated. If I don’t terminate it, an error will occur.

Unfortunately, the spatial function is extremely stiff, so radau() is the only method that can handle it. The terminate! callback is not implemented for radau(). Thus, as a workaround, I made a callback that throws a custom exception, and a try-catch block that handles this exception.

Unfortunately, this causes the outer ODE solver to fail.

As you can see from the MWE above, a custom exception isn’t needed to see this behavior. Nor is radau() required. Any exception in a callback, even if handled with try-catch, will cause the outer solver to fail.

I believe this is a bug, so I want to bring it to the developer’s attention. I would also appreciate any suggestions from the community for a workaround.

I don’t want to remove the callback and just make my try-catch exempt the square root domain error (this would “fix” the MWE) because then I wouldn’t see if there’s a domain error due to an unexpected bug. I know about the specific parameter values that cause this specific domain error, so I only want to stop the spatial integration and retry with a different set of parameters under that specific condition, not whenever a domain error occurs.

This workaround, suggested by bing copilot AI, solves the MWE. I’m not sure if it solves my real use case (waiting for the high performance computer to run) but I feel optimistic. Injecting a NaN into the integrator seems to correctly indicate that there is instability and cause the integration to terminate in the right way

I still think there’s a bug here related to exception handling in callbacks for nested solve functions for DifferentialEquation.jl

“Fixed” MWE using an injected NaN:

module julia_try_catch_ode

export startup

struct SpatialError <: Exception end

function startup()
@eval using DifferentialEquations
dfdt_0 = [0.0]
tbounds = [0.0, 5.0]
dfdt_params =
dfdt_prob = invokelatest(ODEProblem, dfdt, dfdt_0, tbounds, dfdt_params)
dfdt_sol = invokelatest(solve, dfdt_prob)
end

function dfdt(df, f, p, t)
df[1] = -1.0
internal_IC = [0.0]
internal_bounds = [0.0, 5.0]
internal_params =
test_callback = DiscreteCallback(example_callback, handle_callback)
internal_prob = invokelatest(ODEProblem, dfdt_2, internal_IC, internal_bounds, internal_params)

try
    doomed_solve = invokelatest(solve, internal_prob, callback = test_callback)
catch e
    println("doomed solver included")
    println("time: ", t)
    println("exception: ", e)
end

return df

end

function dfdt_2(df, f, p, t)
df[1] = -1.0
return df
end

function example_callback(var, t, integrator)
return true
end

function handle_callback(integrator)
try
throw(SpatialError(“Example spatial error”))
catch e
println("Handled exception in callback: ", e)
integrator.u[1] = NaN # Set a specific value or flag to handle the exception
end
return
end

end

Hi there :slight_smile:
Please post code enclosed in ``` to make it a lot more readable.

While I am not sure about your problem specifically, your MWE does have several issues I’d like to address:

Module loading

You should load modules at toplevel and not via @eval. Using @eval using ... basically makes performant code impossible and make everything much more inconvenient (e.g. you need to use invokelatest).

Undefined variables

Your MWE uses the variable x_first_callback but it is never defined. You probably meant test_callback. However…

Wrong construction of DiscreteCallback

you never arrive there because your code errors on constructing the DiscreteCallback. You wrote:

test_callback = DiscreteCallback(example_callback, sqrt(-1))

This fails immediately with DomainError because you compute sqrt(-1) before even constructing the callback! You probably forgot to make it an anonymous function like so:

test_callback = DiscreteCallback(example_callback, (integrator) -> sqrt(-1))

This perhaps also explains why the “outer” solver fails: You never catch your exception because it is thrown outside the try-catch that was intended to catch it.

Your improved MWE doesn’t have this problem because test_callback is constructed correctly.

It is unclear to me what the consequence of a failure of the inner solver for the outer solver should be so I can’t recommend a good way to solve your original problem. I suggest you open a new thread where you explain the structure of the problem you want to solve and then likely someone knowledgeable will pop in, explaining how to handle your case in the best way :slight_smile:

Here is a suggestion of how your MWE code could look like (where I also fixed some other issues your code had):

module julia_try_catch_ode

using DifferentialEquations

function run()
    dfdt_0 = [0.0]
    tbounds = (0.0, 5.0) # tuple is preferred
    dfdt_prob = ODEProblem(dfdt!, dfdt_0, tbounds)
    dfdt_sol = solve(dfdt_prob)
end

function dfdt!(df, f, p, t) # mutating function get a ! in Julia
    # df = [-1.0] # this doesn't overwrite df
    df[1] = -1.0
    internal_IC = [0.0]
    internal_bounds = (0.0, 5.0)
    test_callback = DiscreteCallback((_,_,_)->true, (_)->sqrt(-1))
    internal_prob = ODEProblem(dfdt_2, internal_IC, internal_bounds)
    try
        doomed_solve = solve(internal_prob, callback = test_callback)
    catch e
        # internal solve failed
        # what does that mean for the outer solve?
    end
    # return df # no need to return
end

function dfdt_2(df, f, p, t)
    # df = [-1.0] # this doesn't overwrite df
    df[1] = -1.0
end

run()
end
3 Likes

radau() update

First, an update for anyone who finds this thread because they are dealing with radau() specifically:

Injecting a NaN into the solution using a callback like the bing AI suggested doesn’t work because Radau does not support the time step integrator.u being edited. This would work for solvers other than radau

What does work for radau is to add an if statement into my real dfdt that checks the condition and sets df to NaN if the condition is met. The code that’s vulnerable to the square root domain error is only executed if the condition is not met. This is the solution/workaround to my specific problem.


@abraemer found another issue with my MWE that would also fix the problem I thought I found (but wouldn’t get radau() to work, see above for what would):

@eval

@eval is essential because my real code is being called from a sbatch slurm script on a high performance computing environment with multithreading. I have found that if I don’t use @eval and invokelatest errors occur. That said my MWE runs on Windows and it doesn’t require @ eval.

Anonymous function
This fixes it, for the case without radau!

Updated MWE

Here is my MWE code without @eval, formatted correctly, and using an anonymous function. It now works as expected with both the custom exception and the intentional domain error. Using an anonymous function in the callback as @abraemer suggested fixes the problem.

module julia_try_catch_ode

using DifferentialEquations

export startup

struct SpatialError <: Exception end

function startup()
	dfdt_0 = [0.0]
	tbounds = [0.0, 5.0]
	dfdt_params = []
	dfdt_prob = invokelatest(ODEProblem,dfdt, dfdt_0, tbounds, dfdt_params)
	dfdt_sol = invokelatest(solve,dfdt_prob)
end

function dfdt(df, f, p, t)
	df = [-1.0]
	internal_IC = [0.0]
	internal_bounds = [0.0, 5.0]
	internal_params = []
	#test_callback = DiscreteCallback(example_callback, (integrator) -> sqrt(-1))
	test_callback = DiscreteCallback(example_callback, (integrator) -> throw(SpatialError))
	internal_prob = ODEProblem(dfdt_2, internal_IC, internal_bounds, internal_params)
	try
		doomed_solve = solve(internal_prob, callback = x_first_callback)
	catch e
		print("doomed solver included. No custom exception. \n")
		print(t)
		print("\n")
		print("e\n")
		print(e)
		print("\n")
	end
	return df
end

function dfdt_2(df, f, p, t)
	df = [-1.0]
	return df
end

function example_callback(var, t, integrator)
	return true
end

end

I also used to run Julia on HPC and can tell you that you definitively do not need @eval. Please make a new thread describing your issue that led to you adding @eval. Avoiding @eval and invokelastest will likely speed up your code significantly as invokelatest prohibits efficient compilation!

Thank you for the advice! I don’t have time to look into it at this moment, and its been a while since I moved everything to @eval. When I have a bit more time I’ll do some tests to confirm if it’s still necessary. I’ve tried several different parallelization methods before finding one that worked well so there’s a possibility that the interaction that made it necessary isn’t present anymore. If it is still necessary, when I have a chance, I’ll put together a good MWE illustrating the problem.

As @abraemer pointed out above, a function like this will not mutate df, so you will not be integrating the ODE you intend to integrate. If you use this pattern in your real code, your results will be incorrect, regardless of the callback issue. You need something like df .= -1.0 (note the dot in .=).

That aside, throwing and catching an error or injecting NaN are both rather crude ways of forcing termination. The supported way is to call terminate!(integrator) in the callback. You can then check if sol.retcode == SciMLBase.ReturnCode.Terminated and handle the situation accordingly.

radau() doesn’t support the terminate!(integrator) callback. That is the main reason to go to all this trouble. The terminate! callback was the first thing I tried.

In my real code, editing the df variable is done as follows (only showing the relevant pieces):

For the temporal integration:

function dCVdt_adaptive_step_size(dCV, CV, p, t)
dCV[1] = dCstlbrVbr_dt
dCV[2] = dCstldrVdr_dt
return dCV

For the spatial code

function dchi_dz_pos(dchi_dz_return, chi_vec, p, z)
dchi_dz_return .= dchi_dz_val[1]
dchi_dz_return # This is the last line of the function

There are three versions of the spatial code function for different portions of my system but this portion is handled the same in all of them.

Both produce the expected results and successfully solve relevant test cases. Please let me know if there’s an error in either of them, but everything I see is consistent with them working. For the vast majority of cases, the model works very well. I’m currently trying to troubleshoot a very specific set of inputs, but that’s beyond the scope of this thread.

1 Like

Right, my apologies, I missed that part of your original post.

The code you’re showing does indeed mutate dCV and dchi_dz_return correctly, so no worries there. Do note that you’re not expected to return these arrays, although doing so does no harm. My own preference is to put return nothing at the end of such functions as an extra indicator that they work via mutation or other side effects, but of course, you don’t have to copy my style.