Treating NaN as error: Helping debugging

I am writing a large hydrological model, and for some sets of parameters I get a NaN value, but I do not know which function produces the NaN first. Therefore, I would like to stop my program at the first NaN.

A dirty debugging solution would be to test isnan(…) at every level, which is tedious.

I am therefore wondering whether there is an option in Julia to treat NaN as an error, which would help to determine which function is causing the issue.

There is no option that tells Julia to treat NaN as an error. NaN typically arises from operations such as these (where the signs may be positive or negative):

Inf / Inf
0.0 / 0.0
Inf % x

In turn, Inf arises from division by zero: inv(0.0) == Inf, and x / 0.0 == Inf for any x > 0.

Look at the routines to see where you may be dividing by zero (or introducing Inf).

chk(x)  = (iszero(x) || isinf(x)) && error("x = $x")
function chk(x,y)
    (iszero(x) || isinf(x)) && error("x = $x")
    (iszero(y) || isinf(y)) && error("y = $y")
end
    
function fn(a, b)
    chk(a, b)
    return a / b
end

NaNs are valid Float64s, engineered precisely for the purpose of making invalid results propagate without errors.

Your best option may be placing a few checks that validate outputs (or better, inputs). I find isfinite useful for this purpose (it may catch a few things that turn to NaNs later). I agree that it can be a bit cumbersome, but it can quickly pinpoint a problem.
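As a rough illustration of the isfinite approach, here is a minimal sketch. The helper name assert_finite and the toy runoff function are made up for this example; they are not from any package.

```julia
# Hypothetical helper: throw if any value in `x` is NaN or ±Inf.
# Works for scalars and arrays, since numbers are iterable in Julia.
function assert_finite(x, label::AbstractString)
    all(isfinite, x) || error("non-finite value in $label: $x")
    return x
end

# Hypothetical model step, standing in for one stage of a larger model.
runoff(rain, area) = assert_finite(rain / area, "runoff")

runoff(2.0, 4.0)      # returns 0.5
# runoff(0.0, 0.0)    # would throw, naming "runoff" as the source
```

Because the helper returns its argument, it can be wrapped around an existing expression without restructuring the code, and the label tells you which stage failed.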

Hmm, Fortran compilers let you enable floating-point exceptions, e.g. gfortran -ffpe-trap=invalid will cause an exception once a NaN is created. I think technically this will cause a SIGFPE signal that can be caught by a signal handler. Would there be a way to do something similar in Julia, e.g. a library function to enable those exceptions and handling the signal will result in a backtrace?

Update: there’s an open issue on this, https://github.com/JuliaLang/julia/issues/27705

I wrote ElideableMacros.jl (GitHub: jwscook/ElideableMacros.jl, elidable macros in Julia that can be compiled out) as a way to try to understand macros (it turns out I didn’t master them). I’m fairly certain the hygiene is wrong, but it allowed me to create @elideableassert, which can be used as @elideableassert !isnan(x). The elision works by reading ENV["ELIDE_ASSERTS"]. This comes relatively close to enabling something like gfortran’s compiler options, as pointed out by @traktofon.

Use with caution! It’s buggy.

NaN treated as an error

Are there any new features or options in Julia v1.6.4 that treat NaN as an error, so that I can determine where in my code the NaN is being produced?

Many thanks for any suggestions

It would be nice if Julia supported signaling NaNs, but this seems to be an open issue. I am not sure what LLVM supports here, though.

You might be interested in this old post:

We ran into similar problems with NaN popping up when we computed derivatives with ForwardDiff; finding the source was very difficult. We wrote this utility type to make it easier.

It probably doesn’t support every floating point operation but it was enough for our use. Hasn’t been tested under 1.7.

Call your functions with NaNCheck instances instead of floats (only the parametric constructor is defined, so build them as, e.g., NaNCheck{Float64}(1.0)):

struct NaNCheck{T<:Real} <: Real
    val::T
    function NaNCheck{T}(a::S) where {T<:Real, S<:Real}
        @assert !(T <: NaNCheck)
        new{T}(T(a))
    end
end
export NaNCheck
Base.isnan(a::NaNCheck{T}) where{T} = isnan(a.val)
Base.isinf(a::NaNCheck{T}) where{T} = isinf(a.val)
Base.typemin(::Type{NaNCheck{T}}) where{T} = NaNCheck{T}(typemin(T))
Base.typemax(::Type{NaNCheck{T}}) where{T} = NaNCheck{T}(typemax(T))
Base.eps(::Type{NaNCheck{T}}) where {T} = NaNCheck{T}(eps(T))
Base.decompose(a::NaNCheck{T}) where {T} = Base.decompose(a.val)
Base.round(a::NaNCheck{T}, m::RoundingMode) where {T} = NaNCheck{T}(round(a.val, m))

struct NaNException <: Exception end

# (::Type{Float64})(a::NaNCheck{S}) where {S<:Real} = NaNCheck{Float64}(Float64(a.val))
(::Type{T})(a::NaNCheck{S}) where {T<:Integer,S<:Real} = T(a.val)
(::Type{NaNCheck{T}})(a::NaNCheck{S}) where {T<:Real,S<:Real} = NaNCheck{T}(T(a.val))
Base.promote_rule(::Type{NaNCheck{T}}, ::Type{T}) where {T<:Number} = NaNCheck{T}
Base.promote_rule(::Type{T}, ::Type{NaNCheck{T}}) where {T<:Number} = NaNCheck{T}
Base.promote_rule(::Type{S}, ::Type{NaNCheck{T}}) where {T<:Number, S<:Number} = NaNCheck{promote_type(T,S)}
Base.promote_rule(::Type{NaNCheck{T}}, ::Type{S}) where {T<:Number, S<:Number} = NaNCheck{promote_type(T,S)}
Base.promote_rule(::Type{NaNCheck{S}}, ::Type{NaNCheck{T}}) where {T<:Number, S<:Number} = NaNCheck{promote_type(T,S)}

for op = (:sin, :cos, :tan, :log, :exp, :sqrt, :abs, :-, :atan, :acos, :asin, :log1p, :floor, :ceil, :float)
    eval(quote
        function Base.$op(a::NaNCheck{T}) where{T}
            temp = NaNCheck{T}(Base.$op(a.val))
            if isnan(temp)
                throw(NaNException())
            end
            return temp
        end
    end)
end

for op = (:+, :-, :/, :*, :^, :atan)
    eval(quote
        function Base.$op(a::NaNCheck{T}, b::NaNCheck{T}) where{T}
            temp = NaNCheck{T}(Base.$op(a.val, b.val))
            if isnan(temp)
                throw(NaNException())
            end
            return temp
        end
    end)
end

for op =  (:<, :>, :<=, :>=, :(==), :isequal)
    eval(quote
        function Base.$op(a::NaNCheck{T}, b::NaNCheck{T}) where{T}
            temp = Base.$op(a.val, b.val)
            return temp
        end
    end)
end

The nansafe_mode of ForwardDiff.jl can help with debugging NaNs.

I may well be totally mistaken, but I have an impression that the NaN signal (the “invalid-operation exception” of IEEE) is generated at the hardware level. If that’s the case, it’s Julia’s interpreter that decides to ignore the signal and it would be possible to have a switch on the interpreter to catch the signal and convert it to a Julia exception.

I think that’s basically what the Fortran compiler does depending on flags given to the compiler.

The Fortran 2003 standard then starts to allow the programmer to control signal handling within the language. Now that it’s a language feature, the programmer doesn’t have to rely on compiler options.

I’m just very surprised that/if Julia totally lacks these features.

Bumping this. It would be a very useful feature.

TrackedFloats.jl (GitHub: utahplt/TrackedFloats.jl, a Julia library providing tracking of floating point errors through a program) might be useful; they had a very nice JuliaCon 2023 talk.

Trapping on NaN is controlled by various control registers on the hardware level.

The annoying part of signaling NaNs is that they massively complicate the semantics of the code, which prevents lots of compiler optimizations / code transformations. Without trapping, we can consider x = y / z for floating point numbers as side-effect-free. This means that it can be speculatively executed and reordered with other instructions. If it has potential side-effects, then everything sucks. So properly supporting signaling NaNs can be a major compiler effort.

Something that would make a lot of sense, with hopefully limited engineering effort, is support in the interpreter. Ideally we would also include all kinds of other nice debug features, e.g. a kind of limited UBSAN that tries to catch known cases where compiled and interpreted behavior differ (otherwise it would be quite unfun to find out that your NaNs only happen in compiled code, presumably due to broken pointer shenanigans / bugs).

Sure your code will be slow, but you already have a deterministic reproducer for your bug. Letting it churn over a weekend until you get your stack-trace is much better than the status quo!

A hackish way:
Base.:/(x::Float64, y::Float64) = iszero(y) ? error("nan") : Float64(Float32(x) / Float32(y))
You need to overload the method in Base if you want to affect imported packages. The round trip through Float32 is a workaround to avoid endless recursion, at the cost of precision.

Does Julia have “optimization options” to disable such optimizations? As you say, it would be nice if the programmer can choose to catch exceptions, sacrificing some optimization opportunities.

What a Fortran compiler does is

  • If certain optimization options are off, floating-point exceptions are guaranteed to be caught at the spots of occurrence.
  • If those options are on, floating-point exceptions may be ignored or reported at the wrong places.

From Fortran 2003 on, the language standard formalizes the handling of IEEE floating-point exceptions. So, the programmer can decide to catch floating-point exceptions, sacrificing some optimization opportunities.

I don’t think this is built into the compiler currently. On the other hand, Julia has a solution which is enabled by its composability of generic code and is just as good: run your code with a dedicated number type that has the semantics you want, i.e., the TrackedFloats mentioned by @ericphanson provide just that “compiler switch”.

At least on Linux you can enable floating-point exceptions. Something will go wrong; whether Julia manages to report a reasonable stack trace, I don’t know.

#From /usr/include/x86_64-linux-gnu/bits/fenv.h
const FE_INVALID   = 0x01
const FE_DENORM    = 0x02
const FE_DIVBYZERO = 0x04
const FE_OVERFLOW  = 0x08
const FE_UNDERFLOW = 0x10
const FE_INEXACT   = 0x20

julia> a = 0.0 / 0.0
NaN

julia> st = @ccall feenableexcept((FE_INVALID | FE_DIVBYZERO)::Cint)::Cint
julia> st ≠ 0 && @ccall perror("feenableexcept"::Cstring)::Cvoid

julia> b = 0.0 / 0.0
ERROR: DivideError: integer division error
Stacktrace:
 [1] /(x::Float64, y::Float64)
   @ Base ./float.jl:494
 [2] top-level scope
   @ REPL[10]:1

If you think about it, you will realize what you are asking doesn’t really make sense. Let me explain why.

Firstly, let me point out that this is not a Julia specific issue/question. (That isn’t necessarily that obvious however.)

Julia code is JIT compiled using LLVM. LLVM is also used by other compilers, such as clang++ (C++) and rustc (Rust).

If you think about it, an expression which operates on floating point numbers must compile down to some machine instructions which load some data into some registers and cause some floating point unit to start executing some hardware logic to compute the result of an operation. There will be hardware implementations for add, multiply, divide, etc.

On the other hand, errors and exceptions more generally are higher level constructs which require additional logic to propagate up the stack.

Edit: I’ve realized what I’ve written here might be confusing. By exception here I mean the thing that the compiler produces which can be emitted from one place in the stack, the stack then unwinds, and it can be caught elsewhere. As far as I am aware this is a separate concept to interrupts and OS level signaling, which is what is described in some of the other comments here relating to the example in Fortran whereby you can cause the compiler to treat NaNs as program interrupting conditions.

Another way to think about it. In C, errors are represented by return values. There are no exceptions in C. The compiler doesn’t build the machinery to emit, handle and propagate exceptions through different layers of the call stack. They just don’t exist.

To circle back then, the existence of a NaN is your return value which signals “error”. Actually, it may not be an error. NaN in => NaN out. If you put a NaN into some calculation, it isn’t an error if a NaN comes out. It just means the calculation can’t run because the data is missing. (Effectively this is what it says.)

Whether you interpret that as missing data or an error is up to you.

  • So you should check for NaNs when you consider their presence to be an error, and emit an exception when this happens
  • The most sensible place to check is just before calling a function (check the argument values) or just after receiving a return value (check the return value)
  • Alternatively, checking the values of arguments at the start of a function is also a good place
  • This will give you the exception like behavior you want
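
The checks described in the list above could be sketched roughly like this. The exception type NaNFound and the wrapper nan_guard are hypothetical names invented for this example, and the sketch assumes scalar real arguments and a scalar return value.

```julia
# Hypothetical exception type carrying a description of where the NaN was seen.
struct NaNFound <: Exception
    msg::String
end

# Wrap `f` so that its arguments and its return value are both checked.
# Assumes scalar real inputs and a scalar real result.
function nan_guard(f, name::AbstractString)
    return function (args...)
        any(isnan, args) && throw(NaNFound("NaN passed into $name"))
        result = f(args...)
        isnan(result) && throw(NaNFound("NaN returned by $name"))
        return result
    end
end

safe_ratio = nan_guard(/, "ratio")
safe_ratio(1.0, 2.0)      # returns 0.5
# safe_ratio(0.0, 0.0)    # throws NaNFound("NaN returned by ratio")
# safe_ratio(NaN, 1.0)    # throws NaNFound("NaN passed into ratio")
```

Distinguishing "passed into" from "returned by" is the useful part: the first function that reports "returned by" without a preceding "passed into" is where the NaN originates.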

I’m intrigued by this idea. Aside from being “automatic”, what would the advantage be? One disadvantage is that the behavior produced by the code changes depending on a compiler switch. This can be a source of problems between release and debug builds. (Not necessarily in Julia but in languages generally.) I’m not sure adding another behavior-changing compiler switch is a good idea from a reproducibility point of view.


Also suggest you look into Rust’s OrderedFloat concept. You could do a similar thing to detect NaNs whereby you wrap the float in some higher level type, and perform checks there.

Oh I see that might be what @brianguenter has suggested. (My Julia isn’t quite good enough to read this code yet.)
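
For readers who find the longer code hard to follow, here is a much-reduced sketch of the wrapper idea. The names Checked and checked are made up for this illustration; this is neither the NaNCheck type posted earlier nor Rust’s OrderedFloat, and it only covers the four basic arithmetic operators.

```julia
# Hypothetical wrapper around a Float64; arithmetic on it is checked for NaN.
struct Checked
    val::Float64
end

# Wrap a raw result, throwing as soon as a NaN appears.
function checked(v::Float64, op::Symbol)
    isnan(v) && error("NaN produced by $op")
    return Checked(v)
end

# Only these four operators are covered in this sketch.
Base.:+(a::Checked, b::Checked) = checked(a.val + b.val, :+)
Base.:-(a::Checked, b::Checked) = checked(a.val - b.val, :-)
Base.:*(a::Checked, b::Checked) = checked(a.val * b.val, :*)
Base.:/(a::Checked, b::Checked) = checked(a.val / b.val, :/)

one_, zero_ = Checked(1.0), Checked(0.0)
(one_ + one_).val     # 2.0
(one_ / zero_).val    # Inf (not a NaN, so no error yet)
# zero_ / zero_       # throws: "NaN produced by /"
```

The stack trace of the thrown error then points at the first operation that created the NaN, which is what the original question asks for. A real implementation would also need conversions, promotion rules, and the elementary functions.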

Just to point out, people will say things like “it would be nice for the programmer” meaning it would be convenient for a solo developer in some situations. When working in a team, additional environment complexities such as compiler behavior switches make things more complex.

I’m not sure which trade off is the better option - just raising it as a potential issue.

That’s the intended behavior of “quiet NaNs”. The IEEE standard also includes “signaling NaNs”, whose purpose is to raise exceptions:

(quoted from NaN - Wikipedia )

Ideally, the decision whether to suppress the exception or not, should be in the hands of the programmer.
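
To make the quiet/signaling distinction a bit more concrete, here is a small sketch that only inspects bit patterns. It assumes the common IEEE 754 binary64 convention (as on x86-64 and AArch64) in which the most significant mantissa bit marks a quiet NaN. The names QUIET_BIT and is_quiet are made up for this example, and the signaling pattern is hand-built for illustration.

```julia
# In IEEE 754 binary64, a NaN has all exponent bits set and a nonzero mantissa.
# Assumed convention: the top mantissa bit is 1 for quiet, 0 for signaling.
const QUIET_BIT = 0x0008000000000000

is_quiet(x::Float64) = isnan(x) && (reinterpret(UInt64, x) & QUIET_BIT) != 0

quiet = NaN
# Hand-built signaling NaN bit pattern (illustrative payload of 1).
signaling = reinterpret(Float64, 0x7ff0000000000001)

reinterpret(UInt64, quiet)   # 0x7ff8000000000000
is_quiet(quiet)              # true
isnan(signaling)             # true: it is still a NaN
is_quiet(signaling)          # false
```

Note that with floating-point traps masked, as is the default, arithmetic on a signaling NaN simply yields a quiet NaN, so the bit pattern alone does not give you an exception.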