Division by zero runs without warning -> complicates finding bugs

We have been considering this issue for several years in the context of overflow handling in FixedPointNumbers.
Fixed-point numbers are somewhat closer to integers than to floating-point numbers, so there was a plan to (re)use the mechanism from CheckedArithmetic.jl. (We are now moving toward using OverflowContexts.jl instead.)

As it pertains to this topic, / is also a candidate to be “checked”. (For convenience, I have named the checked function for / checked_fdiv.)
If the overflow-handling framework is completed for integers and fixed-point numbers, we will be able to check for division by zero in floating-point numbers as well (at least in user code).
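
A minimal sketch of what such a checked division might look like (hypothetical; the eventual API in CheckedArithmetic.jl / OverflowContexts.jl may well differ):

function checked_fdiv(x::T, y::T) where {T<:AbstractFloat}
    # throw eagerly instead of silently returning Inf or NaN
    iszero(y) && throw(DivideError())
    return x / y
end

checked_fdiv(1.0, 2.0)  # 0.5
checked_fdiv(1.0, 0.0)  # throws DivideError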

2 Likes

Seems fine?

julia> const FE_INVALID    = 0x1
0x01

julia> const FE_DIVBYZERO  = 0x4
0x04

julia> const FE_OVERFLOW   = 0x8
0x08

julia> const FE_UNDERFLOW  = 0x10
0x10

julia> const FE_INEXACT    = 0x20
0x20

julia> function setfpexceptions(f, mode, args::Vararg{Any,K}) where {K}
           # enable trapping for the requested exceptions;
           # feenableexcept returns the previously enabled set
           prev = ccall(:feenableexcept, Cint, (Cint,), mode)
           try
               f(args...)
           finally
               # disable only the exceptions this call newly enabled
               ccall(:fedisableexcept, Cint, (Cint,), mode & ~prev)
           end
       end
setfpexceptions (generic function with 1 method)

julia> function fpinvs!(y, x)
           for i = eachindex(y, x)
               y[i] = 1.0 / x[i]
           end
       end
fpinvs! (generic function with 1 method)

julia> N = 1024; x = rand(N); y = similar(x); x2 = copy(x); x2[57] = 0.0;

julia> setfpexceptions(fpinvs!, FE_DIVBYZERO, y, x)

julia> setfpexceptions(fpinvs!, FE_DIVBYZERO, y, x2)
ERROR: DivideError: integer division error
Stacktrace:
 [1] /
   @ ./float.jl:412 [inlined]
 [2] fpinvs!(y::Vector{Float64}, x::Vector{Float64})
   @ Main ./REPL[7]:3
 [3] setfpexceptions(::Function, ::UInt8, ::Vector{Float64}, ::Vector{Float64})
   @ Main ./REPL[6]:4
 [4] top-level scope
   @ REPL[10]:1

julia> @benchmark fpinvs!($y, $x)
BenchmarkTools.Trial: 9566 samples with 194 evaluations.
 Range (min … max):  499.789 ns … 688.809 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.479 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   524.882 ns ±  22.429 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇     ▁▃     █       ▆▂     ▄ ▁                               ▂
  █▁▁▁▁▁██▃▄▁▄▃█▆▆▄▃▁▁▆███▆▆▆▄████▇▆▆▅▆▆▇▇▇▆▆▅▆▆▆▆▅▇▄▆▆▆▇▆▇▆▅▆▆ █
  500 ns        Histogram: log(frequency) by time        614 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark setfpexceptions(fpinvs!, FE_DIVBYZERO, $y, $x)
BenchmarkTools.Trial: 9170 samples with 189 evaluations.
 Range (min … max):  535.026 ns … 767.407 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     563.857 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   565.906 ns ±  23.874 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇      ▃     █▇     ▃▇▂     ▅▂                                ▂
  █▆▁▁▁▁▇█▆▃▃▁▃██▁▁▁▁▁███▇▆▄▅▃███▇▅▆▆▄▆██▆▆▆▆▇█▇▆▆▆▆▇▇▆▇▅▆▇▇▆▇▆ █
  535 ns        Histogram: log(frequency) by time        661 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

It’s mostly just the try/finally overhead in the setfpexceptions implementation above.

julia> function tryfinally(f, args...)
           try
               f(args...)
           finally
           end
       end
tryfinally (generic function with 1 method)

julia> @benchmark tryfinally(fpinvs!, $y, $x)
BenchmarkTools.Trial: 9247 samples with 190 evaluations.
 Range (min … max):  531.116 ns … 832.142 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     559.353 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   558.035 ns ±  25.422 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     ▁▄     ▇█     ▃▆▂     ▄▃                                ▂
  ██▁▁▁▃██▄▃▄▃▁██▁▁▄▁▅███▆▄▄▃▄████▇▇▅▆▆▇█▇▆▅▆█▇▇▇▇▇▇▆▇▇▇▆▆▇▆▇▆▆ █
  531 ns        Histogram: log(frequency) by time        655 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

The code SIMDs just fine:

julia> @code_llvm debuginfo = :none setfpexceptions(fpinvs!, FE_DIVBYZERO, y, x)
vector.body:                                      ; preds = %vector.body, %vector.ph.new
  %index = phi i64 [ 0, %vector.ph.new ], [ %index.next.3, %vector.body ]
  %niter = phi i64 [ 0, %vector.ph.new ], [ %niter.next.3, %vector.body ]
  %33 = getelementptr inbounds double, double* %arrayptr67, i64 %index
  %34 = bitcast double* %33 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %34, align 8
  %35 = fdiv <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %wide.load
  %36 = getelementptr inbounds double, double* %arrayptr2168, i64 %index
  %37 = bitcast double* %36 to <8 x double>*
  store <8 x double> %35, <8 x double>* %37, align 8
  %index.next = or i64 %index, 8
  %38 = getelementptr inbounds double, double* %arrayptr67, i64 %index.next
  %39 = bitcast double* %38 to <8 x double>*
  %wide.load.1 = load <8 x double>, <8 x double>* %39, align 8
  %40 = fdiv <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %wide.load.1
  %41 = getelementptr inbounds double, double* %arrayptr2168, i64 %index.next
  %42 = bitcast double* %41 to <8 x double>*
  store <8 x double> %40, <8 x double>* %42, align 8
  %index.next.1 = or i64 %index, 16
  %43 = getelementptr inbounds double, double* %arrayptr67, i64 %index.next.1
  %44 = bitcast double* %43 to <8 x double>*
  %wide.load.2 = load <8 x double>, <8 x double>* %44, align 8
  %45 = fdiv <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %wide.load.2
  %46 = getelementptr inbounds double, double* %arrayptr2168, i64 %index.next.1
  %47 = bitcast double* %46 to <8 x double>*
  store <8 x double> %45, <8 x double>* %47, align 8
  %index.next.2 = or i64 %index, 24
  %48 = getelementptr inbounds double, double* %arrayptr67, i64 %index.next.2
  %49 = bitcast double* %48 to <8 x double>*
  %wide.load.3 = load <8 x double>, <8 x double>* %49, align 8
  %50 = fdiv <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %wide.load.3
  %51 = getelementptr inbounds double, double* %arrayptr2168, i64 %index.next.2
  %52 = bitcast double* %51 to <8 x double>*
  store <8 x double> %50, <8 x double>* %52, align 8
  %index.next.3 = add nuw i64 %index, 32
  %niter.next.3 = add i64 %niter, 4
  %niter.ncmp.3 = icmp eq i64 %niter.next.3, %unroll_iter
  br i1 %niter.ncmp.3, label %middle.block.unr-lcssa, label %vector.body

middle.block.unr-lcssa:                           ; preds = %vector.body, %vector.ph
  %index.unr = phi i64 [ 0, %vector.ph ], [ %index.next.3, %vector.body ]
  %lcmp.mod.not = icmp eq i64 %xtraiter, 0
  br i1 %lcmp.mod.not, label %middle.block, label %vector.body.epil

vector.body.epil:                                 ; preds = %vector.body.epil, %middle.block.unr-lcssa
  %index.epil = phi i64 [ %index.next.epil, %vector.body.epil ], [ %index.unr, %middle.block.unr-lcssa ]
  %epil.iter = phi i64 [ %epil.iter.next, %vector.body.epil ], [ 0, %middle.block.unr-lcssa ]
  %53 = getelementptr inbounds double, double* %arrayptr67, i64 %index.epil
  %54 = bitcast double* %53 to <8 x double>*
  %wide.load.epil = load <8 x double>, <8 x double>* %54, align 8
  %55 = fdiv <8 x double> <double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00, double 1.000000e+00>, %wide.load.epil
  %56 = getelementptr inbounds double, double* %arrayptr2168, i64 %index.epil
  %57 = bitcast double* %56 to <8 x double>*
  store <8 x double> %55, <8 x double>* %57, align 8
  %index.next.epil = add nuw i64 %index.epil, 8
  %epil.iter.next = add i64 %epil.iter, 1
  %epil.iter.cmp.not = icmp eq i64 %epil.iter.next, %xtraiter
  br i1 %epil.iter.cmp.not, label %middle.block, label %vector.body.epil

Because compilers are ignorant of hardware traps, they vectorize as though the traps don’t exist.
This is unlike software checks and interrupts, across which compilers will not vectorize.

This does mean that they break semantics. That is, when you get an error, which results actually ended up getting stored depends on how much the compiler vectorized. So, with a vector width of 8:

julia> y .= NaN;

julia> x2[16] = 0.0;

julia> setfpexceptions(fpinvs!, FE_DIVBYZERO, y, x2)
ERROR: DivideError: integer division error
Stacktrace:
 [1] /
   @ ./float.jl:412 [inlined]
 [2] fpinvs!(y::Vector{Float64}, x::Vector{Float64})
   @ Main ./REPL[7]:3
 [3] setfpexceptions(::Function, ::UInt8, ::Vector{Float64}, ::Vector{Float64})
   @ Main ./REPL[6]:4
 [4] top-level scope
   @ REPL[19]:1

julia> y[1:16]'
1×16 adjoint(::Vector{Float64}) with eltype Float64:
 1.2321  5.48182  2.78002  1.43467  6.14824  8.84106  2.46354  2.39465  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Even though it was element 16 that was invalid, because the loop runs in batches of 8, all of elements 9-16 are NaN!
This might be surprising behavior (and is why compilers don’t vectorize when you have software exceptions).
In more complicated examples, someone could easily go down a rabbit hole under the assumption that x2[9] was the first bad element:

julia> x2[1:16]'
1×16 adjoint(::Vector{Float64}) with eltype Float64:
 0.811622  0.182421  0.35971  0.697022  0.162648  0.113109  0.40592  0.417598  0.186927  0.718609  0.870661  0.946411  0.718573  0.4715  0.319029  0.0

and tear their hair out when it looks fine on closer inspection, not realizing they need to validate the entire chunk that happened to land in the same SIMD vector.
So I get why this can be bad.
But I could live with it. :wink:

4 Likes

What’s with the exceptions and the global flags anyway? From my admittedly naive perspective, wouldn’t it be easier to specify values, or sets of values, for any pathological condition we need? There are millions of NaN values to spare.

The problem is that even if you specify what each operation does with them, you still need to specify what happens when you combine NaNs, and there is no good way to do that (especially if you want to preserve commutativity).
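
To make the commutativity problem concrete, here is a small illustration (hardware-dependent; the exact propagation rule varies). The bit patterns below are quiet NaNs carrying different payloads:

a = reinterpret(Float64, 0x7ff8000000000001)  # quiet NaN, payload 1
b = reinterpret(Float64, 0x7ff8000000000002)  # quiet NaN, payload 2
# On typical x86 hardware the sum keeps one operand's payload based on
# operand order, so a + b and b + a can differ bitwise:
reinterpret(UInt64, a + b), reinterpret(UInt64, b + a)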

1 Like

These ideas would be really useful, especially when you are trying to track down divide-by-zeros, and in particular divide-by-zeros in differentiated code, where it is often not one’s first thought to put divide-by-zero protection on the derivative of a function as well. Maybe as this becomes standard practice, you learn how to do this naturally as a programmer.
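
A hypothetical example of the failure mode meant here: sqrt itself is fine at zero, but its derivative divides by zero there, so an AD tool can return Inf unless the derivative rule is guarded (the guard value below is an ad-hoc, problem-dependent choice):

f(x) = sqrt(x)                 # fine at x == 0
df(x) = 1 / (2 * sqrt(x))      # analytic derivative: divides by zero at x == 0
df_guarded(x) = iszero(x) ? zero(x) : df(x)  # one ad-hoc protection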

I’m used to MATLAB and searching for NaNs and Infs in the output. Julia’s adherence to those “standards” is fine with me, because running fast and not blowing chunks on an embedded system is what I prefer. We need to write the code so that it does not execute paths that can divide by zero under those conditions, even in C and C++, or at least rescue it afterwards. The trick is getting it to not execute a divide by zero in a derivative, which is not always obvious; this way the automatic differentiation tools work consistently. Is this an unusual opinion? I always learn so much from these forums.

The globalness is the other issue here. What if some subroutine produces a NaN that’s fine, and is then handled? Now it throws before you can even get to the exception that you actually want. But maybe that’s fine; it’s just totally non-composable.
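
A concrete sketch of the problem, reusing setfpexceptions from above (safemean is a hypothetical helper; the 0.0/0 on empty input raises FE_INVALID):

# This helper produces a NaN on purpose (empty input) and handles it itself:
safemean(x) = (m = sum(x) / length(x); isnan(m) ? 0.0 : m)

safemean(Float64[])                               # 0.0: the NaN is handled locally
setfpexceptions(safemean, FE_INVALID, Float64[])  # should trap on the 0.0/0 before isnan ever runs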

4 Likes

The original idea for having so many NaNs was that you would encode an instruction pointer or other tag in the NaN payload, so that later when you get a NaN result you can say “Ick! Where did this come from?” and get an answer from the payload. Unfortunately the hardware hasn’t really committed to this approach so it really isn’t workable. What I would want to make this actually work are the following features/guarantees:

  • If some of the arguments to an instruction are NaN and the returned value is NaN then the payload of the returned NaN is the payload of one of the arguments. Ideally the one that is the “nonstandard NaN” if any.
  • Operations that can produce NaNs should take a “default” argument that specifies what value to return in the NaN case.
  • Compilers should provide APIs to fill in the default value with something useful like a NaN with a payload that’s a pointer to some line info.

But none of this is available, so NaN payloads are just a useless waste of space.
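
(For concreteness, the tagging machinery itself is easy to sketch in a few lines; box and payload below are hypothetical names, and it is the propagation guarantees above that are missing. A Float64 quiet NaN has 51 free payload bits:)

box(tag::UInt64)    = reinterpret(Float64, 0x7ff8000000000000 | (tag & 0x0007ffffffffffff))
payload(x::Float64) = reinterpret(UInt64, x) & 0x0007ffffffffffff

x = box(UInt64(42))
isnan(x)     # true
payload(x)   # 0x000000000000002a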

2 Likes

Yeah.
AVX512 supports exception control at the instruction level.
I’m not sure what all the options are, but you can at least do vdivpd zmm1 {k1}{z}, zmm2, zmm3{sae} to suppress all exceptions.

So at least in theory, SAE could be applied by default, except for select instructions where you want to be able to enable an exception.

1 Like

I seem to remember being at a talk where someone mentioned that they were using NaN boxing, and they kind of apologized for it as an ugly hack … but William Kahan was in the audience and spoke up to say that this was exactly the sort of thing they had intended NaN payloads for. See also a nice 2019 talk on NaN boxing for JavaScript.

4 Likes

That’s great! I looked into overflow and the other flags (for integers): they are set, and then potentially overwritten, i.e. you have to check right away, as I understood it. Are floating-point exceptions sticky? The Alpha had non-precise exceptions; otherwise they are a bitch for hardware implementors. [It only takes one good arch, e.g. x86, or Arm64 with good semantics.]

I’m thinking, if they are sticky for FP (or hypothetically could be, also for integers), could you just check them after loops? That would be OK in many cases, less so for calls to other functions, e.g. exp. You might have no idea where the overflow or divide by zero happened, but hopefully inlining to the rescue…

Yes, FP flags are sticky.
At least in C++, the overhead of checking them using cfenv is non-zero (or equivalently, in C with fenv.h).
There are non-inlined function calls to check the state, so it isn’t free.
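
For reference, a minimal Julia sketch of that check-after-the-loop pattern, calling the standard C99 fenv functions directly (sumrecip is a hypothetical name; FE_DIVBYZERO is the glibc/x86 value defined earlier in the thread):

function sumrecip(x)
    ccall(:feclearexcept, Cint, (Cint,), FE_DIVBYZERO)  # clear the sticky flag first
    s = 0.0
    for xi in x
        s += 1.0 / xi
    end
    # one (non-inlined) flag check after the loop rather than one per element
    if ccall(:fetestexcept, Cint, (Cint,), FE_DIVBYZERO) != 0
        @warn "a division by zero occurred somewhere in this loop"
    end
    return s
end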

Setting them to trap has no overhead when they do not actually trap.
Turning trapping on and off does, of course, have overhead, probably similar to checking if a flag is set. But it has been a while since I have benchmarked.

Also, as a correction regarding compilers assuming no trapping math: that is of course configurable, but no-trapping is generally the preferred setting, since otherwise they probably won’t vectorize much.