I’m wondering how people handle floating-point exceptions on the GPU, such as division by zero or overflow?
I’m not sure about other GPUs, but as far as I know NVIDIA GPUs have no hardware support for trapping IEEE 754 floating-point exceptions, so detection must be done in software.
CUDA.jl does allow for exceptions: for example, there is a `gputhrow` macro in `device/quirks.jl`, along with some predefined exceptions, such as one for a domain error.
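For concreteness, here is a minimal sketch (assuming CUDA.jl) of throwing at the point of error from device code; the kernel and variable names are illustrative, and in my experience CUDA.jl reports such exceptions on the host at the next synchronization rather than trapping in hardware:

```julia
using CUDA

# Hypothetical kernel: detect the bad operand in software and throw,
# instead of letting Inf/NaN propagate silently.
function checked_div_kernel!(out, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        b[i] == 0 && throw(DivideError())  # software check, not a hardware trap
        out[i] = a[i] / b[i]
    end
    return
end
```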
However, I believe that floating-point division-by-zero errors will simply propagate as `Inf`s and `NaN`s, which is the IEEE 754 default (non-trapping) behaviour.
These can, of course, be checked manually, and functions such as `isfinite`, `isnan`, `isinf`, and `issubnormal` also seem to work on the GPU.
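As a sketch of what that manual checking looks like (assuming CUDA.jl; the kernel name and the `flags` buffer are made up for illustration), one can record non-finite results into a device-side array instead of throwing:

```julia
using CUDA

# Hypothetical kernel: compute a division and flag any non-finite result.
function divide_kernel!(out, a, b, flags)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        r = a[i] / b[i]
        out[i] = r
        if !isfinite(r)          # catches both Inf and NaN
            @inbounds flags[i] = true
        end
    end
    return
end

n = 1024
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
b[1:4] .= 0f0                    # force a few divisions by zero
out   = CUDA.zeros(Float32, n)
flags = CUDA.zeros(Bool, n)

@cuda threads=256 blocks=cld(n, 256) divide_kernel!(out, a, b, flags)
nbad = count(Array(flags))       # how many non-finite results occurred
```

The host can then inspect `flags` cheaply after the kernel, which avoids the cost of per-element checks on the host side.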
But I’m wondering if there is a better solution?
I have seen FPChecker, which instruments kernels via an LLVM pass, and I’m wondering whether that could be readily incorporated into GPUCompiler.jl.
Macros seem to be the obvious Julia solution for this.
I have a toy macro that looks for assignment expressions and adds simple instrumentation, such as the line number and an error count, to any expression containing a division operator. One can choose either to accumulate error reasons and counts in global-memory variables or to throw at the point of the exception.
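To make the macro idea concrete, here is a toy sketch of the "throw at point of exception" variant (this is not my actual macro; all names are hypothetical, and it assumes `MacroTools.postwalk` for the expression walk). It rewrites each assignment whose right-hand side contains a `/` so that a non-finite result throws immediately:

```julia
using MacroTools: postwalk

# Does an expression tree contain a division call?
contains_div(ex) =
    ex isa Expr &&
    ((ex.head == :call && ex.args[1] == :/) || any(contains_div, ex.args))

# Toy instrumentation macro: after every assignment whose RHS divides,
# check the assigned value and throw if it is Inf or NaN.
macro check_divs(block)
    instrumented = postwalk(block) do node
        if node isa Expr && node.head == :(=) && contains_div(node.args[2])
            lhs = node.args[1]
            quote
                $node
                isfinite($lhs) ||
                    throw(DomainError($lhs, "non-finite division result"))
            end
        else
            node
        end
    end
    esc(instrumented)
end
```

A kernel body would then be wrapped as `@check_divs begin out[i] = a[i] / b[i] end`; the accumulate-in-global-memory variant would replace the `throw` with an atomic increment of a device-side counter instead.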
As I said at the beginning, I’m interested in hearing about other people’s experience with this topic. Thanks.