I’m wondering how people handle floating-point exceptions on the GPU, such as division by zero or overflow?
I’m not sure about other GPUs, but as far as I know NVIDIA GPUs have no hardware support for trapping IEEE 754 floating-point exceptions, so detection must be done in software.
CUDA.jl does allow for exceptions: for example, there is a `gputhrow` macro in `device/quirks.jl`, along with some predefined exceptions, such as one for a domain error.
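For concreteness, here is a minimal sketch (assuming CUDA.jl) of throwing at the point of error from device code; the kernel and variable names are illustrative, and in my experience CUDA.jl reports such exceptions on the host at the next synchronization rather than trapping in hardware:

```julia
using CUDA

# Hypothetical kernel: detect the bad operand in software and throw,
# instead of letting Inf/NaN propagate silently.
function checked_div_kernel!(out, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        b[i] == 0 && throw(DivideError())  # software check, not a hardware trap
        out[i] = a[i] / b[i]
    end
    return
end
```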
However, I believe that floating-point division-by-zero errors will simply propagate as `Inf`s and `NaN`s, which is the IEEE 754 default (non-trapping) behaviour.
These can, of course, be checked manually, and functions such as `isfinite`, `isnan`, `isinf`, and `issubnormal` also seem to work on the GPU.
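As a sketch of what that manual checking looks like (assuming CUDA.jl; the kernel name and the `flags` buffer are made up for illustration), one can record non-finite results into a device-side array instead of throwing:

```julia
using CUDA

# Hypothetical kernel: compute a division and flag any non-finite result.
function divide_kernel!(out, a, b, flags)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(out)
        r = a[i] / b[i]
        out[i] = r
        if !isfinite(r)          # catches both Inf and NaN
            @inbounds flags[i] = true
        end
    end
    return
end

n = 1024
a = CUDA.rand(Float32, n)
b = CUDA.rand(Float32, n)
b[1:4] .= 0f0                    # force a few divisions by zero
out   = CUDA.zeros(Float32, n)
flags = CUDA.zeros(Bool, n)

@cuda threads=256 blocks=cld(n, 256) divide_kernel!(out, a, b, flags)
nbad = count(Array(flags))       # how many non-finite results occurred
```

The host can then inspect `flags` cheaply after the kernel, which avoids the cost of per-element checks on the host side.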
But I’m wondering if there is a better solution?
I have seen FPChecker, which instruments kernels via an LLVM pass, and I’m wondering whether that could be readily incorporated into GPUCompiler.jl.
Macros seem to be the obvious Julia solution for this.
I have a toy macro that looks for assignment expressions and adds simple instrumentation, such as the line number and an error count, to any expression containing a division operator. One can choose either to accumulate error reasons and counts in global-memory variables or to throw at the point of the exception.
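To make the macro idea concrete, here is a toy sketch of the "throw at point of exception" variant (this is not my actual macro; all names are hypothetical, and it assumes `MacroTools.postwalk` for the expression walk). It rewrites each assignment whose right-hand side contains a `/` so that a non-finite result throws immediately:

```julia
using MacroTools: postwalk

# Does an expression tree contain a division call?
contains_div(ex) =
    ex isa Expr &&
    ((ex.head == :call && ex.args[1] == :/) || any(contains_div, ex.args))

# Toy instrumentation macro: after every assignment whose RHS divides,
# check the assigned value and throw if it is Inf or NaN.
macro check_divs(block)
    instrumented = postwalk(block) do node
        if node isa Expr && node.head == :(=) && contains_div(node.args[2])
            lhs = node.args[1]
            quote
                $node
                isfinite($lhs) ||
                    throw(DomainError($lhs, "non-finite division result"))
            end
        else
            node
        end
    end
    esc(instrumented)
end
```

A kernel body would then be wrapped as `@check_divs begin out[i] = a[i] / b[i] end`; the accumulate-in-global-memory variant would replace the `throw` with an atomic increment of a device-side counter instead.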
As I said at the beginning, I’m interested in hearing about other people’s experience with this topic. Thanks.