Subtract Float32 number from Float64 number - what's the rule?

I am trying to understand what happens when I subtract a Float32 from a Float64 where the numbers are “very small”.

I understand what happens when both are Float64

julia> 1.401298464324817e-45 - 1.0e-45

but if the second number is Float32 I get zero

julia> 1.401298464324817e-45 - 1.0f-45

julia> bitstring(ans)

This surprised me. What is the general rule that leads to this result?

The general rule when doing arithmetic on numbers of different types is that they each get promoted to a common type. For Float32 and Float64 the former gets promoted to Float64 and the latter stays unchanged.

In other words, you get

julia> Float64(1.0f-45)

What happened? Actually 1.0f-45 is not 1.00000000000f-45 but rather

julia> big(1.0f-45)

So you end up with zero.


See also

In particular, 1.0f-45 is internally rounded to a “subnormal” Float32 value that effectively has only a single significant digit in binary, so it’s actually 2^{-149}

julia> 2^-149

(In single precision, it prints as 1.0f-45 because that’s the shortest decimal value that rounds to the same Float32 value. This is fundamental to how floating-point numbers are printed in decimal.)

If you try to make the number any smaller in single precision, it underflows to zero:

julia> 1.0f-45 / 2

If you want to use Float32 arithmetic with the full ≈7-digit precision, you need to stay between \approx 10^{-38} and \approx 10^{38}:

julia> floatmin(Float32), floatmax(Float32)
(1.1754944f-38, 3.4028235f38)

In fact, for performance reasons, you sometimes want to just treat subnormal values as zero:

julia> 3.0f-45 / 2 # yields subnormal

julia> set_zero_subnormals(true)

julia> 3.0f-45 / 2 # now underflows to zero

This should be doubly treated with caution, as that setting affects every floating point calculation on the same process, as far as I remember.

So changing this “for performance” is a double edged sword.


Setting underflows to zero is one of those things that “works great, except when it doesn’t”. I’ve never seen the performance benefit amount to much and @Sukera makes a very sound point about the risks.

I tried this for a while and abandoned it.


As a counter-example, I’ve seen order-of-magnitude slowdowns from subnormals in real applications involving wave simulations, where the solutions can be exponentially small in large fractions of the domain until the wave hits them. Other researchers have published similar results.

You should never disable subnormals in a library, for the reason @Sukera pointed out — the setting is global to the process, and libraries should be composable by design. However, it’s not so strange to do so in an application you control. It’s extremely rare in my experience for correctness of an application to fail when subnormals are flushed to zero, since if you are working with such small magnitudes then you are surely seeing lots of underflow effects already and your algorithm must be generally robust to underflow if it is working at all.


Of course, it’s not a black/white issue at all - but it’s also important to note that disabling subnormals is not a performance panacea. Whether computing them is actually slower or not strongly depends on how it’s implemented in hardware, and how many normal-subnormal interactions (which are usually what makes them slow) you end up with in your particular algorithm.

So disabling them can unfortunately only be done on a case-by-case basis.

I had a case last week when a colleague found that some of her neural networks were several times slower than other ones when running inference in ONNXRunTime on CPU. It turned out that the slow ones had subnormals among the weights. There the solution was trivial though, just flush those coefficients to zero and re-save the models.



Thank you @DNF and @stevengj for the explanation. Looking at the Julia documentation you referenced Treat-Subnormal-Numbers-as-Zeros I think I am safe to use set_zero_subnormals(true) in my code. My code has similar behaviour to the example in the documentation so I may get a performance improvement.


Subnormals could be documented better. They are, and I have the get_zero_subnormals() function in mind, which is in docs, just it not mentioned there. So I’m thinking is it very rare to actually use it?

I see in docs “but incur a performance penalty on some hardware.” Isn’t that on most or all hardware? Or do you have any counterexample? Or the opposite, is it known fast on most by now, x86, and ARM?

It’s actually global, but to the thread, not the process. Likely doesn’t matter if (only) set at start of program. Also documented (but only elsewhere in docs) for get_zero_subnormals():

This function only affects the current thread.

Your statement made me think though, is this done much, and then would you set once in your program, at the start, for the main thread (and well if using Julia with many threads, a non-default option, that I think the others would inherit automatically), and forget about it. I.e. it would apply to your program, including (Julia) libraries you call, and e.g. if you call to C or Python etc. (but not when calling Java, with JavaCall.jl or for MATLAB.jl, i.e. languages where interop is implemented in a different process, like they do, on same or different machine).

Since libraries should compose, would you assume most would work with subnormals off, with that non-default setting, at least for all in Julia standard library?

I tried to find how you would set the FPU environment (for this) from C or C++, but I wasn’t successful, only seeing how done with an x86 instruction. I suppose they and other languages have the capability too. Or this is an indication that this is not much done, or thought about with those languages, just assume what the hardware provides.

And some hardware, at least older ARM didn’t support subnormals, while I believe all recent does, is fully IEEE conformant, but still has the option of disabling.

The temptation to not support subnormals is there since it complicates the hardware. At least addition and multiply are rather straight-forward without this (and e.g. NaN) complications. I haven’t thought through mow much more complex it is to implement it in hardware, that is over just detecting subnormals and trapping (or flushing to zero). Which I believe is the norm. It can always be implemented by trapping in software (firmware the right word here?), and I think done since this is assumed to be very rare anyway.

Even if you have a counterexample where mul and/or add, or muladd support in hardware, and do not trap, I have low confidence that e.g. square root would. So for a stronger claim, do you think any hardware supports all operations in hardware and (as) fast with subnormals? If not I think we can change the docs.

[I recall in older ARM (then only integer) multiply took many cycles depending on the numbers involved. By now, most if not all, CPUs have float (and I believe integer) multiply single-cycle. Except for subnormals, I doubt any, nor any GPU, can do that.]

I see C++ has (and Julia doesn’t, nor needs):

Seems redundant with their isnormal, we have similar to that (well the opposite), and both of those are slightly changed in C++23 (just to make general? with constexpr).

That’s some very unusual code (in a loop) in the docs:

    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)

Is it just to illustrate the timing difference, I mean you would never do for performance or other reasons to alternate, as I said just set globally and forget. In fact, is the setting of the FPU default assembly instruction know to be fast? I’m just not sure it is, or would trust that, so a dubious example for the performance section?

In some applications, an alternative to zeroing subnormal numbers is to inject a tiny bit of noise. For example, instead of initializing a with zeros, initialize it with:

a = rand(Float32,1000) * 1.f-9

Is this trick much used? Also more likely for Float32/lower precision? Are subnormals less valuable for Float64, since higher precision to begin with?

@fastmath is still there in the language (but I know recently changed, not sure if the docs are outdated on that), but at least this is outdated (my PR to disable that option siletly was merged a long time ago):

Here, the option --math-mode=ieee disables the @fastmath macro, so that we can compare results.

I find an application of set_zero_subnormals to more than a small block of code to be spooky. Not nearly as terrifying as the reckless application of @fastmath, but still. I also don’t expect it to be a profound improvement for performance outside of pathological cases where subnormals form a significant fraction of your inputs/outputs.

Subnormals make up a tiny fraction of the representable numbers and lie at the very edge of representable space. For Float64, there are 616 orders of magnitude for normal numbers and 16 for subnormals (2.6% of the dynamic range, or 9% for Float32). They mostly emerge from the sum/difference of very small numbers (smaller than floatmin(T)/eps(T)) and products/quotients that almost-but-don’t-quite underflow to zero. If numbers are trending towards gradual underflow at some vaguely-consistent rate, they may pass through the subnormals but won’t spend long there. So, typically, you’ll only pay the higher cost a small fraction of the time. But in a scenario of gradual underflow (or overflow) you should also consider whether you can move the calculation to the log domain to not underflow at all. If a lot of values are landing in the subnormals, you can also consider re-scaling your problem so that they don’t. If you’re working in Float32 and are concerned about subnormals, consider moving to Float64 if feasible.

Though, if you have a model parameters that end up subnormal but that you don’t need to keep, by all means go ahead and zero them. At that point I’d probably just zero everything “very small” as well. I’ve never seen this happen in my work, but I’m sure it does for some people.

My complaints aside, the cost of calculations with subnormals can be staggering. Here I see a 15-100x performance difference on my hardware.

julia> using BenchmarkTools

julia> foo(x) = x*(1+x);

julia> @btime foo($1e0)
  2.300 ns (0 allocations: 0 bytes)

julia> @btime foo($1e-315)
  34.375 ns (0 allocations: 0 bytes)

julia> x = fill(1e0, 1000); xsn = fill(1e-315, 1000);

julia> @btime sum(foo,$x)
  79.339 ns (0 allocations: 0 bytes)

julia> @btime sum(foo,$xsn)
  8.100 μs (0 allocations: 0 bytes)
1 Like

Regarding the safety of disabling subnormals:

  1. floating-point code that’s carefully designed for achieving good accuracy, using error-free transformations for example, might behave worse than expected when subnormals aren’t available

  2. some “common sense” properties, such as {a \neq b} \implies {{a - b} \neq 0} (where a - b is floating-point subtraction) only hold when subnormals are available

So I don’t think that getting rid of gradual underflow (subnormals) is safe unless it’s done only while your own code is running, without calling external code.