[RFC] What should the default arithmetic in FixedPointNumbers be?

Hi all,

I am one of the committers of FixedPointNumbers.jl, and we are currently working on the next version, v0.9.0, to be released.
This is not directly about visualization, but I’m writing in the visualization category because the JuliaImages and JuliaPlots communities will be most affected by the decision on this issue.


Can you accept that FixedPoint computations suddenly slow down, or that existing code starts throwing errors?
Or would you prefer to ignore overflow and get faster speed?
Or would you prefer to maintain the status quo as much as possible?


As you know, in Julia the 8-bit N0f8 type defined in FixedPointNumbers is often used to represent 8-bit gray or 24-bit RGB color. Since types like N0f8 easily overflow in addition and subtraction, the overflow issue has long been discussed. @tim.holy summarized the proposals discussed over the past year.

In the meantime, I had been frustrated with the implementation of arithmetic operations in FixedPointNumbers. As an extreme example, multiplication of N0f8 numbers had room to be made 25× faster.

So, I decided to improve the implementation of arithmetic in the next version, v0.9.0. As part of that improvement, I also decided to support three kinds of arithmetic (i.e. wrapping, saturating and checked) as a measure to mitigate the overflow problem. The implementations are now almost complete (cf. https://github.com/JuliaMath/FixedPointNumbers.jl/issues/41#issuecomment-674682747).

Thus, with FixedPointNumbers#master, you can try the following code.

julia> using FixedPointNumbers, Base.Checked

julia> 0.8N0f8 + 0.8N0f8
0.596N0f8

julia> wrapping_add(0.8N0f8, 0.8N0f8)
0.596N0f8

julia> saturating_add(0.8N0f8, 0.8N0f8)
1.0N0f8

julia> checked_add(0.8N0f8, 0.8N0f8)
ERROR: OverflowError: 0.8N0f8 + 0.8N0f8 overflowed for type N0f8

Of course, it’s annoying for end-users or developers of high-level packages to specify the arithmetic for every operation. Therefore, I believe we need some kind of tooling, and one such tool is CheckedArithmetic.jl, made by @tim.holy.

julia> using CheckedArithmetic

julia> @checked 0.8N0f8 + 0.8N0f8 # `@checked` replaces `+` with `checked_add`
ERROR: OverflowError: 0.8N0f8 + 0.8N0f8 overflowed for type N0f8

Although there are still some issues with using CheckedArithmetic together with FixedPointNumbers, I’ll be working on improving CheckedArithmetic as well.

These new arithmetic implementations do not, by themselves, solve the overflow problems, and we still need to discuss those. However, I believe that releasing the new arithmetic implementations is a step forward.


In the run-up to the release, we have to make at least two decisions.

API definitions (or release cycle)

One is which module should define the interfaces of wrapping_* and saturating_*, and whether FixedPointNumbers should export them.

This may not be an important decision for users. However, I’m thinking of putting those interface definitions in CheckedArithmetic (CheckedArithmeticCore). In that case, the release of v0.9.0 will take a little longer. If people want to use the latest version sooner, we’ll release the functions as private APIs first and then change the owner of the APIs after v0.9.0 is released. In any case, for v0.9.0 to be actually usable it has to be supported by downstream packages such as ColorTypes.jl, and that will still take some time.

Default arithmetic

The other thing, and this is an important decision, is what the default arithmetic should be, e.g. what the result of 0.8N0f8 + 0.8N0f8 should be.

I’m not going to vote on this, but I will describe the options individually to make it easier to reply.


Option 1. (mostly) status quo

| operation | `Fixed` | `Normed` |
|---|---|---|
| `+`, `-` | wrapping | wrapping |
| `*` | wrapping | checked |
| `/` | checked | checked |
| `div`, `rem`, etc. | checked | checked |

The division-related operations are “checked” in the sense that errors can be thrown, as was already the case before v0.8. However, the conditions and the types of the errors will change. (I’m only describing the differences from v0.8 here; the specifications of checked_div etc. are the same across all the options.)

Option 2. consistent with Integer

| operation | `Fixed` | `Normed` |
|---|---|---|
| `+`, `-` | wrapping | wrapping |
| `*` | wrapping | wrapping |
| `/` | checked | checked |
| `div`, `rem`, etc. | checked | checked |

The difference between this and Option 1 is * for Normed. This option improves the consistency between Fixed and Normed, and the consistency with Julia’s Integer system.
(Edit: / on Integers returns a float rather than an Integer, so / does not affect this consistency.)

This leads to overlooking overflow in the multiplication of, for example, N2f14 numbers. However, multiplication of N0f8 or N0f16 values, the types most often used in image processing, cannot overflow in the first place.
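To see why N0f8 multiplication cannot overflow, it helps to look at the raw representation. Here is a minimal sketch, deliberately independent of FixedPointNumbers; `mul_n0f8` is a hypothetical helper and the round-to-nearest rescaling is an assumption, not the package’s actual implementation:

```julia
# N0f8 stores x ∈ [0, 1] as a raw UInt8 i with x == i/255.
# The product of two values in [0, 1] is again in [0, 1], so widening the
# raw multiply to UInt16 and rescaling by 255 can never overflow the UInt8 result.
mul_n0f8(a::UInt8, b::UInt8) = UInt8(div(widemul(a, b) + UInt16(127), UInt16(255)))

mul_n0f8(0xcc, 0xcc)  # 0.8 * 0.8: raw 0xa3, i.e. 163/255 ≈ 0.639
```

In contrast, a type like N2f14 covers [0, 4), so a product such as 3.0 × 3.0 = 9.0 is out of range and has to wrap, saturate, or throw.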


Option 3. check everything

| operation | `Fixed` | `Normed` |
|---|---|---|
| `+`, `-` | checked | checked |
| `*` | checked | checked |
| `/` | checked | checked |
| `div`, `rem`, etc. | checked | checked |

Like Rust, all operations are checked by default. This guarantees safety with respect to overflow, but it causes a non-negligible slowdown in addition and subtraction. It may also require end-users or package developers to manually verify overflow safety and replace operations with wrapping_* by hand.
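For reference, here is roughly where the slowdown comes from: a checked addition on the raw bytes must test an overflow flag and branch on every single operation. This is a hedged sketch on the raw UInt8 representation (`checked_add_raw` is a hypothetical helper, not the actual FixedPointNumbers implementation):

```julia
using Base.Checked: add_with_overflow

# Checked addition on the raw UInt8 representation of N0f8: every call pays
# for an overflow-flag test plus a potential branch to a throw.
function checked_add_raw(a::UInt8, b::UInt8)
    s, overflowed = add_with_overflow(a, b)
    overflowed && throw(OverflowError("N0f8 addition overflowed"))
    return s
end

checked_add_raw(0x10, 0x20)    # fine: 0x30
# checked_add_raw(0xcc, 0xcc)  # would throw OverflowError (0.8 + 0.8 > 1)
```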

This option is annoying if you value interactivity in rapid development over safety.

Option 4. at your own risk

| operation | `Fixed` | `Normed` |
|---|---|---|
| `+`, `-` | wrapping | wrapping |
| `*` | wrapping | wrapping |
| `/` | wrapping | wrapping |
| `div`, `rem`, etc. | wrapping | wrapping |

This is the easiest option for speed. However, if speed matters to you in the first place, you should avoid FixedPoint division anyway; on modern x86 machines, floating-point division is not that slow.

Therefore, the advantage of this option is the consistency of its policy, and I don’t think it is practically different from Option 2.

I think the final decision will be one of these four options, though other configurations are of course possible.

If the absence of “saturating” bothers you, please remember that this is Julia’s forum, not Matlab. :stuck_out_tongue_closed_eyes:

Option 5. promotion

| operation | `Fixed` | `Normed` |
|---|---|---|
| `+`, `-` | promote | promote |
| `*` | promote | promote |
| `/` | promote | promote |
| `div`, `rem`, etc. | promote | promote |

In the extreme version of this proposal, all math operations immediately call float on the values before performing the operation. Thus N0f8 + N0f8 -> Float32, and similarly for the other operations.
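As a toy illustration of this extreme variant, working on the raw UInt8 representation of N0f8 (`promoting_add` is hypothetical; the real proposal would apply to every operator):

```julia
# Promote-to-float addition: convert both operands to Float32 first, so the
# result can exceed 1.0 without wrapping, saturating, or throwing.
promoting_add(a::UInt8, b::UInt8) = Float32(a) / 255f0 + Float32(b) / 255f0

promoting_add(0xcc, 0xcc)  # 0.8 + 0.8 = 1.6f0: no overflow, but a wider result type
```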

A variant of this proposal is to promote to some other type, e.g., a BFloat16 or a widened fixed-point type (e.g., for + and - a signed 16-bit type). For the latter to be practical, further operations on the widened type need to be closed (no more widening).

This proposal differs from the above in that all simple things “just work” at the cost of growing the representation size by a factor of 2 or 4. Complicated things (like taking the mean over thousands of images) will require special care to prevent overflow in all of these proposals.

This may or may not be on the table for v0.9, but it is certainly worth considering in the context of this discussion.


I think this is a really bad default. The basic reason is that this implementation makes other choices much harder. If you want widening, any of the other proposals lets users opt in explicitly. Furthermore, promoting N0f64 to a float would either lose 12 bits of precision or require widening to a big float.

In other words, fixed-point numbers (on general-purpose CPUs) are best viewed as a storage format, not a computation format.


Yes, and the overhead of converting to and from fixed-point is significant enough to make it worth discouraging fixed-point as anything but a storage format. The Images.jl ecosystem is moving towards reliance on LoopVectorization for kernels, and it’d make life easier if most data were already in native float formats amenable to SIMD operations.

using Images, BenchmarkTools

v1 = rand(Gray{N0f8}, 512, 512);
v2 = rand(Gray{Float32}, 512, 512);

@btime mean($v1) # 48.395 μs
@btime mean($v2) # 24.966 μs

Something that would also help us make this decision is user stories. Do you use FixedPointNumbers outside of the context of colors or images? If so, do any of the high-precision types (like N0f64 mentioned above) come up in practice? Do you ever do math on these numbers? What aspects matter to you?

It’s essential to have mathematical and logical consistency, but that alone does not seem to be enough to guide this decision—all 5 of the above choices satisfy consistency. It could be fun to create a toy that can do all kinds of great demos, but my personal inclination is to bias this choice in favor of the real work that people use these to accomplish. Hence user stories will allow us to be sure we’re taking your use-case into consideration.


Not quite so simple: on Julia 1.5 and 1.6 I get the same time from both of those, and both convert to Gray{Float32} as they go. To prevent mean from doing something sneaky, this may be better:

julia> function naivesum(A)
           s = zero(eltype(A))
           @inbounds @simd for a in A
               s += a
           end
           return s
       end
naivesum (generic function with 1 method)

julia> @btime naivesum($v1)
  3.035 μs (0 allocations: 0 bytes)

julia> @btime naivesum($v2)
  16.653 μs (0 allocations: 0 bytes)

As you can see, there’s actually a cost to promoting to Float32. Of course, some kind of promotion is necessary to get the right answer here.

EDIT: you can compare various strategies with

julia> function naivesum(::Type{T}, A) where T
           s = zero(T)
           @inbounds @simd for a in A
               s += a
           end
           return s
       end
naivesum (generic function with 2 methods)

julia> @btime naivesum(Gray{N24f8}, $v1)
  8.226 μs (0 allocations: 0 bytes)

julia> @btime naivesum(Gray{Float32}, $v1)
  13.343 μs (0 allocations: 0 bytes)

which is one of the reasons I’ve wondered about promoting to a wider FPN type.


N0f64 (or N1f63) could be useful for special functions if they were accurate enough. They have enough bits to make 0.5 ulp implementations viable. Currently, though, they are too slow to be useful. I’m not sure whether that’s inherent.


I’d be interested in your tests on this, and in hearing some actual numbers more specific than “too slow.” Make sure you try the master branch of FPN, as @kimikage has been doing wonders for the performance of FPN lately.

BTW, the wide-width Normed types (e.g. N24f8, N56f8) are also useful as accumulator types in image processing. When computing the average value of pixels, not converting them to float gives better results in terms of both accuracy and speed.
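A sketch of why a wide fixed-point accumulator helps, working directly on raw N0f8 bytes (the names here are illustrative; the key point is that the rescale by 255 happens only once, at the end):

```julia
# Summing N0f8 pixels via their raw UInt8 values into a wide integer is exact;
# the division by 255 and the conversion to float happen only once, at the end.
pixels = rand(UInt8, 512 * 512)             # raw data of a 512×512 N0f8 image
acc = sum(UInt64, pixels)                   # exact wide-integer accumulation
mean_val = acc / (255 * length(pixels))     # single conversion to float
```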


Opting in is always possible with the right types. You can’t live in a world where N0f8 + N0f8 -> N8f8, N8f8 + N0f8 -> N24f8, and so on: you can widen once, but then you have to stop. Hence one of the proposals above needs WN0f8, meaning "already-widened N0f8". Presto: anyone who wants to opt out of widening can just specify a result type of WN0f8.
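A toy model of the widen-once rule, with Int16 standing in for the hypothetical WN0f8 (`widening_add` and the choice of Int16 are illustrative, not part of any package):

```julia
# Widening happens exactly once: narrow + narrow widens, and the widened type
# is closed under further addition, so repeated adds cannot widen without bound.
widening_add(a::UInt8, b::UInt8) = Int16(a) + Int16(b)   # widen on the first add
widening_add(a::Int16, b::Int16) = a + b                 # closed: no more widening
widening_add(a::Int16, b::UInt8) = a + Int16(b)

s = widening_add(widening_add(0xff, 0xff), 0xff)  # still an Int16, value 765
```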


Oh, I’ll check it out! The operations I care most about are fma and float conversion.


Hopefully this discussion has clarified that these options are not entirely exclusive: for Option 5, unless we ban arithmetic altogether, we also need one of the other options for the already-widened types. But we still have to choose which one.

It is also important to note that the operations on N0f8 and Gray{N0f8} can be defined independently. Of course, being able to handle them equivalently is a great advantage.

For RGB{N0f8}, the cost of checked_add etc. can be reduced slightly. (cf. https://github.com/kimikage/CheckedColorArithmetic.jl/blob/46c9615371724acb0493d886b703f224e18357b5/src/CheckedColorArithmetic.jl#L24-L28)

Just FYI, some Fixed (not Normed) types are natively supported by some ARM CPUs. Also, Intel has started supporting BFloat16 with AVX512_BF16. What counts as a storage format versus a computation format is not “fixed”. :smile: