[RFC] What should the arithmetic within the FixedPointNumbers be

Option 5. promotion

| operation | Fixed | Normed |
|---|---|---|
| +, - | promote | promote |
| * | promote | promote |
| / | promote | promote |
| div, rem etc. | promote | promote |

In the extreme version of this proposal, all math operations immediately call float on their arguments before performing the operation. Thus N0f8 + N0f8 -> Float32, and similarly for the other operations.
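To make the difference concrete, here is a minimal sketch in plain Python rather than Julia. It does not use the FixedPointNumbers API: it only models an N0f8-like value as a raw byte `r` standing for `r/255`. The helper names (`n0f8_to_float`, `add_wrapping`, `add_promoting`) are made up for illustration, and Python floats are 64-bit rather than Float32.

```python
def n0f8_to_float(raw: int) -> float:
    """Interpret a raw byte (0..255) as the normalized value raw/255."""
    if not 0 <= raw <= 255:
        raise ValueError("raw value must fit in 8 bits")
    return raw / 255


def add_wrapping(a_raw: int, b_raw: int) -> int:
    """Add two raw bytes and stay in 8 bits, so the result wraps modulo 256."""
    return (a_raw + b_raw) & 0xFF


def add_promoting(a_raw: int, b_raw: int) -> float:
    """Convert both operands to float first, then add (the 'extreme' variant)."""
    return n0f8_to_float(a_raw) + n0f8_to_float(b_raw)


if __name__ == "__main__":
    a = b = 204  # 204/255 == 0.8
    print(n0f8_to_float(add_wrapping(a, b)))  # wrapped result, far from 1.6
    print(add_promoting(a, b))                # approximately 1.6
```

In this model, 0.8 + 0.8 only comes out near 1.6 once the operands are promoted. Staying in 8 bits gives the raw value 152, which reads back as roughly 0.596.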

A variant of this proposal is to promote to some other type, e.g., a BFloat16 or a widened fixed-point type (e.g., for + and -, a signed 16-bit type). For the latter to be practical, further operations on the widened type need to be closed (no more widening).
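The widened-type variant could look roughly like the following plain-Python model. The `Narrow` and `Wide` classes are hypothetical and not part of FixedPointNumbers. The sketch assumes the widened type keeps the same scale of 255 with a signed 16-bit raw value, and that the closed operations on it wrap on overflow. A real design might check or saturate instead.

```python
from dataclasses import dataclass

SCALE = 255  # assumed: the widened type keeps the 8-bit type's scale


def wrap_int16(x: int) -> int:
    """Wrap an integer into the signed 16-bit range, two's-complement style."""
    return ((x + 2**15) % 2**16) - 2**15


@dataclass(frozen=True)
class Wide:
    """Hypothetical widened type: signed 16-bit raw value, same scale."""
    raw: int

    def __add__(self, other: "Wide") -> "Wide":
        # Closed operation: Wide + Wide stays Wide (no further widening).
        return Wide(wrap_int16(self.raw + other.raw))

    def __sub__(self, other: "Wide") -> "Wide":
        return Wide(wrap_int16(self.raw - other.raw))

    @property
    def value(self) -> float:
        return self.raw / SCALE


@dataclass(frozen=True)
class Narrow:
    """Hypothetical 8-bit normalized type: raw byte r stands for r/255."""
    raw: int

    def __post_init__(self) -> None:
        if not 0 <= self.raw <= 255:
            raise ValueError("raw value must fit in 8 bits")

    def __add__(self, other: "Narrow") -> Wide:
        # Widening operation: the sum of two bytes always fits in 16 bits.
        return Wide(self.raw + other.raw)

    def __sub__(self, other: "Narrow") -> Wide:
        # Differences may be negative, hence the signed widened type.
        return Wide(self.raw - other.raw)


if __name__ == "__main__":
    s = Narrow(204) + Narrow(204)
    print(s, s.value)                    # Wide, about 1.6
    print((Narrow(10) - Narrow(20)).raw)  # negative raw value is representable
    print(type(s + s).__name__)          # still Wide
```

The point of the sketch is the type signatures: `Narrow op Narrow` widens once, and everything after that stays in `Wide`, so result types do not keep growing through a chain of operations.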

This proposal differs from the above in that all simple things “just work” at the cost of growing the representation size by a factor of 2 or 4. Complicated things (like taking the mean over thousands of images) will require special care to prevent overflow in all of these proposals.
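As a rough illustration of why large reductions still need care, the following plain-Python sketch sums many 8-bit raw values twice. One version uses an accumulator restricted to a signed 16-bit range. The other uses an unbounded accumulator. The function names are hypothetical and the 16-bit accumulator is only a stand-in for whichever widened type a proposal would pick.

```python
SCALE = 255  # raw byte r stands for the value r/255


def wrap_int16(x: int) -> int:
    """Wrap an integer into the signed 16-bit range."""
    return ((x + 2**15) % 2**16) - 2**15


def mean_int16_accumulator(raws: list[int]) -> float:
    """Mean computed with a 16-bit accumulator; wraps once the sum gets large."""
    acc = 0
    for r in raws:
        acc = wrap_int16(acc + r)
    return acc / len(raws) / SCALE


def mean_wide_accumulator(raws: list[int]) -> float:
    """Mean computed with an accumulator that cannot overflow (Python int)."""
    total = sum(raws)
    return total / (SCALE * len(raws))


if __name__ == "__main__":
    pixels = [204] * 1000  # one thousand samples, each equal to 0.8
    print(mean_int16_accumulator(pixels))  # wrong: the running sum wrapped
    print(mean_wide_accumulator(pixels))   # approximately 0.8
```

With a thousand samples of 0.8, a single widening step is already not enough. The accumulator has to be chosen with the length of the reduction in mind.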

This may or may not be on the table for 0.9, but it is certainly worth considering in the context of this discussion.
