Option 5. promotion
operation | Fixed | Normed |
---|---|---|
`+`, `-` | promote | promote |
`*` | promote | promote |
`/` | promote | promote |
`div`, `rem`, etc. | promote | promote |
In the extreme version of this proposal, all math operations immediately call `float` on the values before performing the operation. Thus `N0f8 + N0f8 -> Float32`, and similarly for the other operations.
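As a rough illustration of the "promote everything" rule, here is a toy Python model (not the package's actual code). The `N0f8` class below is a hypothetical stand-in that stores a raw byte `i` and represents `i / 255`; Python's `float` is 64-bit, so it only stands in for the `Float32` result discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class N0f8:
    """Toy stand-in for an 8-bit normalized value: raw byte i represents i / 255."""
    raw: int  # expected to be in 0..255

    def __float__(self) -> float:
        return self.raw / 255

    # Every arithmetic operation converts both operands to float first,
    # so the result is always a float and can leave the range [0, 1].
    def __add__(self, other: "N0f8") -> float:
        return float(self) + float(other)

    def __sub__(self, other: "N0f8") -> float:
        return float(self) - float(other)

    def __mul__(self, other: "N0f8") -> float:
        return float(self) * float(other)

    def __truediv__(self, other: "N0f8") -> float:
        return float(self) / float(other)


x = N0f8(200)
y = N0f8(100)
total = x + y       # a float greater than 1; nothing wraps or saturates
difference = y - x  # a negative float
```

The point of the sketch is only that sums above 1 and negative differences come out as ordinary floating-point values, at the cost of a wider result type.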
A variant of this proposal is to promote to some other type, e.g., a `BFloat16` or a widened fixed-point type (e.g., for `+` and `-`, a signed 16-bit type). For the latter to be practical, further operations on the widened type need to be closed (no more widening).
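To make the widened fixed-point variant concrete, here is a minimal Python sketch working on raw integers. Everything in it is an assumption for illustration: the helper names are hypothetical, the widened type is taken to keep the same scale as the 8-bit type (raw 255 represents 1.0), and overflow of the 16-bit type is assumed to wrap around, which the proposal itself does not specify.

```python
SCALE = 255  # assumed: raw 255 represents 1.0, same scale as the 8-bit type


def wrap_int16(n: int) -> int:
    """Reduce n to the signed 16-bit range [-32768, 32767] by wraparound."""
    return (n + 32768) % 65536 - 32768


def widen(raw8: int) -> int:
    """Lift an 8-bit raw value (0..255) into the widened raw domain, value unchanged."""
    if not 0 <= raw8 <= 255:
        raise ValueError("raw8 must be in 0..255")
    return raw8


def wide_add(a: int, b: int) -> int:
    """Add two widened raws; the result stays 16-bit (closed, no further widening)."""
    return wrap_int16(a + b)


def wide_sub(a: int, b: int) -> int:
    """Subtract two widened raws; the result stays 16-bit."""
    return wrap_int16(a - b)


def to_float(raw16: int) -> float:
    """Interpret a widened raw value as a real number."""
    return raw16 / SCALE


s = wide_add(widen(200), widen(100))  # 300, above 1.0 but representable
d = wide_sub(widen(10), widen(20))    # -10, negative but representable
overflowed = wide_add(32767, 1)       # closed arithmetic can still overflow
```

Because the widened type is closed, a single `+` or `-` on 8-bit inputs can never overflow, but chains of operations on widened values can.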
This proposal differs from the above in that all simple things “just work” at the cost of growing the representation size by a factor of 2 or 4. Complicated things (like taking the mean over thousands of images) will require special care to prevent overflow in all of these proposals.
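The mean-over-many-images caveat can be illustrated with a small Python sketch. It assumes a hypothetical signed 16-bit accumulator with wraparound and raw 8-bit pixel values where 255 represents 1.0; the function names are made up for this example.

```python
def wrap_int16(n: int) -> int:
    """Reduce n to the signed 16-bit range [-32768, 32767] by wraparound."""
    return (n + 32768) % 65536 - 32768


def mean_with_int16_accumulator(raws: list[int]) -> float:
    """Sum raw pixel values in a 16-bit accumulator, then normalize."""
    acc = 0
    for r in raws:
        acc = wrap_int16(acc + r)  # wraps once the running sum exceeds 32767
    return acc / len(raws) / 255


def mean_with_wide_accumulator(raws: list[int]) -> float:
    """Sum in an accumulator wide enough for the whole input, then normalize."""
    return sum(raws) / len(raws) / 255


# The same fully saturated pixel taken from 1000 images.
pixels = [255] * 1000
bad_mean = mean_with_int16_accumulator(pixels)
good_mean = mean_with_wide_accumulator(pixels)
```

With these assumptions the 16-bit accumulator wraps after roughly 128 saturated pixels (32767 / 255), so the first mean is meaningless, while the wide accumulator gives the expected value of 1.0.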
This may or may not be on the table for 0.9, but it is certainly worth considering in the context of this discussion.