Posit Standard (2022): Floats/IEEE vs Posits, and ternary math

The spec is only 12 pages, much shorter than the IEEE spec (which it references), and the similarities and differences with IEEE floats are intriguing:

6.5 Conversions between posit format and IEEE Std 754™ float format
[…] NaR converts to quiet NaN […]

v0.5 follows the 2022 Standard for Posit Arithmetic but drops quire support. v0.4 implements the previous standard and has quire support, but depends on the C implementation of SoftPosit.

The standard describes posit8 as the smallest “example”, but I believe you can (and may want to) go smaller. At least I think that size may be very valuable (e.g. posit4, maybe unsigned?, would only need a 64-bit quire [for its operations]), and better than Float8 (available in a package), and than [b]float16 in some cases.

@Oscar_Smith @stevengj What caught my eye:

5.7 Functions of three posit value arguments
fMM(posit1, posit2, posit3) returns posit1 × posit2 × posit3, rounded.

[unlike IEEE, not defined there unless I missed something]

and its footnote:

Because multiplication is commutative and associative, any permutation of the inputs will return the same rounded result.

[And note, there’s no FMA taking three posit inputs, analogous to IEEE’s FMA on three floats. But there is an “FMA” where one of the three inputs comes from the quire, so I think it amounts to the same.]

Should Julia support fMM (and posits) at some point? I’m trying to think why it is defined; does anyone know, or want to guess? I suppose it can be an important operation, and while rounded, it must have something to do with the quire, which I think is a type of fixed-point accumulator with guard bits to prevent overflow:

quire value A real number representable using a quire format described in this standard, or NaR.
quire sum limit The minimum number of additions of posit values that can overflow the quire format.

Note, this footnote:

The product of two posit values in a format of precision 𝑛 is always exactly expressible in a posit format of precision 2𝑛, but the quire format obviates such temporary doubling of precision when computing sums and differences of products. Sums of posit values using the quire are guaranteed exact up to 2^(23+4𝑛) terms, per the quire sum limit

which is a footnote to:

Quire format can represent the exact dot product of two posit vectors having at most 2^31 (approximately two billion) terms without the possibility of rounding or overflow.

Note, the minimum “quire format precision” is 16𝑛 bits (changed from nbits²/2 in the draft standard), e.g. 128 bits for posit8 (no longer 32 bits), up to 512 bits for posit32; and no longer 2048 bits for posit64 (it would be 1024, but posit64 has been dropped as an example since the draft, maybe not a very useful or needed format?).
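For what it’s worth, a back-of-the-envelope check of those figures (my own derivation, not quoted from the standard): with es = 2, maxpos of an n-bit posit is 2^(4n−8) and minpos is 2^−(4n−8), so holding any single product exactly takes (8n−16) integer bits, (8n−16) fraction bits and a sign, i.e. 16n−31 bits. A 16n-bit quire therefore has 31 carry-guard bits left over; its largest value is about 2^(16n−1)·2^(16−8n) = 2^(8n+15), which is 2^31 times maxpos² (the 2^31 exact dot-product terms above) and 2^(4n+23) times maxpos (the quire sum limit).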

fused rounded only after an exact evaluation of an expression involving more than one operation.

5.11 Functions involving quire value arguments
With the exception of qToP which returns a posit value result, these functions return a quire value result.13
[…]
qSubQ(quire1, quire2) returns quire1 − quire2.
qMulAdd(quire, posit1, posit2) returns quire + (posit1 × posit2).
qMulSub(quire, posit1, posit2) returns quire − (posit1 × posit2).
qToP(quire) returns quire rounded to posit format per Section 4.1.

Other functions of quire values may be provided, but are not required for compliance. They may be required in a future revision of this standard.

The discussion in footnote 13 (e.g. on complex values), and elsewhere, is intriguing.

NaR Not a real. Umbrella value for anything not mathematically definable as a unique real number.

Intriguingly unlike:

julia> NaN == NaN
false

A test of equality between NaR values returns True.

exception A special case in the interpretation of representations in posit format: 0 or NaR.
Posit representation exceptions do not imply a need for status flags or heavyweight operating system or language runtime handling.

regime The power-of-16 scaling determined by the regime bits. It is a signed integer.

next(posit) returns the posit value of the lexicographic successor of posit’s representation.8
prior(posit) returns the posit value of the lexicographic predecessor of posit’s representation.9

the footnote is a bit scary (depending on what “if necessary” means here):

wrapping around, if necessary
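If I read it right, posit bit patterns are ordered monotonically as integers, so next/prior should amount to adding or subtracting one on the raw byte, with the wraparound happening at the ends of that ordering. A sketch (my own, assuming SoftPosit.jl’s Posit8 is an 8-bit primitive type; next_posit/prior_posit are made-up names):

using SoftPosit

# Lexicographic successor/predecessor of the 8-bit representation; UInt8 arithmetic
# wraps on overflow, which is where the "wrapping around, if necessary" shows up
# (e.g. the successor of the all-ones pattern wraps back to the zero pattern).
next_posit(p::Posit8)  = reinterpret(Posit8, reinterpret(UInt8, p) + 0x01)
prior_posit(p::Posit8) = reinterpret(Posit8, reinterpret(UInt8, p) - 0x01)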

6.2 Conversions involving quire values
A posit compliant system only needs to support rounding from quire to posit values and conversion of posit to quire values in the matching precision, per Section 5.11.

2 Likes

IMO, the Posit standard is mostly really good. The removal of Inf, -0.0, and all but one of the NaN values is a really good change. fMM is somewhat interesting but, IMO, not very useful. Quires are a great idea for low precision, but I think the lack of fma is very sub-optimal. It makes some sense, since posits are bad for extended precision (double-double) math as they don’t have some of the nice properties, but fma would still be really useful for things like polynomial evaluation. Quires are decent, but they’re big enough that I doubt we’ll ever see a chip with vectorized quire support. It would just take too much register space.

In terms of Julia support, there’s not much to do until processors start supporting Posit, which still seems a ways off.

4 Likes

I just added a link to the (pure) Julia implementation. Note the standard has FMA in some form, via the quire. SoftPosit.jl dropped the quire however… but it’s available in the older 0.4.

Why not? I mean, there must be a reason it’s mandated in the standard; something I’m missing. I can see a power of three (or higher) being useful for [Taylor] polynomials, but that’s binary, not ternary. p × p × p is more general; it must pop up in some important equation I’m not aware of, and/or the rounding of it matters.

Nvidia GPUs already have some accumulator. @Elrod I think vectorized (in software) 4- or 8-bit posits might be helpful… I suppose you need the largest quire of your posits if you mix types, i.e. then for the lower-precision ones too. I think in hardware the registers would only have the sizes you want, e.g. 32 bits for posit32, and only the FPU (PPU? PAU?) would need a quire, maybe one each. I’m not so sure 512 bits is too bad (per FPU) in hardware, but yes, it’s slower in software, though maybe not too bad? But you’re competing with Float32/64 (is double-double that much used?), even [b]float16, so I think to compete in software you would want to go lower, to posit8 or less (possibly times two for intervals).

2-bit to 32-bit with two exponent bits (posit_2_t) → Not fast : Using 32-bits in the background to store all sizes.

@milankl 2-bit posits, I see, are likely overkill… more a proof-of-concept, and the posit4 I mentioned might not be too useful either. I was just thinking the software implementation would be really fast, since 4-bit posit × 4-bit posit (or any other binary operation) only needs a 256-byte lookup table per operation, and a quire fitting in a 64-bit register. There’s though no need to avoid 128-bit ints for the next step, posit8, which just requires 2 registers and a carry.
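As a concrete sketch of the lookup-table idea (mine; assuming SoftPosit.jl’s Posit8 is an 8-bit primitive type with * defined; mul_lut is a made-up name): for a hypothetical posit4 the same pattern shrinks to the 256-byte table mentioned, while for Posit8 it is a 64 KiB table.

using SoftPosit

# Precompute the rounded product for every pair of bit patterns once; afterwards each
# multiplication is a single table lookup, with no decoding of regime/exponent/fraction.
const P8_MUL = [reinterpret(UInt8, reinterpret(Posit8, UInt8(i)) * reinterpret(Posit8, UInt8(j)))
                for i in 0:255, j in 0:255]

mul_lut(a::Posit8, b::Posit8) =
    reinterpret(Posit8, P8_MUL[reinterpret(UInt8, a) + 1, reinterpret(UInt8, b) + 1])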

I’m aware of your (other) work, e.g. (wrapper):

202x compression of floats [arrays of them] is great; I didn’t look into it too carefully, but I suppose you need to decompress in chunks, not one by one, so you can’t calculate on it directly. It’s unclear if this defeats the purpose of posits, or whether something similar would just be built for them (it’s on posit32, so possibly half the compression… down to posit8, 25x?).

Number Formats, Error Mitigation, and Scope for 16-Bit Arithmetics in Weather and Climate Modeling Analyzed With a Shallow Water Model https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020MS002246

2.1.3 The Posit Number Format

Posit numbers arise from a projection of the real axis onto a circle (Figure 1), with only one bit pattern for zero and one for Not-a-Real (NaR, also called complex infinity)

I’m not sure if complex numbers are used in your (climate modeling) application(s); I don’t see that (as opposed to “more complex models”). But complex numbers built from posits (in rectangular/Euclidean form) are of course possible. I think, though, a polar form might be valuable, with the angle just in plain fixed point; 4 bits are likely not useful for it, but 8 might often be enough (and a better use of bits than posit8 or Float8), so an 8-bit posit plus an 8-bit fixed-point angle?

Just to clarify: For SoftPosit.jl v0.5 I wanted to drop the dependency on SoftPosit-C as it’s barely maintained and I’d rather have SoftPosit.jl go its own way, which turned out to be faster and hopefully in the future also quicker to add new features to. However, I didn’t have the time to implement quires, nor did I ever really end up using them. If anyone wants to implement quires, I’m happy to help, but I don’t want to take the lead.

Similar for other off-standard posit types. Sure, Posit2, 4, 6 are all possible. As far as I know, research that uses Float8 often redefines the NaNs, as they are a massive waste of bit patterns at such low precision, and/or drops subnormals for a simpler implementation. So that clearly points to a better use case for posits at such low precision. But again, the hardware question remains, unless someone implements posit arithmetic on an FPGA. Same here: for precision tests one could implement Posit2 or 4 (Posit8 already exists), but I personally don’t have any application for that, meaning I’m happy to help but I won’t do it myself.

4 Likes

Are they (or other larger-than-double formats) needed? Nobody has measured anything that accurately.

I noticed the quire changed in the standard, to linear in size from quadratic in the draft (the break-even point is 512 bits, for posit32). In fact posit64 is no longer mentioned, as maybe not important?

Double has “53 bits (52 explicitly stored)” of fraction length; posit32 was previously described with 28 max significand bits, now “0 to 27 bits” of fraction length; and posit8 previously had 8 max, now “0 to 3 bits”.

I see in the draft:

Note: If hardware supports posit multiplication, addition, subtraction, and the quire, all remaining
functionality can be supported with software.

That exact phrase is dropped (but presumably still true?!); now the only thing I find when searching for “software” is:

A posit compliant system may be realized using software or hardware or any combination.

Complex numbers arise in climate modelling mostly through Fourier and other related transforms (spherical harmonic transforms, for example). Meaning that while input and output fields are real, one may choose a numerical method that uses complex numbers in between. I thought about this idea of using posits for complex numbers directly, given that they arise from a projection onto the circle, but never got very far.

Thinking out loud: Instead of representing a complex number z = x + i*y as a sum of a real and an imaginary part, which can both be represented as float/posit/whatever, one can think about z = m*exp(phi*i) with a magnitude m and an angle phi. If we were to represent z with posits in a magnitude-angle sense, then one first needs an unsigned posit for m (a bit annoying to change the posit definition, but let’s ignore that) and probably wants phi = reinterpret(Unsigned, p) with p a posit number to represent the angle. This is because posits are equi-angle distributed around the circle. Multiplication for two such zs is easy, z1*z2 = m1*m2*exp((phi1+phi2)*i), so you’d just multiply the magnitudes with posit arithmetic and add the angles with integer arithmetic. However, what to do with addition (and subtraction)?

z1 + z2 = m1*cos(phi1) + m2*cos(phi2) + i*(m1*sin(phi1) + m2*sin(phi2))

So we’d need a handy shortcut to approximate/calculate something like m*cos(phi), which I never got my head around how to do. John’s notebook pdf has a chapter on how posits can approximate the sigmoid function, but if anyone has an idea how to approximate cos/sin please let me know. I’d be happy to implement a Complex{Posit} type based on magnitude and angle.

2 Likes

Yes, that’s annoying, but possible to do slowly (e.g. with a lookup table, for small posits). I’m thinking you might make up for the slowness by using Julia’s type system.

Usually operations on a type give the same type, but that’s not required (e.g. division is already an exception in Julia), so you could e.g. have multiplication of PolarComplexPosits give the same type, but addition of them give the Euclidean form. Then you defer changing back to polar, i.e. using hypot2. Maybe the quire helps here (qToP might change back to polar). I’m not sure how it would be represented in software (or in hardware).

You could do e.g. 24-bit mag/8-bit angle = 32-bit, not just 8/8.
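Very roughly the kind of thing I mean, as a sketch with made-up names (the magnitude is a Float64 stand-in for a posit, and the 8-bit angle is a fixed-point fraction of a full turn):

# Multiplication stays in polar form (multiply magnitudes, wrap-add angles); addition
# drops to an ordinary Complex{Float64}, and conversion back to polar is deferred.
struct PolarComplex8
    mag::Float64   # stand-in for a posit magnitude
    ang::UInt8     # angle in units of 2π/256; UInt8 addition wraps, as angles should
end

Base.:*(a::PolarComplex8, b::PolarComplex8) = PolarComplex8(a.mag * b.mag, a.ang + b.ang)

torect(p::PolarComplex8) = p.mag * cis(2π * p.ang / 256)
topolar(z::Complex) = PolarComplex8(abs(z), round(Int, mod2pi(angle(z)) * 256 / 2π) % UInt8)

# Addition returns the Euclidean form; callers convert back with topolar only when needed.
Base.:+(a::PolarComplex8, b::PolarComplex8) = torect(a) + torect(b)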

Note, GitHub - JuliaMath/FixedPointNumbers.jl: fixed point types for julia

keep in mind that they are vulnerable to both rounding and overflow.

but you want the overflows (wraparound) anyway, and I’m not sure rounding is a problem either, since you get a float out anyway (Float32 below; Float64 would give 53 bits of significand precision, more than the max 27 of posit32 and even close to the 59 of posit64):

x = N0f8(0.8)
julia> sin(x)
0.7173561f0

A.

World’s first family of multicore CPUs based on POSIT number system.

They have on their website e.g. stuff like “Facebook started using POSIT in their AI hardware”, and link to the research in B.

I’m intrigued if that only means GPGPU (or actual graphics):

RacEr- the world’s first POSIT for Graphics Processing Unit (GPU). RacEr is shipped with highly silicon efficient architecture.

SupersoniK- the ultimate multicore CPUs for High Performance Computing (HPC) applications with accuracy exceeding that of double precision floating point numbers.

Their site has an FAQ and they claim e.g. (so they likely support Posit16 and Posit32; also Posit8?):

POSIT number system require half data width with respect to IEEE-754 FP. For example, POSIT require 16 bits to compute 32 bits accuracy of equivalent IEEE-754 single precision FP, POSIT require 32 bits to compute 64 bits accuracy of equivalent IEEE-754 double precision, etc.

Literature review
In order to overcome above many adders are invented such as ripple carry adder (RCA), carry save adder (CSA), carry look-ahead adder (CLA), carry skip adder (CSkA) and carry select adder (CSelA) [1]-[7].
[..]
With 10 years of R&D efforts we were finally able to crack carry propagation problem puzzle. VividSparks CFA technology detects carry bits in advance by just looking at input bit patterns. Carry Detection Unit has only logic depth of 6 CMOS transistors. Following figure 1 shows over all CFA architecture [..]

Further Applications of VividSparks CFA Technology

  • Leading 0’s and 1’s detection in arithmetic circuits
  • Booth multiplier CSA accumulation
  • Booth decoding of multiplier
  • Odd multiple generation in Booth multipliers for 3X, 5X, 7X and so on
  • Comparators
  • Branch prediction operations
  • ALU operations
  • Quire operations in POSIT and
  • many more applications!

B.
Note the claim about Facebook using posits may be about “log (posit tapered)”, i.e. “(8, 1, 5, 5, 7) log ELMA” in Table 2, not regular posits as in the standard. It’s unclear to me whether it’s better than “(8, 1) posit” or “(9, 1) posit” in the table, but it’s likely slightly better than “(8, 2) posit EMA”, which I think is the Posit-standard format.

Rethinking floating point for deep learning

https://arxiv.org/pdf/1811.01721.pdf

We improve floating point to be more energy efficient than equivalent bit width integer
hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel
hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from
Gustafson’s posit format. With no network retraining, and drop-in replacement of
all math and float32 parameters via round-to-nearest-even only, this open-sourced
8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32
ResNet-50 CNN model on ImageNet.

Typical floating point addition is notorious for its lack of associativity; this presents problems
with reproducibility, parallelization and rounding error [26]. Facilities such as fused multiply-add
(FMA) that perform a sum and product c + aᵢbᵢ with a single rounding can reduce error and further
pipeline operations when computing sums of products. Such machinery cannot avoid rounding error
involved with tiny (8-bit) floating point types; [..]

There is a more efficient and precise method than FMA available. A Kulisch accumulator [19] is
a fixed point register that is wide enough to contain both the largest and smallest possible scalar
product of floating point values ±(f_max² + f_min²). It provides associative, error-free calculation
(excepting a single, final rounding) of a sum of scalar floating point products; a float significand
to be accumulated is shifted based on exponent to align with the accumulator for the sum. Final
rounding to floating point is performed after all sums are made. A similar operation known as
Auflaufenlassen was available in Konrad Zuse’s Z3 as early as 1941 [18], though it is not found in
modern computers.

We will term this operation of summing scalar products in a Kulisch accumulator exact multiply add
(EMA). [..] Both EMA and FMA can be implemented for any floating point type. Gustafson
proposed Kulisch accumulators to be standard for posits, terming them quires.

Depending upon float dynamic range, EMA can be considerably more efficient than FMA in hardware. FMA must mutually align the addends c and the product ab, including renormalization logic for subtraction cancellation, and the proper alignment cannot be computed until fairly late in the process. Extra machinery to reduce latency such as the leading zero (LZ) anticipator or three path architectures have been invented [28]
[..]
5 Multiplier efficiency
Floating point with EMA is still expensive, as there is added shifter, LZ counter, rounding, etc. logic.
Integer MAC and float FMA/EMA both involve multiplication of fixed-point values; for int8/32
MAC this multiply is 63.4% of the combinational power in our analysis at 28 nm (Section 8).

A logarithmic number system (LNS) [16] avoids hardware multipliers entirely, where we round and
encode log_B(x) for some base B to represent a number x ∈ R. Hitherto we have considered linear
domain representations [..]

We will name this (somewhat inaccurately) exact log-linear multiply-add (ELMA). The log product
and linear sum are each exact, but the log product is not represented exactly by r(p(f)) as this
requires infinite precision, unlike EMA which is exact except for a final rounding.

Table 2: ResNet-50 ImageNet validation set accuracy per math type
[..]
This encoding we will refer to as (N, s, α, β, γ) log (posit tapered)

I think we can make posits faster than regular floating point, in software, on regular CPUs like x86 and ARM (for the vast majority of operations; most that matter?).

The trick would be to rely on Julia’s type system, leaning on the fact (from a code comment) that “Posit8, 16 are subsets of Float32” and “Posit32 is a subset of Float64”.
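A quick brute-force check of that claim for Posit16 (my own sketch, assuming SoftPosit.jl’s Posit16 is a 16-bit primitive type with Float32 conversions):

using SoftPosit

# Every Posit16 bit pattern should survive a round trip through Float32 unchanged
# if Posit16 really is a subset of Float32 (≤ 12 significand bits, exponents within ±56).
roundtrips_ok = all(0x0000:0xffff) do bits
    p = reinterpret(Posit16, bits)
    reinterpret(UInt16, Posit16(Float32(p))) == bits
end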

Right now:

julia> Posit8(1)+Posit8(1)
Posit8(2.0)

but if we defer rounding and converting back to Posit8 we can just use Float32 for the calculations:

julia> Posit8(1)+Posit8(1)
Posit8_as_Float32(2.0)

The quire, at 128 bits for Posit8, is irrelevant (for at least that operation), because most of the defined arithmetic isn’t specified to return such a result (or to take the quire as an input, or, it seems, use it as an intermediate).

When you load from memory you have to decode the Posit[8], yes, but loading from memory is slow anyway, and the 2x-4x higher bandwidth requirement of regular floats should make up for it.

Then you’re just doing the same float calculations thereafter, hopefully a few of them before you need to store to memory, at which point you round. I guess you then don’t round as often, so you will not get bit-identical results, and it’s not standard-conforming, but I think the software solution should be more accurate, with less rounding.
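Something like this hypothetical wrapper is what I have in mind (a sketch, assuming SoftPosit.jl’s Posit8 with Float32 conversions; Posit8AsFloat32 is a made-up name):

using SoftPosit

# Arithmetic stays in Float32, which holds every Posit8 value exactly; we only round
# back to Posit8 when explicitly asked, e.g. just before storing to memory.
struct Posit8AsFloat32 <: AbstractFloat
    x::Float32
end

Posit8AsFloat32(p::Posit8) = Posit8AsFloat32(Float32(p))
SoftPosit.Posit8(p::Posit8AsFloat32) = Posit8(p.x)    # the single deferred rounding

Base.:+(a::Posit8AsFloat32, b::Posit8AsFloat32) = Posit8AsFloat32(a.x + b.x)
Base.:*(a::Posit8AsFloat32, b::Posit8AsFloat32) = Posit8AsFloat32(a.x * b.x)

# Posit8(Posit8AsFloat32(Posit8(1)) + Posit8AsFloat32(Posit8(1)))  # Posit8(2.0), rounded once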

A similar trick could actually also be used for Float16 (which was thought of as a storage format, though that is changing, with CPUs starting to support it natively):

julia> Float16(2)+Float16(2)
Float16(4.0)

A potential option for (these new) posits, and for floats too, is:

It’s currently slow, because you’re forced to do it each time.

In general, if you multiply, say, 8-bit by 8-bit you get 16-bit; doubling the number of bits guarantees you don’t need any rounding (unless you do it repeatedly, and FPUs round anyway).

If you however add (or subtract) two 8-bit values you need only 1 extra bit, assuming you’re working with integers (or fixed point), not floats. For floats you have possible catastrophic cancellation (with subtraction), but I don’t see it getting any worse with my idea. The quire is, it seems, there to mitigate that, but you need to opt into using it (and it’s no longer supported in the pure-Julia SoftPosit.jl).

If you want the quire, there are only the 10 operations of (though they are likely also used indirectly in “5.5 Elementary functions of one posit value argument”, or is that avoidable?):

5.11 Functions involving quire value arguments

to support. And it seems it need not be too slow to support; it’s only a problem if you use it often, e.g. in a loop.
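To illustrate, a minimal software quire for Posit8 could be just a 128-bit fixed-point accumulator (my own sketch, not the SoftPosit.jl API; qmuladd/qtop are made-up names, and NaR handling is omitted):

using SoftPosit

# Fixed point with the binary point 48 bits up: minpos^2 = 2^-48 lands exactly on the grid,
# and maxpos^2 = 2^48 leaves ~31 bits of carry guard, mirroring the standard's 16n-bit quire.
const QSHIFT = 48

qmuladd(q::Int128, a::Posit8, b::Posit8) =
    q + Int128(ldexp(Float64(a) * Float64(b), QSHIFT))   # Posit8 products are exact in Float64

qtop(q::Int128) = Posit8(ldexp(Float64(q), -QSHIFT))      # may double-round in rare corner cases

# q = zero(Int128); for (a, b) in zip(xs, ys); q = qmuladd(q, a, b); end; qtop(q)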

The only other such function, fMM below, seems not bad, as it will fit in a CPU float:

5.7 Functions of three posit value arguments
fMM(posit1, posit2, posit3) returns posit1 × posit2 × posit3, rounded.
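For Posit8/Posit16 that is easy to see: each Posit16 factor has at most 12 significand bits, so the triple product (at most 36 bits, with an exponent well inside Float64’s range) is exact in Float64, and converting back is the single rounding the standard asks for. A sketch (mine, assuming SoftPosit.jl’s Posit16 with Float64 conversions):

using SoftPosit

# Exact triple product in Float64, then one rounding back to Posit16.
fMM(a::Posit16, b::Posit16, c::Posit16) = Posit16(Float64(a) * Float64(b) * Float64(c))

# Posit32 would need a wider intermediate: its triple products can carry up to 84 significand bits.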

Apparently there’s a competitor:

Beating Posits at Their Own Game: Takum Arithmetic

The author, Laslo Hunhold, also maintains a Julia package and a C package on Github:

2 Likes

I’ve just given that a quick read. It looks like quite a nice format.

2 Likes

Thanks, didn’t know of this (nor about further work on making Posits logarithmic).

I also see:

This repository contains a complete decoder and encoder for both the default logarithmic and linear Takum arithmetic formats, implemented in VHDL.

Hardware seemingly will make this fast, but addition now seems more complex than multiplication, because you can’t just shift?! I.e. I’m guessing it needs to multiply too, unless the exponents match. And not just for the logarithmic format, where it’s expected to be more complex.

A significant advantage of logarithmic number systems overall lies in the singular focus required within FPU design … Posits employing logarithmic significands have previously been explored and found to be well-suited for neural networks [14]. Notably, posits and takums with logarithmic significands offer the advantageous feature of perfect invertibility for every number

I’m no longer optimistic for Posits (or logPosits) for neural networks, with FP4 and FP6 adopted (and NF4 etc.), and those even outdated, with upcoming ternary/integer-based networks.

Takums have five fields in the encoding, and the sign, direction, and regime bits alone always take 5 bits in total, so you can’t compete (as is) with FP4 (or FP6?), even when there are no characteristic bits (they range from 0 to 7) or no mantissa bits. I suppose they can have 1+ or probably even 0 mantissa bits, i.e. ghost bits.

I suppose you could amend the structure of takums to have 2 or 1 regime bits, maybe skip the direction bit… you could even skip the sign bit (as some float formats are doing), but you don’t have a lot of room at FP4 or FP6 sizes, so if you skip those bits, you are degenerating to (small) [IEEE] floats…

All of these considerations prompted the default definition of takums with a logarithmic significand and the format is so tailored for being a logarithmic number system that defining it with a linear significand makes little sense, mostly because of the base √e that cannot properly translate into bit-shifts for arithmetic. If we do it anyway,

Additionally, we introduce a takum colour scheme, prioritising uniformity
in both lightness and chroma within the perceptually uniform OKLCH colour
space [19]. Detailed colour definitions are delineated in Table 3.

[see paper, table doesn’t display well here:]

Table 3: Overview of the takum arithmetic colour scheme.
colour identifier   OKLCH               CIELab                  HEX (sRGB)
sign                (50%, 0.17, 25)     (40.28, 53.78, 33.31)   #B02A2D
direction           (50%, 0.17, 142.5)  (43.91, −45.71, 46.68)  #007900
regime              (50%, 0.17, 320)    (39.29, 46.60, −39.19)  #8C399E
characteristic      (50%, 0.17, 260)    (40.34, 10.54, −59.37)  #1F5DC2
mantissa            (50%, 0.00, 0)      (42.00, 0.00, 0.00)     #636363

OKLCH is not to be confused with (something I also hadn’t heard of, which it’s based on, and similar I think):

Those and the takum colour scheme are intriguing; I need to look into them more… it might be a good way to save on bits (possibly more for compression than for real-time graphics). I’m still unsure if that’s the main problem in graphics, as in for the framebuffer; so much else needs to happen, and it’s GPU- and AI-limited.

I see in the code:

# conversion from BigFloat
Takum8(f::BigFloat) = Takum8(Float64(f))
Takum16(f::BigFloat) = Takum16(Float64(f))
Takum32(f::BigFloat) = Takum32(Float64(f))
LogTakum8(f::BigFloat) = LogTakum8(Float64(f))
LogTakum16(f::BigFloat) = LogTakum16(Float64(f))
LogTakum32(f::BigFloat) = LogTakum32(Float64(f))

function Takum64(f::BigFloat)
	if iszero(f)
		# truncated() is buggy when the input is zero
		return zero(Takum64)
	else

I believe this is double rounding, usually a bit of a flaw. But I’m unsure here; maybe not with the rounding they do (and since this is not a base-2 system, maybe not even in the linear format):

In terms of rounding, we adhere to the posit standard [11, Section 4.1] by employing ‘saturation arithmetic’

Since I’m Icelandic this was intriguing (umfang is not strictly wrong, I would likely find another word…):

The term ‘takum’ originates from the Icelandic phrase ‘takmarkað umfang’, translating to ‘limited range’. Pronounced initially akin to the English term ‘tug’, with a shortened ‘u’ sound, the ‘g’ is articulated as a ‘k’. The ‘-um’ follows swiftly after the ‘k’, pronounced akin to ‘um’ in the English term ‘umlaut’.

Caveat: I’m no expert.

Unless you have large, deep calculations (e.g., large linear algebra), speed and memory aren’t really issues and you might as well use BigFloat for all it matters. But for those big calculations, I worry that these posits/takums introduce too many sharp edges to ever find mainstream use.

One of my biggest concerns with these tapered float formats is the scale-dependent variability of accuracy. While numbers “close to 1” are more common in the physical world, intermediate calculations are another beast. Yes, floating point numbers can suffer from catastrophic cancelation. But tapered formats suffer from this as well as “catastrophic scaling”, where a*b / a can be significantly different from b. E.g. Takum16(5.5) * Takum16(1e60) / Takum16(1e60) == Takum16(5.602).

The replacement of Inf with saturating overflow in these tapered formats is a horrifying misfeature. Yes, saturating overflow isn’t literally the worst if it happens at the end of a calculation. But in any intermediate result it is devastating. E.g., Takum16(1e50) * Takum16(1e50) / Takum16(1e50) / Takum16(1e50) == Takum16(5.687e-24). Hopefully there would at least be an exception flag set? Though we know how closely everyone watches IEEE754 exceptions…

I haven’t read about them in a while so have forgotten and misremembered much, but (by my recollection) quires sound like a reinvention of the (failure that was the) x87 extended floating point unit. Yes, it had all sorts of ergonomic issues, but if anyone thought it was still a good idea (and that those issues were overcome-able) those could have been improved. Quires seem like a bandaid trying to cover fundamental issues of tapered formats and I don’t know that we should expect them to be any more relevant than x87.

4 Likes

Isn’t this a bit unfair? I mean if/when you’re using smaller versions you should compare to similar sized:

julia> Float16(5.5) * Float16(1e60) / 1e60
Inf

julia> Float16(5.5) * Float16(1e60) / Float16(1e60)
NaN16

julia> Float32(5.5) * Float32(1e60) / Float32(1e60)
NaN32

julia> Float64(5.5) * Float64(1e60) / Float64(1e60)
5.5

julia> Takum64(5.5) * Takum64(1e60) / Takum64(1e60)
Takum64(5.5)

julia> Takum32(5.5) * Takum32(1e60) / Takum32(1e60)
Takum32(5.50000036)

julia> LogTakum64(5.5) * LogTakum64(1e60) / LogTakum64(1e60)
LogTakum64(1)

Maybe you actually rather want the NaN[16]? While they emphasize that the range is sufficient, yes, I think this depends on you scaling your inputs to be of similar units, which you can do; otherwise it’s less accurate. 5.602 is still a close approximation to 5.5… better than NaN, and even 1 is… When you scale your inputs to be closer to 1 you get more out of your bits, and might be able to use Takum8 where you otherwise would use [b]float16?

What do you do when you get an Inf with regular single or double? You might be screwed if you don’t have the source code; your only option is to recompile with larger Float64/double. I suppose you could opt into an overflow exception with this format, at least in debug mode if not for production. Once you’ve made it not happen, is it then better? If you then rerun your code with other inputs and it actually happens, then you tolerate a bit more. Same with underflows? Maybe they are even worse: in IEEE you get 0.0, I think an underflow flag (some condition code? no one checks?), then later, or much later, maybe a divide by zero…

My point was not about the specific values or bitwidths. We all know that uncompensated floating point sums are prone to catastrophic cancelation. But tapered floats also extend this to uncompensated products. My example caused a 2% error with 2 operations (and could have been more by forcing the numbers closer to the limit). With 754-style floating point numbers, the product of N values (that does not under/overflow) is typically within something like ~N ULP of the correct answer (i.e., you’d need millions of multiplications to get a similar error in Float32, maybe a bit less with some very adversarial values).

If one wants to “center” their tapered floats near 1 for maximum accuracy, they’re doing a manual implementation of a floating point. The scale-invariant precision of 754-style floats is a really nice feature.

I suppose I could counter myself by arguing that 754’s effectively always take their worst case precision, but at least it’s a large precision. I read the takum paper a couple of days ago now, but I do vaguely recall that they do at least have some minimum significand which I do appreciate.

I want the Inf. Takum16(1e50) * Takum16(1e50) / Takum16(1e50) / Takum16(1e50) == Takum16(5.687e-24) is a really bad answer because it’s nowhere close to correct and has low chance of noticeably tainting downstream operations. An analogous 754 operation would yield Inf which is more suggestive of an issue. It will take me much longer to notice that the takum calculation failed and to locate it.

When do you want the Inf (or NaN/NaR)? If right away, I don’t see a reason you can’t have exceptions, not just for debug mode but for production too. I’m not sure an Inf, as in the encoding, is very useful. Isn’t it just annoying to get them at the end of your calculations, maybe hours later, possibly hugely expensive, and not knowing at all where the cause is? Takums and posits have only one NaN bit pattern, then called NaR, while IEEE has millions of them. A waste, unless you use them as a payload, which is rarely done; and it’s not a possibility for Infs at all.

1e50 is a huge number, close to the edge of the range they support (and it’s argued you need not go much higher). 1e50 squared is already outside the range they support (except by saturation; yes, I see your point, maybe it’s just bad).

While valid arguments support the inclusion of either NaR or ∞, the underlying
universal wheel algebra [4] is defined with both elements (a bottom element
⊥ := 0/0 and infinity ∞). However, including both elements is impractical as
it would compromise the symmetry of the model. The handling conventions for
NaR are further discussed in Section 4.6.

The design criterion is:

1b) The dynamic range should be reasonably bounded on both ends as the bit
string length approaches infinity.

Like:

Posits never overflow to infinity or underflow to zero, and “Not-a-Number” (NaN) indicates an action instead of a bit pattern. A posit processing unit takes less circuitry than an IEEE float FPU.

There are no “NaN” (not-a-number) bit representations with posits; instead, the calculation
is interrupted, and the interrupt handler can be set to report the error and its cause, or invoke
a workaround and continue computing, but posits do not make the logical error of assigning
a number to something that is, by definition, not a number. This simplifies the hardware considerably. If a programmer finds the need for NaN values, it indicates the program is not yet finished, and the use of valids should be invoked as a sort of numerical debugging environment to find and eliminate possible sources of such outputs. Similarly, posits lack a separate ∞ and −∞ like floats have; however, valids support open intervals (maxpos, ∞) and (−∞, −maxpos), which provide the ability to express unbounded results of either sign, so the need for signed infinity is once again an indication that valids are called for instead of posits. There is no “negative zero” in posit representation; “negative zero” is another defiance of mathematical logic that exists in IEEE floats.

An AI answer:

posits have only one infinity representation, which is denoted by a sign bit of 1 followed by all zeros

I feel I remember it too from the spec… [EDIT: Maybe it was for unums, and thus the AI got it wrong, or it was thinking of valids, which could also be supported by takums?] So Gustafson found it important enough (or it just fit the design) to support it for his posits (with no separate negative infinity though). For complex numbers etc. this gets stricter, with infinities in each quadrant, or actually infinitely many infinities. Do you find it important to get an infinity in the encoding or not, and if so, how specific, actually?

The use of exception handlers (especially floating point exceptions) in “typical programs” has become increasingly rare. Julia doesn’t even expose a way to check the floating point exceptions or set handlers and I don’t think it’s alone. I don’t see exceptions as a likely recourse without a massive paradigm shift (a big price for the sake of a slightly different encoding that most users would not notice).


The multiple encodings of NaNs really don’t bother me. If you replaced them with finite numbers, you could increase the dynamic range by a factor of 2. I just don’t see a maximum of 6.8e38 vs 3.4e38 as what Float32 was missing all this time (though I would guess the extra range would extend the minimum rather than maximum, in practice). I would only care slightly more in the case of Float16, but I don’t really care about Float16 (or smaller) in the first place.

I really can’t get behind this criterion they suggest. If I gave you 500 or 10000 bits to use, why would limiting yourself to 10^{55} be a feature? Thankfully (I either misunderstand the intent completely or) I don’t see it as central to posits and it’s probably not central to takums either. So it doesn’t bother me.

Of Gustafson’s criteria, the only one I actually thought IEEE754 would benefit from improving is the one about statelessness. I.e., it should be possible to control the floating point environment on a per-instruction level rather than via global settings. That’s hardly a novel observation and I hope that eventually we do get per-operation controls. None of the other criteria actually convinced me they were necessary and some didn’t even seem useful.


Personally, I like having infinity as an indicator of overflow, as a sentinel, as an initialization, or as the mathematically correct return from functions (like csc(+0.0)). I like how IEEE754 can usually indicate an under/overflow issue in value-space, because much of the time it isn’t even an issue but the answer you would have wanted anyway. Underflow to 0 is seldom an issue one would resolve any other way and an intermediate overflow to Inf sometimes works out to a finite and accurate final result too.

Signs on zeros and infinities were included in 754 as features and (despite that they can occasionally be a little tedious to address) I accept them as such. I wouldn’t discard them lightly.


I do have some sympathy for the idea of representing rounded/truncated results via intervals with valids, but such results can’t really survive multiple operations without using a pair of values anyway (which is easy to emulate) so it’s not obvious to me that these are practical. E.g., (1,\text{nextfloat}(1)) - 1 does not fit neatly into the interval (0, \text{nextfloat}(0)) so you’d be stuck representing both bounds independently. But I probably misremember big parts of that spec so my complaints here might not even make sense.

1 Like

I didn’t see it as exceptions you would trap, just crash the program. It should be very rare, even for the smaller takums, since they have the same range. Unlike posits, where under/overflow is then a real danger.

Overall, it is imperative for the reader to bear in mind that takums offer an almost 50% broader dynamic range than bfloat16 and an even more substantially broader dynamic range than posit8 and posit16. Despite this, takums consistently outperform or match posits and bfloat16 in all scenarios except for addition and subtraction.

Under/overflows are most dangerous when you multiply rather than add (or even more so with powers), when you overshoot by a lot. By definition addition/subtraction will only add one mantissa bit, so you’re not likely to go outside of your range, or at most doubly so. I think clamping might then be acceptable, as you are still not completely off… at least not in sign. And sometimes knowing only the sign of the calculation is useful. The signed Infs will, yes, give you that, but with no way of knowing the scale. And yes, takums will give you an incorrect scale then; I haven’t thought this through, maybe it would recover with the following operations… I think you could and should have a counter for how often you clamp on adds/subs, and a separate one for mul/div, if you just don’t trap; and if the former is low (and the latter is 0), then you have some sense of how far off in range you might be.

I can see that you would have an exception and a) trap on under/overflow, always, as the default; b) less safe, to opt into, allow clamping on adds/subs only; c) for debugged code, allow it also on other operations, or maybe only mul/div, if you want to risk that. I’m not sure the spec demands c, or even prefers it.

Yes, for BigFloat this doesn’t make sense, and you just keep using BigFloat. I see no reason not to use that in some cases, e.g. in debug mode. You can also test all your results against it or Float64 for a sanity check, on the same machine, assuming IEEE is still supported. But takums are explicitly designed for “low-bit applications with high encoding efficiency for numbers distant from unity”.

How distant? Posits have a smaller range, variable by type, and takums improve on that, not on IEEE. Yes, far from unity IEEE is better, and you seem to like that, i.e. dislike tapering. But I think you overlook the advantage in the common case, when you are close to unity. Then takums may also be better than IEEE and posits.

Yes, I mean it has a high chance of tainting, but a low chance of you detecting that tainted calculation. I believe that only happens if you overflow or underflow and do not stop the calculation.

I’m in the same camp. It’s an engineering decision. Infs cost you a check of the encoding on every operation, in software or hardware. You still need to check for under/overflow even if you have no Inf encoding, whether to clamp or to trap.

julia> 1/Inf
0.0

julia> 1/-Inf
-0.0

julia> 1^NaN
1.0

NaNs are poisoning (in most cases I can think of, except for above), but Infs are not real values, do not exist except in computers, are only limits, and as above no longer poisoning. I think such cases might be dangerous.

You can initialize with NaR (their NaN) for takums and posits, and I’m not sure, but I suppose csc and similar functions would or could return NaR. The posit standard, and I suppose takums, only covers primitive operations, so csc is outside its scope and you could, I think would, do such things in the library. Good enough for you? I only see signed Infs making sense together with signed zeros. Do you need the sign there? I believe valids would provide that, and for takums too.

AI answer may have googled and learned from incorrect websites like this one:

There are two exceptional posits, both with all zeros after the sign bit. A string of n 0’s represents the number zero, and a 1 followed by n-1 0’s represents ±∞.

Maybe the standard evolved and had this previously?