Posit Standard (2022): Floats/IEEE vs Posits, and ternary math

The spec is only 12 pages, much shorter than IEEE 754's (which it references), and the similarities to and differences from IEEE floats are intriguing:

6.5 Conversions between posit format and IEEE Std 754™ float format
[…] NaR converts to quiet NaN […]

v0.5 follows the 2022 Standard for Posit Arithmetic but drops quire support. v0.4 implements the previous draft standard and has quire support, but depends on the C implementation of SoftPosit.

The standard describes posit8 as the smallest “example”, but I believe you can (and may want to?) go smaller. At least I think that size may be very valuable (e.g. posit4, maybe unsigned?, would only need 64-bit quire [operations]), and better than Float8 (available in a package), and than [b]float16 in some cases.
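For intuition on how such small posits decode, here's a from-scratch Python sketch of posit decoding (es = 2 as in the 2022 standard); the function name and the two's-complement handling of negative patterns are my own reconstruction, not a reference implementation:

```python
import math

def decode_posit(p, n, es=2):
    """Decode the n-bit posit pattern p (es = 2 per the 2022 standard).
    A from-scratch sketch, not validated against a reference implementation."""
    mask = (1 << n) - 1
    p &= mask
    if p == 0:
        return 0.0
    if p == 1 << (n - 1):
        return math.nan                        # NaR
    sign = -1.0 if p >> (n - 1) else 1.0
    if sign < 0:
        p = (-p) & mask                        # two's complement: decode the magnitude
    bits = p & ((1 << (n - 1)) - 1)            # the n-1 bits after the sign bit
    i = n - 2
    rbit = (bits >> i) & 1
    run = 0
    while i >= 0 and (bits >> i) & 1 == rbit:  # regime: run of identical bits
        run += 1
        i -= 1
    regime = run - 1 if rbit else -run
    i -= 1                                     # skip the terminating regime bit
    e = 0
    for _ in range(es):                        # missing exponent bits read as 0
        e <<= 1
        if i >= 0:
            e |= (bits >> i) & 1
            i -= 1
    nf = max(i + 1, 0)                         # whatever is left is the fraction
    frac = 1.0 + (bits & ((1 << nf) - 1)) / (1 << nf)
    return sign * 2.0 ** (4 * regime + e) * frac

# the seven positive posit4 values, showing the tapered spacing
posit4_pos = [decode_posit(p, 4) for p in range(1, 8)]
```

With n = 4 this yields the positive values 1/256, 1/16, 1/4, 1, 4, 16, 256, plus zero, NaR and seven negatives: tapered precision with no subnormals to special-case.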

@Oscar_Smith @stevengj What caught my eye:

5.7 Functions of three posit value arguments
fMM(posit1, posit2, posit3) returns posit1 × posit2 × posit3, rounded.

[unlike IEEE, not defined there unless I missed something]

and its footnote:

Because multiplication is commutative and associative, any permutation of the inputs will return the same rounded result.

[And note, there’s no FMA taking three posit inputs, analogous to IEEE’s fma on three floats. But there is an “FMA” where one of the three inputs comes from the quire (qMulAdd), so I think it amounts to the same.]
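For intuition, fMM's "one rounding at the end" semantics can be sketched in a few lines, with Python doubles standing in for posits and exact rationals doing the triple product (a sketch of the semantics, not how an implementation would do it):

```python
from fractions import Fraction

def fMM(a, b, c):
    # form the triple product exactly, then round once to the working format
    # (here: the host double; a posit implementation would round to positN)
    return float(Fraction(a) * Fraction(b) * Fraction(c))

# any permutation of the inputs rounds identically, which is the footnote's point
assert fMM(0.1, 0.2, 0.3) == fMM(0.3, 0.2, 0.1) == fMM(0.2, 0.3, 0.1)
```

A naive `(a*b)*c` rounds twice, so reassociating it can change the last bit; fMM by construction cannot.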

Should Julia support fMM (and posits) at some point? I’m trying to think why it is defined; does anyone know, or want to guess? I suppose it can be an important operation, and while rounded, it must have something to do with the quire, which I think is a kind of fixed-point accumulator with guard bits to prevent overflow:

quire value: A real number representable using a quire format described in this standard, or NaR.
quire sum limit: The minimum number of additions of posit values that can overflow the quire format.

Note, this footnote:

The product of two posit values in a format of precision 𝑛 is always exactly expressible in a posit format of precision 2𝑛, but the quire format obviates such temporary doubling of precision when computing sums and differences of products. Sums of posit values using the quire are guaranteed exact up to 2^(23+4𝑛) terms, per the quire sum limit


Quire format can represent the exact dot product of two posit vectors having at most 2^31 (approximately two billion) terms without the possibility of rounding or overflow.

Note, the minimum “quire format precision” is 16𝑛 bits (changed from nbits²/2 in the draft standard), e.g. 128 bits for posit8 (no longer 32 bits), up to 512 bits for posit32, and no longer 2048 bits for posit64 (it would be 1024, but posit64 is dropped as an example since the draft; maybe it’s not a very useful or needed format?).

fused: rounded only after an exact evaluation of an expression involving more than one operation.

5.11 Functions involving quire value arguments
With the exception of qToP which returns a posit value result, these functions return a quire value result.13
qSubQ(quire1, quire2) returns quire1 − quire2.
qMulAdd(quire, posit1, posit2) returns quire + (posit1 × posit2).
qMulSub(quire, posit1, posit2) returns quire − (posit1 × posit2).
qToP(quire) returns quire rounded to posit format per Section 4.1.

Other functions of quire values may be provided, but are not required for compliance. They may be required in a future revision of this standard.
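To make the quire semantics concrete, here's a toy Python model of the 5.11 operations above, with exact rationals standing in for the wide fixed-point register (the class and method names mirror the section but are otherwise made up, not the SoftPosit.jl API):

```python
from fractions import Fraction

class Quire:
    """Toy quire: exact accumulation, with a single rounding only at qToP."""
    def __init__(self):
        self.acc = Fraction(0)
    def mul_add(self, p1, p2):            # qMulAdd(quire, posit1, posit2)
        self.acc += Fraction(p1) * Fraction(p2)
        return self
    def mul_sub(self, p1, p2):            # qMulSub(quire, posit1, posit2)
        self.acc -= Fraction(p1) * Fraction(p2)
        return self
    def to_p(self):                       # qToP: round once, at the very end
        return float(self.acc)

# a dot product accumulates with no intermediate rounding at all:
q = Quire()
for a, b in [(0.1, 0.1)] * 10:
    q.mul_add(a, b)
assert q.acc == 10 * Fraction(0.1) ** 2   # every partial sum is exact
```

The real quire gets the same effect with a fixed-width register because posit products land on a bounded dyadic grid; exact rationals just sidestep sizing the register here.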

The discussion in footnote 13 (on e.g. complex values) and elsewhere is intriguing.

NaR: Not a real. Umbrella value for anything not mathematically definable as a unique real number.

Intriguingly unlike:

julia> NaN == NaN
false

A test of equality between NaR values returns True.

exception: A special case in the interpretation of representations in posit format: 0 or NaR.
Posit representation exceptions do not imply a need for status flags or heavyweight operating system or language runtime handling.

regime: The power-of-16 scaling determined by the regime bits. It is a signed integer.

next(posit) returns the posit value of the lexicographic successor of posit’s representation.8
prior(posit) returns the posit value of the lexicographic predecessor of posit’s representation.9

the footnote is a bit scary (depending on what “if necessary” means here):

wrapping around, if necessary
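Concretely, next/prior are just ±1 on the two's-complement bit pattern, and the "wrapping around" only happens at the ends of the pattern space. A sketch for 8-bit patterns (0x7F = maxpos, 0x80 = NaR, 0xFF = −minpos):

```python
def next_posit(p, n=8):
    # lexicographic successor of the n-bit pattern, wrapping if necessary
    return (p + 1) & ((1 << n) - 1)

def prior_posit(p, n=8):
    return (p - 1) & ((1 << n) - 1)

assert next_posit(0x7F) == 0x80   # next(maxpos) is NaR
assert next_posit(0xFF) == 0x00   # next(-minpos) wraps around to zero
assert prior_posit(0x00) == 0xFF  # prior(zero) wraps to -minpos
```

So "if necessary" seems to mean only those pattern-space boundaries, where the successor crosses NaR or zero.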

6.2 Conversions involving quire values
A posit compliant system only needs to support rounding from quire to posit values and conversion of posit to quire values in the matching precision, per Section 5.11.


IMO, the Posit standard is mostly really good. The removal of Inf, -0.0, and all but one of the NaN values is a really good change. fMM is somewhat interesting, but IMO not very useful. Quires are a great idea for low precision, but I think the lack of fma is very sub-optimal. It makes some sense, since posits are bad for extended-precision (double-double) math (they lack some of the nice properties), but fma would still be really useful for things like polynomial evaluation. Quires are decent, but they’re big enough that I doubt we’ll ever see a chip with vectorized quire support. It would just take too much register space.

In terms of Julia support, there’s not much to do until processors start supporting Posit, which still seems a ways off.


I just added a link to the (pure) Julia implementation. Note the standard has FMA in some form, via the quire. SoftPosit.jl dropped the quire, however… but it’s available in the older v0.4.

Why not? I mean, there must be a reason it’s mandated in the standard; something I’m missing. I can see raising to the power of three (or higher) being useful for [Taylor] polynomials, but pow is binary, not ternary. p × p × p is more general; it must pop up in some important equation I’m not aware of, and/or its single rounding must be important.

Nvidia GPUs already have some accumulator. @Elrod I think vectorized (in software) 4- or 8-bit posits might be helpful… I suppose you need the largest quire among your posit types if you mix them, i.e. then for the lower-precision ones too. I think in hardware the registers would only have the sizes you want, e.g. 32 bits for posit32, and only the FPU (PPU? PAU?) would need a quire, maybe one each. I’m not so sure 512 bits is too bad (per FPU) in hardware, but yes, it’s slower in software; not too bad though? But you’re competing with Float32/64 (is double-double that much used?), even [b]float16, so I think to compete in software you would want to go lower, to posit8 or less (possibly times two for intervals).

2-bit to 32-bit with two exponent bits (posit_2_t) → Not fast: using 32 bits in the background to store all sizes.

@milankl The 2-bit posits I see there are likely overkill… more proof-of-concept, and posit4, which I mentioned, might not be too useful either. I was just thinking the software implementation would be really fast: 4-bit posit × 4-bit posit (or any other binary operation) only needs a 256-entry lookup table per operation, and the quire fits in a 64-bit register. There’s no need to avoid 128-bit integers for the next step up either; posit8 just requires two registers and a carry.
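For example, a full posit4 × posit4 multiply really is just a 256-entry table. A Python sketch; the value table is the hand-derived posit(4,2) set, and the rounding is a plain nearest-value search that ignores the standard's tie-breaking rule and its rule that nonzero values never round to zero:

```python
import math

# value of each 4-bit posit pattern (es = 2); pattern 0b1000 is NaR
P4 = [0.0, 1/256, 1/16, 1/4, 1.0, 4.0, 16.0, 256.0,
      math.nan, -256.0, -16.0, -4.0, -1.0, -1/4, -1/16, -1/256]

def encode4(x):
    # nearest posit4 value (simplified rounding, see note above)
    if math.isnan(x):
        return 0b1000
    return min((i for i in range(16) if not math.isnan(P4[i])),
               key=lambda i: abs(P4[i] - x))

# 256-entry multiplication table, indexed by one byte: (a << 4) | b
MUL4 = [0b1000 if math.isnan(P4[a] * P4[b]) else encode4(P4[a] * P4[b])
        for a in range(16) for b in range(16)]
```

Saturation falls out for free: 256 × 256 has 256 as the nearest representable value, so the table returns maxpos rather than overflowing.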

I’m aware of your (other) work, e.g. (wrapper):

The 202x compression of [arrays of] floats is great; I didn’t look into it too carefully, but I suppose you need to decompress in chunks, not one value at a time, so you can’t compute on it directly. It’s unclear if this defeats the purpose of posits, or if something similar could just be built for them (it’s on posit32, so possibly half the compression… down to posit8, 25x?).

Number Formats, Error Mitigation, and Scope for 16-Bit Arithmetics in Weather and Climate Modeling Analyzed With a Shallow Water Model https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020MS002246

2.1.3 The Posit Number Format

Posit numbers arise from a projection of the real axis onto a circle (Figure 1), with only one bit pattern for zero and one for Not-a-Real (NaR, also called complex infinity)

I’m not sure if complex numbers are used in your (climate modeling) application(s); I don’t see that (as opposed to “more complex models”). But complex numbers built from pairs of posits (rectangular form) are of course possible. I think though that a polar form might be valuable, with the angle just in straight fixed point; 4 bits are likely not useful for it, but 8 might often be enough (and a better use of bits than posit8 or Float8): an 8-bit posit magnitude plus an 8-bit fixed-point angle?

Just to clarify: for SoftPosit.jl v0.5 I wanted to drop the dependency on SoftPosit-C, as it’s barely maintained and I’d rather SoftPosit.jl go its own way, which turned out to be faster and hopefully will also make it quicker to add new features in the future. However, I didn’t have the time to implement quires, nor did I ever really end up using them. If anyone wants to implement quires, I’m happy to help, but I don’t want to take the lead.

Similar for other off-standard posit types. Sure, Posit2, 4, 6 are all possible. As far as I know, research that uses Float8 often redefines the NaNs, as they are a massive waste of bit patterns at such low precision, and/or drops subnormals for a simpler implementation. So that clearly points to a better use of posits at such low precision. But again, the hardware question remains, unless someone implements posit arithmetic on an FPGA. Same here: for precision tests one could implement Posit2 or 4 (Posit8 already exists), but I personally don’t have any application for that, meaning I’m happy to help but I won’t do it myself.


Are they (or other larger-than-double formats) needed? Nobody has measured anything that accurately.

I noticed the quire changed in the standard, to linear in size from quadratic in the draft (the break-even is 512 bits, at posit32). In fact posit64 is no longer mentioned; maybe it’s not that important?

Double has “53 bits (52 explicitly stored)” of fraction length. posit32 was previously described with 28 max significand bits, now “0 to 27 bits” of fraction length; and posit8 had 8 max, now “0 to 3 bits”.
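Those fraction-length numbers follow directly from the layout: 1 sign bit, a shortest-possible regime of 2 bits, and es = 2 exponent bits leave at most n − 5 fraction bits. A quick check:

```python
ES = 2  # the 2022 standard fixes es = 2 for every posit size

def max_fraction_bits(n):
    # sign (1) + shortest regime (2) + exponent (ES) leave the rest
    return max(n - 1 - 2 - ES, 0)

assert max_fraction_bits(32) == 27  # "0 to 27 bits" for posit32
assert max_fraction_bits(8) == 3    # "0 to 3 bits" for posit8
assert max_fraction_bits(64) == 59  # posit64, were it still an example
```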

I see in the draft:

Note: If hardware supports posit multiplication, addition, subtraction, and the quire, all remaining
functionality can be supported with software.

That exact phrase is dropped (but presumably still true?!); now, searching for “software”, I only find:

A posit compliant system may be realized using software or hardware or any combination.

Complex numbers arise in climate modelling mostly through Fourier and other related transforms (spherical harmonic transforms, for example). Meaning, while input and output fields are real, one may choose a numerical method that uses complex numbers in between. I thought about this idea to use posits for complex numbers directly, given that they arise from a projection onto the circle, but never got very far.

Thinking out loud: instead of representing a complex number z = x + i*y as a sum of a real and an imaginary part, which can both be represented as float/posit/whatever, one can write z = m*exp(phi*i) with a magnitude m and an angle phi. If we were to represent z with posits in this magnitude-angle sense, then one first needs an unsigned posit for m (a bit annoying to change the posit definition, but let’s ignore that), and probably wants phi = reinterpret(Unsigned, p), with p a posit number, to represent the angle. This is because posits are equi-angle distributed around the circle. Multiplication of two such z’s is easy, z1*z2 = m1*m2*exp((phi1+phi2)*i): you’d just multiply the magnitudes with posit arithmetic and add the angles with integer arithmetic. However, what to do with addition (and subtraction)?

z1 + z2 = m1*cos(phi1) + m2*cos(phi2) + i*(m1*sin(phi1) + m2*sin(phi2))

So we’d need a handy shortcut to approximate/calculate something like m*cos(phi), which I never got my head around how to do. John’s notebook PDF has a chapter on how posits can approximate the sigmoid function, but if anyone has an idea how to approximate cos/sin, please let me know. I’d be happy to implement a Complex{Posit} type based on magnitude and angle.
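A minimal sketch of the magnitude-plus-integer-angle idea (in Python, with a plain float standing in for the unsigned-posit magnitude; `PolarZ`, the 8-bit angle code, and `from_complex` are all made-up names). Multiplication stays in polar form; addition has to fall back to rectangular, which is exactly where the m*cos(phi) problem appears:

```python
import cmath, math

ANGLE_BITS = 8  # the angle code is a fixed-point fraction of a full turn

class PolarZ:
    def __init__(self, mag, code):
        self.mag = mag                              # magnitude (ideally an unsigned posit)
        self.code = code & ((1 << ANGLE_BITS) - 1)  # angle = code/256 of a turn

    def __mul__(self, other):
        # cheap: multiply magnitudes, add angle codes with integer wrap-around
        return PolarZ(self.mag * other.mag, self.code + other.code)

    def to_complex(self):
        phi = 2 * math.pi * self.code / (1 << ANGLE_BITS)
        return cmath.rect(self.mag, phi)            # the m*cos / m*sin evaluation

    def __add__(self, other):
        # no cheap polar addition: round-trip through rectangular form
        return from_complex(self.to_complex() + other.to_complex())

def from_complex(z):
    code = round(cmath.phase(z) / (2 * math.pi) * (1 << ANGLE_BITS))
    return PolarZ(abs(z), code)

z1 = PolarZ(2.0, 32)          # 2 at 45 degrees
z2 = PolarZ(3.0, 32)          # 3 at 45 degrees
assert (z1 * z2).code == 64   # angles add exactly: 90 degrees
```

The multiply path never touches cos/sin; only `__add__` does, so that is the operation whose cost (table, CORDIC, or approximation) would decide whether the representation pays off.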


Yes, that’s annoying, but possible to do slowly (e.g. with a lookup table, for small posits). I’m thinking you might make up for the slowness by using Julia’s type system.

Usually operations on a type give back the same type, but that’s not required (e.g. division is already an exception in Julia), so you could have multiplication of PolarComplexPosits give the same type, but adding them give a Euclidean (rectangular) type. Then you defer changing back to polar, i.e. using hypot. Maybe the quire helps here (qToP might change back to polar). I’m not sure how it would be represented in software (or in hardware).

You could do e.g. a 24-bit magnitude/8-bit angle = 32 bits, not just 8/8.

Note, GitHub - JuliaMath/FixedPointNumbers.jl: fixed point types for julia

keep in mind that they are vulnerable to both rounding and overflow.

but you want the overflows anyway (for the angle), and I’m not sure rounding is a problem either, since you already get Float64 out (53 bits of significand precision, more than the max 27 fraction bits of posit32, even close to the 59 of posit64):

julia> x = N0f8(0.8)

julia> sin(x)


World’s first family of multicore CPUs based on POSIT number system.

They have on their website e.g. stuff like “Facebook started using POSIT in their AI hardware”, and link to research in B.

I’m intrigued if that only means GPGPU (or actual graphics):

RacEr- the world’s first POSIT for Graphics Processing Unit (GPU). RacEr is shipped with highly silicon efficient architecture.

SupersoniK- the ultimate multicore CPUs for High Performance Computing (HPC) applications with accuracy exceeding that of double precision floating point numbers.

Their site has an FAQ, and they claim e.g. (so they likely support Posit16 and Posit32; also Posit8?):

POSIT number system require half data width with respect to IEEE-754 FP. For example, POSIT require 16 bits to compute 32 bits accuracy of equivalent IEEE-754 single precision FP, POSIT require 32 bits to compute 64 bits accuracy of equivalent IEEE-754 double precision, etc.

Literature review
In order to overcome above many adders are invented such as ripple carry adder (RCA), carry save adder (CSA), carry look-ahead adder (CLA), carry skip adder (CSkA) and carry select adder (CSelA) [1]-[7].
With 10 years of R&D efforts we were finally able to crack carry propagation problem puzzle. VividSparks CFA technology detects carry bits in advance by just looking at input bit patterns. Carry Detection Unit has only logic depth of 6 CMOS transistors. Following figure 1 shows over all CFA architecture […]

Further Applications of VividSparks CFA Technology

  • Leading 0’s and 1’s detection in arithmetic circuits
  • Booth multiplier CSA accumulation
  • Booth decoding of multiplier
  • Odd multiple generation in Booth multipliers for 3X, 5X, 7X and so on
  • Comparators
  • Branch prediction operations
  • ALU operations
  • Quire operations in POSIT and
  • many more applications!

Note, the claim about Facebook using posits may be about “log (posit tapered)”, i.e. “(8, 1, 5, 5, 7) log ELMA” in Table 2, not regular posits as in the standard. It’s unclear to me whether it’s better than “(8, 1) posit” or “(9, 1) posit” in the table, but it’s likely slightly better than “(8, 2) posit EMA”, which I think is the Posit standard format.

Rethinking floating point for deep learning

We improve floating point to be more energy efficient than equivalent bit width integer
hardware on a 28 nm ASIC process while retaining accuracy in 8 bits with a novel
hybrid log multiply/linear add, Kulisch accumulation and tapered encodings from
Gustafson’s posit format. With no network retraining, and drop-in replacement of
all math and float32 parameters via round-to-nearest-even only, this open-sourced
8-bit log float is within 0.9% top-1 and 0.2% top-5 accuracy of the original float32
ResNet-50 CNN model on ImageNet.

Typical floating point addition is notorious for its lack of associativity; this presents problems
with reproducibility, parallelization and rounding error [26]. Facilities such as fused multiply-add
(FMA) that perform a sum and product c + aibi with a single rounding can reduce error and further
pipeline operations when computing sums of products. Such machinery cannot avoid rounding error
involved with tiny (8-bit) floating point types; […]

There is a more efficient and precise method than FMA available. A Kulisch accumulator [19] is
a fixed point register that is wide enough to contain both the largest and smallest possible scalar
product of floating point values ±(f²max + f²min). It provides associative, error-free calculation
(excepting a single, final rounding) of a sum of scalar floating point products; a float significand
to be accumulated is shifted based on exponent to align with the accumulator for the sum. Final
rounding to floating point is performed after all sums are made. A similar operation known as
Auflaufenlassen was available in Konrad Zuse’s Z3 as early as 1941 [18], though it is not found in
modern computers.

We will term this operation of summing scalar products in a Kulisch accumulator exact multiply add
(EMA). […] Both EMA and FMA can be implemented for any floating point type. Gustafson
proposed Kulisch accumulators to be standard for posits, terming them quires.

Depending upon float dynamic range, EMA can be considerably more efficient than FMA in hardware. FMA must mutually align the addends c and the product ab, including renormalization logic for subtraction cancellation, and the proper alignment cannot be computed until fairly late in the process. Extra machinery to reduce latency such as the leading zero (LZ) anticipator or three path architectures have been invented [28]
5 Multiplier efficiency
Floating point with EMA is still expensive, as there is added shifter, LZ counter, rounding, etc. logic.
Integer MAC and float FMA/EMA both involve multiplication of fixed-point values; for int8/32
MAC this multiply is 63.4% of the combinational power in our analysis at 28 nm (Section 8).

A logarithmic number system (LNS) [16] avoids hardware multipliers entirely, where we round and
encode logB (x) for some base B to represent a number x ∈ R. Hitherto we have considered linear
domain representations […]

We will name this (somewhat inaccurately) exact log-linear multiply-add (ELMA). The log product
and linear sum are each exact, but the log product is not represented exactly by r(p(f )) as this
requires infinite precision, unlike EMA which is exact except for a final rounding.

Table 2: ResNet-50 ImageNet validation set accuracy per math type
This encoding we will refer to as (N, s, α, β, γ) log (posit tapered)

I think we can make posits faster than regular floating point, in software, on regular CPUs like x86 and ARM (for the vast majority of operations; most that matter?).

The trick would be to rely on Julia’s type system to help, based on the facts (from a code comment) that “Posit8, 16 are subsets of Float32” and “Posit32 is a subset of Float64”.

Right now:

julia> Posit8(1)+Posit8(1)

but if we defer rounding and converting back to Posit8 we can just use Float32 for the calculations:

julia> Posit8(1)+Posit8(1)

The quire at 128 bits for Posit8 is irrelevant (for at least that operation), because most of the defined arithmetic isn’t specified to return such a result, nor to take the quire as input or, it seems, use it as an intermediate.

When you load from memory you have to decode the Posit[8], yes, but loading from memory is slow anyway, and regular floats’ 2x-4x higher bandwidth requirement should make up for it.

Then you’re just doing the same float calculations thereafter, hopefully a few of them before you need to store to memory, and only then do you round. I guess you then don’t round as often, so you won’t get bit-identical, standard-conforming results, but I think the software solution should be more accurate, with less rounding.
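The payoff of deferring the rounding can be shown with any coarse storage format. Below, a truncated-float32 stand-in (bfloat16-style truncation, purely illustrative, nothing like posit rounding) loses every small addend when rounding after each step, but keeps them when the accumulation runs wide and rounds once at the store:

```python
import struct

def store(x):
    # crude low-precision storage format: keep only the top 16 bits of a float32
    b, = struct.unpack("I", struct.pack("f", x))
    return struct.unpack("f", struct.pack("I", b & 0xFFFF0000))[0]

xs = [1.0] + [0.001] * 256

eager = 0.0                               # round after every single operation
for x in xs:
    eager = store(eager + store(x))       # each 0.001 vanishes below 1.0's ulp

lazy = store(sum(store(x) for x in xs))   # decode once, accumulate wide, round once

assert eager == 1.0                       # the small terms were all lost
assert lazy == 1.25                       # close to the true 1.256
```

The same reasoning applies to doing Posit8 work in Float32: decode at loads, compute wide, round back only at stores.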

A similar trick could actually also be used for Float16 (which was thought of as a storage format, though that is changing and CPUs are starting to support it natively):

julia> Float16(2)+Float16(2)

A potential option for (these new) posits, and well, floats too, is:

It’s currently slow, because you’re forced to do it each time.

In general, if you multiply say 8-bit by 8-bit you get 16 bits; doubling the number of bits guarantees you don’t need any rounding (unless you keep multiplying repeatedly, and FPUs round anyway).
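That 2n-bit exactness is just integer arithmetic on the significands: a p-bit integer times a p-bit integer always fits in 2p bits, so no rounding can occur. A trivial sanity check (arbitrary, made-up significand values):

```python
p = 8
a = (1 << (p - 1)) | 0b0101011    # two arbitrary normalized p-bit significands
b = (1 << (p - 1)) | 0b0011001
assert a * b < 1 << (2 * p)       # the exact product always fits in 2p bits
assert a * b >= 1 << (2 * p - 2)  # and stays normalized within one shift
```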

If you instead add (or subtract) two 8-bit values you need only 1 extra bit, assuming you’re working with integers (or fixed point), not floats. For floats you have possible catastrophic cancellation (with subtraction), but I don’t see it getting any worse with my idea. The quire, it seems, is meant to mitigate that, but you need to opt into using it (and it’s no longer supported in the pure-Julia SoftPosit.jl).

If you want the quire, there are only the 10 operations of (though they are likely also used indirectly in “5.5 Elementary functions of one posit value argument” too, or is that avoidable?):

5.11 Functions involving quire value arguments

to support. And it seems it need not be too slow to support; it’s only a problem if you use it often, e.g. in a loop.

The only function there seems not bad, as it will fit in a CPU float:

5.7 Functions of three posit value arguments
fMM(posit1, posit2, posit3) returns posit1 × posit2 × posit3, rounded.