Posits - a new approach could sink floating point computation

I really need to complete my second book, one that focuses just on posit arithmetic and does a better job of showing both the advantages and disadvantages of posits compared to floats. I’ve never cherry-picked examples to make posits look good.

For most situations, I don’t want the compiler to do anything to transform code that uses posits. I was making a reference to the technique used in ACRITH, PASCAL-XSC, and C-XSC where a basic block with plus-minus-times-divide is converted to a lower triangular system of equations; that system can then be solved to within 0.5 ULPs using residual correction, by using the quire to evaluate the residual with only 0.5 ULPs maximum error. Kulisch and colleagues did great work automating the process and getting IBM to commercialize it, but they went after super-high precision for which there is really not much of a market. Had they instead shown that 32-bit or 16-bit representation could safely replace 64-bit representation and thereby saved time/energy/power/storage/etc., I think it might have gotten some traction.
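The residual-correction idea can be sketched in a few lines of Julia; this is only an illustration (the 2×2 system below is made up, and Float64 accumulation stands in for the quire's exact residual evaluation):

```julia
using LinearAlgebra

# Illustrative residual correction (iterative refinement): solve in low
# precision, then evaluate the residual in higher precision.
# Float64 here is a stand-in for the quire's exact accumulation.
A = Float32[4 1; 1 3]
b = Float32[1, 2]
x = A \ b                                      # low-precision solve
r = Float64.(b) - Float64.(A) * Float64.(x)    # residual, higher precision
x += Float32.(A \ Float32.(r))                 # one correction step
```

With the quire, the residual r would be exact rather than merely higher-precision, which is what makes the 0.5 ULP guarantee possible.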

The Marc Reynolds piece is quite hilarious (intentionally), peppered with Don Rickles-like insults. It is relatively free of mathematical or factual errors. The same cannot be said for the work coming out of INRIA, though I went out of my way to make sure they got a paper into our Conference on Next-Generation Arithmetic (CoNGA) last March. The long-established theorems about floating-point error all depend on excluding some of the range. Many theorems fail for subnormal floats, and if multiplication or division are involved, they exclude HALF of all possible inputs to avoid overflow and underflow! If you apply similar exclusions to posits, you find they satisfy even stronger theorems and have provably higher accuracy over the allowed range. The statement that posits can produce huge relative errors when they get into the very large magnitude range where there are few bits of significand is true; it is equally true that floats can produce infinite relative errors, for a very large part of the set of all possible inputs (about 25 percent when computing a product), making them a far weaker stand-in for real numbers.
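The overflow half of that statement is easy to demonstrate: both factors below are ordinary, representable Float32 values, yet their product rounds to Inf, an infinite relative error:

```julia
# Two perfectly ordinary Float32 inputs whose product overflows:
x = 1.0f30
y = 1.0f20
p = x * y     # true product is 1e50, far beyond floatmax(Float32)
isinf(p)      # true: the relative error of this product is infinite
```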

The main author, Florent de Dinechin, is a brilliant guy and while critical (like Kahan), he’s intrigued by posits. But he makes some howling errors, like a recent paper where he says posit hardware has to test for the 0 and NaR exception, but IEEE floats do not, so he claims float hardware is cheaper. I figured out what he means: He assumes all exception cases are handled in software! Which works, but it’s two orders of magnitude slower and also provides a security hole like Meltdown and Spectre. So please read de Dinechin’s work with a grain of salt.


Are you referring to https://ieeexplore.ieee.org/document/7974115 ?

Edit: The abstract of a newer paper describing actual hardware (?) is quite positive and promising, to quote:
“… The flops performance of this architecture remains within the range of a regular fixed-precision IEEE FPU while enabling arbitrary precision computation at reasonable cost.”

OK, replying to myself instead of editing again. After some more searching, my guess is that the reference is in fact
“Hardware cost evaluation of the posit number system”

The conclusion repeats the claim that classical error analysis no longer applies:
“The accuracy win in many situations should not hide the loss of some properties at the core of classical error analysis [5], which requires further studies.”
I would appreciate it if someone could dig out a reference which shows the complete body of numerical error analysis done for Posits.

The fact that the error of posit operations is generally not representable as another posit value seems quite problematic. That’s a very useful and valuable property of IEEE floats, which holds except at the extremes. The need to rework all of numerical analysis de novo without constant relative error is also troubling. That essentially means that you can’t just apply standard numerical analysis results to posit calculations.
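That float property is exactly what error-free transformations such as Knuth’s TwoSum rely on; a minimal Julia illustration:

```julia
# TwoSum (Knuth): s is the rounded sum, err the exact rounding error.
# The key property: err is itself exactly representable as a float.
function two_sum(a::Float64, b::Float64)
    s  = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err
end

s, err = two_sum(1.0, 1e-16)   # 1e-16 is below eps(1.0)/2, so s == 1.0
# s + err recovers the exact mathematical sum a + b
```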

It seems that the precision benefits of posits are due to having more precision near 1.0. This is great for computations in that range but there are significant downsides:

  1. Lack of scale invariance
  • With floats, as long as the entire computation remains in [-realmax(T), -realmin(T)] ∪ [realmin(T), realmax(T)], in other words as long as there’s no overflow or underflow, scale doesn’t matter: the computation will be just as accurate at any scale; if the scale factor is a power of 2, scaling before or after gives the identical answer.
  • With posits this is no longer true: scale matters very much. There is a “golden zone” where it’s at least as accurate as floats, but that zone is quite small, roughly ±10⁶ for Posit32 as compared to ±3.4×10³⁸ for Float32.
  2. Non-closure under operation error
  • Float operations have errors which are also floats except at the extremes, which is of great use when writing algorithms that are guaranteed to be correctly rounded.
  • The same is not true of posits. It’s unclear how to write correctly rounded algorithms for posits without the ability to represent the error of operations precisely.
  3. Non-constant relative error
  • With floats, the relative error of each arithmetic operation is bounded by a constant for the given floating-point type. This means the overall relative error of a computation is a straightforward function of the number of arithmetic operations in the computation.
  • This is not true of posits. It is unclear how to determine the relative error of computations with posits. It’s not a straightforward function of the number of arithmetic operations.
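The power-of-two part of point 1 is easy to confirm in Julia: with no overflow or underflow, scaling by 2²⁰ before the multiply and unscaling after gives a bit-identical result:

```julia
x, y = 1.1f0, 2.3f0
s = 2.0f0^20                   # power-of-2 scale factor
unscaled = x * y
rescaled = ((s * x) * y) / s   # multiplying by a power of 2 is exact,
                               # so the rounding happens identically
rescaled == unscaled           # true, bit-identical
```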

There are some vague claims that compilers can somehow mitigate these problems, but I don’t really see how a compiler would do that. Would the compiler be expected to turn a simple arithmetic operation into a scale check with multiple code paths depending on the scale of the values involved? That sounds like it would be quite bad for performance. I’d love to hear some clarification about what compilers might do to mitigate these problems.


Hi John, thanks for the reply.

I would still call this a compiler transformation: the semantics of the code no longer match the operations being performed, and moving operations between blocks could then give different results, even if the order of operations is preserved. There is also the question of what to do if the problem requires more quires than the hardware can provide: should intermediate values be truncated when spilling to memory? I actually think the quire is one of the more intriguing parts of your proposal, and have often wished for better ways of storing and working with intermediate higher-precision values, but without more work on how they should be handled at the programming language and compiler level, I suspect they will face predictability/reproducibility problems similar to those which plagued the use of the 80-bit mode with x87 instructions.


I agree that Julia is by far the best platform for exploring posits (or any other new number format). I hope that SoftPosit helps enable this. Julia was the first platform to support posits because Yonemoto got them working in Julia in a matter of days after I invented them in December 2016. They were very general but not very fast. With SoftPosit, they are constrained to the following sizes (nbits = total number of bits; es = number of exponent bits)

2 to 8 bits with es = 0
2 to 16 bits with es = 1
2 to 32 bits with es = 2

and we’re seeing about 30 clock cycles per simulated posit operation on a modern x86 processor, very similar to what SoftFloat achieves (John Hauser’s IEEE 754 implementation, which is what SoftPosit is patterned after).


No, that one was about Type I unums. The one I was referring to is pre-published here:


Besides the outrageous claim that only posits need to deal with exceptions in hardware, the paper cheats by maintaining all floats in an easy-to-process decoded form, then charges posits the cost of converting into that form and back but not the floats. I’m not sure who refereed the paper, but that should have been caught and corrected.

Peter Lindstrom just wrote me an email saying he tested a small algorithm adjusted to use the subnormal range of IEEE floats, and it ran about twenty times slower than if it used normal floats. This is one of the “dirty little secrets” of IEEE 754 hardware implementations: they have a wildly erratic performance profile if all possible inputs are considered; vendors tend to quote only speeds for which multiply-adds pipeline perfectly, since any exception throws you into a software or microcoded handler that takes about 200 clock cycles instead of 4 or 8. I found the following paper, “On Subnormal Float and Abnormal Timing,” to be quite an eye-opener:

Perhaps the most interesting hardware question is this: How much extra hardware would be needed for an IEEE-compliant FPU to also support posits? The first commercial CPUs that support both data types will almost certainly use this approach, and I believe only a very small amount of additional hardware is needed if the float hardware actually supports subnormal arithmetic instead of throwing an exception when all the exponent bits are zero. A “count leading zeros” (CLZ) instruction is needed for subnormal floats and for every posit input, so once you have a fast CLZ in hardware, it’s just a matter of making a subtle change to the use of all the building blocks that make up a standard FPU. Peter Hofstee, inventor of the Cell processor and a very senior VP at IBM, believes a POWER processor should be able to run both posits and floats with, at most, one additional pipeline stage and no difference in throughput.
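For the curious, the regime decoding that this CLZ hardware enables can be sketched in Julia using `leading_zeros`/`leading_ones` (a sketch for positive posit32 patterns only; the 0 and NaR special cases are omitted):

```julia
# Extract the regime value k from a positive posit32 bit pattern.
# After the sign bit, the regime is a run of identical bits terminated
# by the opposite bit: a run of m ones gives k = m - 1, a run of m
# zeros gives k = -m.
function regime_value(p::UInt32)
    bits = p << 1                        # drop the sign bit
    if bits & 0x80000000 != 0
        return leading_ones(bits) - 1    # run of 1s
    else
        return -leading_zeros(bits)      # run of 0s
    end
end

regime_value(0x40000000)   # posit 1.0, regime k = 0
```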


But does it fold? :wink: Hi John.

I perhaps am not adding much to the discussion here.
Fantastic to see @John_Gustafson in the discussion - please hang around this board, the Julia community are friendly and helpful.
I guess I am not saying anything new when I say that fewer clock cycles also means less energy use. As an HPC person, I have read that Exascale is achievable now: all you have to do is throw money and energy at the problem. For a real-world Exascale setup, a target of 30 megawatts of input power has been set, so clearly power per computational task becomes vital.
Also I read that the data centres in the world consume as much power as the nation of Italy - that probably has gone up since I got that statistic.

The discussion on posits, and in general looking at precision is refreshing to me. I am no numerical geek but through the years I have seen the ‘throw double precision at everything and it will be alright’ philosophy.
It took the machine learning and GPU community to re-examine this and do work on using fewer bits per number (*) while still getting valid results.

Finally as a physicist, I think remembering that there is a difference between precision and accuracy is relevant here.

(*) I won’t fall into the trap of saying “less precise” here


OK, thank you. Much appreciated. A small remark: there appears to be at least one newer revision. Unfortunately, it appears to be impossible to see the changes directly.

I believe that https://hal.inria.fr/hal-02087098, which I mentioned earlier, argues quite convincingly in a similar direction. Hardware support for “arbitrary” precision (of course it is limited by the hardware in the end) would be a very nice thing to have if one needs it. And since posits apparently form a superset containing the IEEE floats, that IMO makes sense, either for specialized hardware or as a transition strategy.

EDIT: Changed the second link, i.e. https://hal.inria.fr/hal-02087098, to point to the right paper. Sorry! And yes, this is Unum Type I and not posits.

A perhaps throwaway remark. Intel are promoting FPGAs these days, with FPGAs being bonded into the same package as conventional CPUs. I understand the concept of FPGAs. However, I don’t understand how the data bits get across to the FPGA. So saying that “soon we will have reconfigurable CPUs which can be changed to do almost any posit operation” is probably well wide of the mark.
I ought to look at how the FPGA is connected - nothing really can beat being encoded in the CPU silicon and also being at the top of the cache hierarchy.
Can anyone comment: are the FPGAs in the same logical place as DRAM, i.e. outside the caches?

While these are fascinating questions, are you sure that this forum is the best place to discuss them?

Especially in a tangent to a topic that is already pretty much tangential to Julia.

If not here then where? I think the Julia community gathers a higher concentration of people interested in unusual numeric types than anywhere else, partly because of the language’s unique ability to implement them and use them easily, and partly just because of the nature of people using the language. Anyone who doesn’t want to discuss it can always mute this thread :grimacing:


I haven’t been following the latest developments in posit land. Are lookup tables still the suggested—at the time, the only—way of implementing posit arithmetic, or was that only the case for an earlier incarnation of the design?

I was referring to the FPGA sub-topic.

Generally the way these integrations work is that the FPGA part of the chip gets cache coherent access to the CPU bus. I.e. you can use it as a fast on-chip accelerator, but not to implement extra instructions for your CPU (unless you’re willing to stall your CPU while the instruction finishes). On the ARM/Xilinx integration side, this generally seems to be an AXI bus, which is fairly standard in that world. I don’t know very much about the Intel version of this, but their marketing suggests that they connect the FPGA on die using UPI, which does have cache coherency support (similar to multiple CPU packages), but I’m not sure to what extent that’s exposed to the programmable logic. You will be perhaps unsurprised to learn that the FPGA industry is riddled with absolutely horrible archaic toolchains, so even if experimentation is possible, it’s certainly not pleasant. I do see some movement in the open source FPGA space these past few months, so I’m somewhat hopeful that this will change in the not too distant future.


Something that Xilinx is working on in particular with their Zynq parts (which integrate ARM cores and FPGA into the same silicon) is a set of tools to allow one to profile software compiled for the ARM and then decide to accelerate it by implementing functions as hardware in the FPGA. They also have the ability in that family to do dynamic reprogramming of the FPGA (portions may be reprogrammed without a system reset or reboot).

It’s a bit of a stretch, or maybe an enormous stretch, but the hardware pieces are there which would allow a flavor of Julia that did JIT compilation to a processor+FPGA combo.


I think that was for “Unums 2.0”: https://ubiquity.acm.org/article.cfm?id=3001758; posits are (I believe) the third iteration.


Yes, posits do not rely on lookup tables; there are algorithms which work in hardware very similarly to floats. They just need a “count leading 0s or 1s” operation to decode the regime bits, but as far as I know from the posit hardware people I have spoken to (Peter Hofstee at IBM, for example), that is not a bottleneck.

Concerning lookup tables: I’m currently working on SoftSonum.jl - a self-organizing number format that learns from data. The idea is that you tell your number format what numbers you want to calculate and it will figure out what dynamic range is appropriate and where to put the precision. It is still based on ideas around the posit circle (page 16 ff.). Under the hood it works with lookup tables that have to be precomputed. For 16bit this is on the edge of what’s feasible (it requires 4GB of RAM), but for 8bit it is even attractive as a hardware-supported lookup table. I’m mainly planning to use it as a test case to learn what properties a somewhat “optimal” number format would have, and to better understand if and why posits are better than floats in the applications I have come across so far. If anyone wants to contribute to that project, you are very welcome to!
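To make the lookup-table idea concrete, here is a toy sketch in Julia. Nothing below is SoftSonum’s actual design: the value set and the `nearest`/`tadd` helpers are arbitrary stand-ins, just to show how arithmetic becomes a single table lookup:

```julia
# Toy tabulated format: numbers are indices into a fixed value table,
# and addition is one precomputed table lookup. The value set here is
# arbitrary; SoftSonum learns its table from data instead.
const VALS = sort!(unique([s * 2.0^e * m for s in (-1, 1),
                                             e in -4:4,
                                             m in (1.0, 1.25, 1.5, 1.75)]))
nearest(x) = argmin(abs.(VALS .- x))         # index of the closest value
const ADD = [nearest(VALS[i] + VALS[j])
             for i in eachindex(VALS), j in eachindex(VALS)]

tadd(i, j) = ADD[i, j]                       # addition is a single lookup
VALS[tadd(nearest(1.0), nearest(1.0))]       # 2.0
```

For an n-bit format the add table has 2ⁿ × 2ⁿ entries, which is why 16 bits (4GB) sits at the edge of feasibility while 8 bits (64KB) is hardware-friendly.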


I want to comment on the “Lack of scale invariance” point. What you describe as an advantage for floats, I regard as one of the bigger inefficiencies of floats, so yes, it’s probably a matter of perspective, but let me briefly explain mine: Having an approximately constant precision across the whole range of representable numbers is nice in the sense that a programmer doesn’t have to worry about where to perform computations. As long as no overflows or underflows occur, you can basically expect the result to be just as good as it gets with floats. This reminds me of the “just throw double precision at everything” attitude, which, don’t get me wrong, brought scientific computing very far, as one basically doesn’t have to worry about precision.

However, going down to 32bit, 16bit or even further is a question of optimizing performance/energy/storage etc. and in that sense putting a lot of precision where it’s rarely needed as floats do it, is not an optimal use of your available bit patterns. I would therefore like to rephrase the “Lack of scale invariance”-disadvantage you mention as an advantage for posits: In contrast to floats, posits allow you to rescale* your algorithms to make a more efficient use of higher precision around 1. At the same time, with posits you get a wide range of representable numbers to avoid hitting floatmin/max in case it turns out to be difficult to squeeze parts of your algorithm into a small range.

Yes, this calls for a paradigm shift in which it’s up to the programmer to have a rough idea of the numbers that occur and to scale* algorithms accordingly. I’m not talking about programmers who write general-purpose code but about the ones that use it for a specific application. Let me explain this by explaining the asterisks:

(*) Rescaling here means any of the following approaches that analytically lead to the same result, but may alter the result significantly when using finite-precision arithmetic: non-dimensionalization, actual scaling with a multiplicative constant, reordering the arithmetic operations, precomputing constants at high precision, and in general avoiding very large/small intermediate numbers.
Example: Say you want to compute energy via the Stefan-Boltzmann law; you could simply write

const σ = 5.67e-8
f(T) = σ*T^4

However, with Float16 this causes an overflow for T = O(100). One way would be to pull σ inside the fourth power and precompute σ^(1/4) (ideally at high precision),

const σ_onefourth = σ^(1/4)
f2(T) = (σ_onefourth*T)^4

Overflow problem solved. Another way would be to say, hmm, T = O(100) is not great so why not Tₛ = s*T, with s = 1/100 to have Tₛ = O(1), then

const σₛ = σ/s^4     # which is 5.67, so O(1)
f3(Tₛ) = σₛ*Tₛ^4

So, you end up with the same number of operations, but all O(1), therefore robust to overflows and you can make use of the higher precision of posits around 1.
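A quick self-contained check of the overflow behavior (it is the intermediate T^4 that blows up in Float16, not the final result; the definitions above are repeated so the snippet runs on its own):

```julia
const σ = 5.67e-8
f(T)  = σ * T^4                       # T^4 overflows Float16 before σ shrinks it

const σ_onefourth = Float16(σ^(1/4))  # precompute at high precision, round once
f2(T) = (σ_onefourth * T)^4           # every intermediate stays in range

T = Float16(200.0)
isinf(f(T))   # true: Float16(200)^4 ≈ 1.6e9 exceeds floatmax(Float16)
f2(T)         # ≈ 90.7, computed entirely in Float16
```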