Overview of different number representations, philosophies and use cases


there is some activity to implement different number formats other than Rational and float. All approaches known to me so far try to tackle a fundamental tradeoff:
We want our calculations to be fast and accurate. However, on real computing machines, we cannot have both.

Every digital design textbook treats number representation as Int and IEEE 754 floats, but most of them dont mind (software) approaches to more accurate (=rational) alternatives.

The purpose of this topic is to prepare a part of documentation or report, which
gives users / programmers / scientists advice for selecting the best number representation / package for their julia project and potentially steer future efforts in implementing number formats in julia.
My idea is to collect existing packages here, discuss background information and categorize / generalize different existing approaches.
(Maybe some digital design basics are needed as background information, as computer engineer I’m surely biased.)

Contributions are very welcome!

In the following, dynamic is a measure of the greatest and smallest values the number types can represent. If we only have values close together in our calculations, we need low dynamic. If we deal with values which differ by orders of magnitude, we need higher dynamic.

We are used to do arithmeric over the real numbers. But with a finite number of bits we can only
represent rational numbers exactly. Assume that we can represent all reals of interest with big or small enough numerators and denominators.
Note that with finite number of bits we loose associativity and commutativity, so the compiler is in general not allowed to reformulate our calculation order.

Fixed point numbers

(fast side of the medal)

With a fixed “decimal” point or fixed denominator, the underlying representation (bits) can be simplified to just integers. Artihmetic for integers is fast.
We can only add two fixed point numbers, if they have the same denominator or fractional part. The julias typesystem can be used to ensure we dont mix different fractional part up.
Multiplication is the same as for integers, often we want a rescaling afterwards.
When we choose the denominator or fractional part as a power of 2, the expensive rescaling becomes a trivial bit shift operation.
The dynamic is fixed, so calculations may easily produce values which will not fit the to the input representation. The so introduced errors (over-/underflow and truncation / roundoff) need extra attention.
Often, but not always, you have some a priori knowledge of the domain / inputs, which helps in the selection of needed number of bits and selection of introduced error behavior.
Trading performance for accuracy, one may also dynamically check if an error occured and take some counter measures.


(more accurate side of the medal)
When we allow the denominator to also change, we can greatly increase the dynamic. So we can represent a broader range of values. Without restrictions on the values of numerator or denominator, the represented value V=n/d is exact (as with Base.Rational{BigInt})
Artihmetic now needs some extra steps you may recognise from elementary school: E.g. if we add two numbers: First we need to calculate a common denominator, than rescale the numerators, add them and rescale / reduce the result. The timely part here is the rescaling / reduction step: For the reduced representation one needs to calculate the greatest common denominator (GCD) of numerator and denominator (no hope for significant speedups here).

Some packages try to avoid the GCD part as long as they can or avoid it completely (allowing silent errors).

Floating point numbers

For floating points we restrict the denominator to powers of two. This allows as to largely increase the dynamic, as we can use an integer e to represent 2^e for the denominator.
The extra steps to obtain common denominators are still needed. But the restriction to powers of two denominators, these procedures can be implemented nicer in digital hardware than for the general case.

The IEEE 754 standart was introduced as a common, machine-indepentent representation. This enables (hardware) implementations of the mentioned extra steps, so users dont need to think about any floating point representation details by default.
On all (?) desktop computers today we have supporting hardware which allows reasonably fast artihmetic with IEEE 754 floats. However, an floating point operation may take say 2x…10x longer than an integer / fixed point operation.

Logarithmic numbers

Numbers can also be represented by the logarithm of their magnitude. Such a representation has really high dynamic from a fixed number of bits. They are used in domains where relevant values lie several orders of magnitudes apart, e.g. in some signal processing, probability or machine learing.
Operations like multiplication, division, roots or powers are relatively cheap.
But the steps needed to perform addition and substraction are much more expensive than for floating points.

1 Like

Also probably worth mentioning Posits.


Interesting topic! I think there are lots of other things to think about, including:

  • how various representations handle overflow, underflow, and rounding (I’d argue that underflow and rounding are nearly linked)
  • to what degree a “unitful” quantity should mimic the traits of the raw number type used to store such values

This has been coming up a lot lately in the context of DateTime arithmetic: DateTime + Nanosecond loses information · Issue #50785 · JuliaLang/julia · GitHub and `DateTime` arithmetic should not have wraparound behavior · Issue #50930 · JuliaLang/julia · GitHub.

Some interesting packages:

I don’t know of a package that combines modifications to both traits, underflow and overflow.

There are lots of interesting thoughts one develops when one thinks about this. Python models int more like a fixed-point numbers than entities used for counting:

>>> int(3.2)

(so in Python, int attempts to model the real number line, and rounds-to-nearest) whereas we do this:

julia> Int(3.2)
ERROR: InexactError: Int64(3.2)
 [1] Int64(x::Float64)
   @ Base ./float.jl:900
 [2] top-level scope
   @ REPL[1]:1

So for us, integers are about counting things.

Both Julia and python use Int/int for indexing. Does that mean that python should support indexing like this?

>>> "cat"[1.4]
<stdin>:1: SyntaxWarning: str indices must be integers or slices, not float; perhaps you missed a comma?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: string indices must be integers
>>> "cat"[int(1.4)]
1 Like

Another example – from the very slow end:
Computable Reals (Implementation is in Common Lisp though)

I consider this to be the better alternative. My workflow often starts with stating the domains of variables in a mathematical sense. At least in my head, at least to the point that I can distinguish logics / counting from “number crunching”.

This reminds me of a half-page long footnote from the first chapter of the Wizard Book:

There was / is also a discussion on what a number in Julia is / should be: Interface for number. I’m unsure if the proposed interface / testset exists now. I think my intention was focused practical scalars used for number crunching: T <: Real. Hopefully more advanced stuff like quaternions can be parameterized with underlying T.

Different number formats come to my mind only when I have to think of some sort of scaling or restrictions: How long will these calculations take with large inputs? Will it work on a microcontroller, accelerator, etc.? Is inexactness a problem, what can go wrong? Such things…

1 Like

Right, and I DO believe we can have both, on current hardware (not Posit hardware which already exists though). We can at least do better.

I’ve been meaning to implement rather than just discuss, but since you ask about philosophies (and representations), there e.g. also:

It bugs be that 1/10, 1/3 and more isn’t exact in general. I think the solution is fast rationals that degenerate to floats and track number of significant digits. And measured values should also use Unitful.jl e.g. 10 m/s / 5 s = 2 m/s².

With 8 bits you can have binary fixed point (already available in a package), but I rather like decimal (and know of no fixed-point decimal package yet, it’s understandable, because it’s awkward without my idea, e.g. in DecFP). I want to unify integers, rationals, floats, SIMD, complex numbers (and higher-complex) possibly, and matrices in one 64-bit type (in ultimate implementation it could represent a pointer to e.g. a matrix, but hopefully not often for e.g. rationals, though needed for e.g. BigInt rationals). E.g. 4x 10-bit decimal fixed-point from -10.24 to +10.23 (I might also exclude -10.24, and use the bitpattern for a NaN), i.e. 3 significant decimal digits for each of the 4 numbers, and it should be enough for most inputs (and outputs?). That leaves 24 bits for a shared common denominator, and other ideas I don’t go into here. If you multiply two numbers, then the numbers expand to 20 bits of numerator each, no rounding or overflow check needed. The type system keeps track.

This is still not fewer bits than Float16, but never loses accuracy, until you take a square root (or other fractional power), that I’ve also thought of.

1 Like

With respect, for 99% of use cases in scientific computing, tackling accuracy from the side of number representations is a distraction.

Float64 should be the starting point, and is reasonably accurate most applications. When it isn’t, think about the conditioning of your algorithm and transforming the problem domain, not the number representation. If you do this, you may find that you can get away with Float32 or even less, but again, for most applications this is not necessary (unless they are really constrained by hardware, usually RAM).

Eg working in the log domain is very helpful for a lot of problems, so is working with alternative representations and factorizations of your matrices, and so on. Each field has a lot of specific tricks.

I am not saying that 128-bit floats and bigfloats are not useful for some cases. Unit testing your algorithms written using good old Float64 is the most common use case. And then there is that <1% of problems that may require more precision. But throwing a lot of precision at an ill-conditioned algorithm will not give you robust solution, change some parameters and it is very easy to run out of precision you have just doubled, quadrupled, etc.