DoubleFloats.jl -> Double128 | Quad64?


I need some kind of high-precision arithmetic to do computations, and I’m trying to find the best balance between runtime and precision. Using BigFloat, I found out that 256-bit BigFloats give me results good enough for my algorithm to converge, but these are painfully slow.

For 128-bit computations, the DoubleFloats package gives stunning performance with its Double64 type, which is 50x faster than BigFloat at 128 bits for my use case. Unfortunately, 128 bits is not quite enough and I’d like to switch to a 256-bit alternative.

I was wondering if the techniques used in DoubleFloats.jl could be recursed to build a Double-Double64 type? It would probably still be faster than BigFloat at 256 bits. Or maybe somewhere in the wild there already exists a Quad64 type that corresponds to this? Or simply a Double128?


This can be done, but the implementations of some things (mainly special functions) get increasingly complicated. One thing to check out is ArbFloats, which is often faster in this precision range.
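For reference, the core trick behind the double-double approach is the error-free transformation: a rounded result plus an exactly representable error term. A minimal sketch (my own illustration, not DoubleFloats.jl’s actual code) of the idea, and how it could in principle be recursed:

```julia
# Error-free transformation (Knuth's TwoSum): a + b == s + e exactly,
# where s = fl(a + b) and e is the rounding error.
function two_sum(a::Float64, b::Float64)
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e
end

# Double-double addition: carry the error along as a second component.
# A "Quad64" would apply the same idea again to pairs of double-doubles.
function dd_add((ahi, alo), (bhi, blo))
    s, e = two_sum(ahi, bhi)
    e += alo + blo
    two_sum(s, e)  # renormalize into (hi, lo)
end
```

With this, `dd_add((1.0, 0.0), (1e-30, 0.0))` keeps the tiny term that plain `1.0 + 1e-30` would lose. Recursing the construction is conceptually simple; the hard part, as noted above, is the special functions.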


All I need is exp, log, +, -, * and /

I’ve checked ArbNumerics.jl, but it is still very slow (a 2x speedup compared to BigFloat, again on my problem), slow enough that the optimizer working on the function cannot give me anything.

How hard would it be to patch DoubleFloats.jl to get a Quad64 / Double128 type, even if the implementation is minimal (as I said, I don’t need many special functions)?


I am also interested in higher-than-quad non-arbitrary precision, so I’ve been tracking the landscape in that area.

First of all, you can try Float128 (from Quadmath), but it’s only marginally more accurate than Double64.

Secondly, there is the TripleFloats package. It seems a bit incomplete, but maybe it will be enough for the operations you need.

Finally, there is also QuadDouble in C++, which you could try to wrap (maybe with ccall) if you only need
exp, log, +, -, * and /, but beware that in most other languages, dispatch of these operations over vectors/matrices is not a given :smiley: There is also a Julia wrapper.


How hard would it be to patch DoubleFloats.jl to get a Quad64 / Double128 type, even if the implementation is minimal (as I said, I don’t need many special functions)?

It will probably be easiest to patch DoubleFloats to use two Float128s from Quadmath. I’m not sure how much work that would be, but the authors of DoubleFloats will probably be interested and might provide guidance. There was also a DoubleQuad library in Fortran, I think (double Float128s), by Bailey, but I can’t find it on his website anymore.

Small remark: Float128 is not the same as Double64; likewise, Float256, Double128 and Quad64 are not the same as each other.

Yeah, already tried that. Great, but not enough for me. I will look at TripleFloats.

It seems like the project (the wrapper) is quite old and has not been updated in a while. The problem is that wrapping C++ is probably not the solution I want, since it’s not as efficient regarding memory management as pure Julia.

Yeah, this looks nice. Maybe I will post an issue on the DoubleFloats.jl repo to ask for their opinion.

To my understanding, Float32, Float64 and Float128 have to respect a standard (IEEE 754), and DoubleX is a construct built upon FloatX. I also naively think that Double{Y} where Y = Double{X}, and Quad{X}, are the same thing, but I’m probably wrong.

Well, i really do not get it. I tried :

julia> using Quadmath

julia> using DoubleFloats

julia>  Double128 = DoubleFloats.DoubleFloat{Float128}

julia> Double128(1)
ERROR: MethodError: no method matching DoubleFloat{Float128}(::Int64)
Closest candidates are:
  DoubleFloat{Float128}(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  DoubleFloat{Float128}(::T) where T<:Number at boot.jl:716
  DoubleFloat{Float128}(::Rational{S}) where {S, T<:AbstractFloat} at rational.jl:113
Stacktrace:
 [1] top-level scope at REPL[26]:1

I don’t get it. typeof(1) == Int64, which is a subtype of both Number and Real :rofl:
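A likely explanation (my assumption from a quick look at the package, not verified in depth): DoubleFloats only implements its constructors and arithmetic for `T<:Base.IEEEFloat`, the union of the hardware float types, so `Float128` from Quadmath is not covered even though the type alias itself can be written:

```julia
# Base.IEEEFloat is a Union of the three hardware float types only,
# so Float128 from Quadmath is not included:
Base.IEEEFloat === Union{Float16, Float32, Float64}  # true
```

That would explain why no working constructor method is found, even though the generic fallback candidates are listed.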

I have not tested this package yet, but in their readme they claim that it allocates much less memory than BigFloat and is generally allocation-friendly.

That would be great. They probably wouldn’t want to introduce a dependency on Quadmath in DoubleFloats, so a new package (DoubleQuad.jl?) that depends on both might be the right solution. I would be interested and could provide some test cases.

To my understanding, Float32, Float64 and Float128 have to respect a standard (IEEE 754), and DoubleX is a construct built upon FloatX. I also naively think that Double{Y} where Y = Double{X}, and Quad{X}, are the same thing, but I’m probably wrong.

I think Float256 is also part of the same standard as Float128 (or we should probably call them binary128/binary256, more correctly). From what I understand, using two Float128s would give, for example, a smaller exponent range than binary256, just as Double64 has a smaller exponent range than Float128. As I understand from the DoubleFloats docs, Double{FloatX} has the same exponent range as FloatX, so using four Float64s would also give the same exponent range as Float64 (it’s done this way in the Quad-Double library of Bailey).
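This is easy to check in the REPL (assuming DoubleFloats and Quadmath are installed; exact printed values aside, the point is that Double64 inherits Float64’s exponent range while roughly doubling its precision):

```julia
using DoubleFloats, Quadmath

@show floatmax(Float64)    # exponent range of the hi component
@show floatmax(Double64)   # essentially the same range as Float64
@show floatmax(Float128)   # much wider exponent range

@show precision(Float64)   # 53 bits
@show precision(Double64)  # ~106 bits (two Float64 significands)
@show precision(Float128)  # 113 bits
```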

Okay, I’ll try that and come back to you with some test cases in a few days, or months, it depends :stuck_out_tongue:

That’s great. If you manage to write something, please post here or ping me in a PM; maybe I’ll at least somewhat be able to help :slight_smile:

Will do. In the meantime, I posted an issue there. Thanks for the guidance!


If you only need basic arithmetic (and log/exp) you may try Arblib.jl.
DoubleDouble floats are performant because they use machine instructions where possible and don’t allocate on every arithmetic operation. BigFloats do allocate. Most of the mutating interface to BigFloat is hidden in Julia (by design); Arblib.jl exposes all of it. Your functions may not look pretty (or readable), but as long as they do not allocate, you should be on the fast side. E.g. for a simple normpdf you get a ~4x speedup over BigFloat, and the answer is guaranteed (which might be totally irrelevant for you :slight_smile:).

using StatsFuns, Arblib, BenchmarkTools

function normpdf!(res::Arblib.ArbOrRef, x::Arblib.ArbOrRef; tmp = zero(res))
    Arblib.const_pi!(res)   # res = π
    Arblib.mul!(res, res, 2)
    Arblib.rsqrt!(res, res) # res = 1/sqrt(2π)

    Arblib.pow!(tmp, x, 2)
    Arblib.div!(tmp, tmp, -2)
    Arblib.exp!(tmp, tmp)   # tmp = exp((x^2)/-2)

    Arblib.mul!(res, res, tmp)
    return res
end

x = rand(); res, x_arb, tmp = Arb(), Arb(x), Arb();


julia> @btime normpdf($x)
  7.310 ns (0 allocations: 0 bytes)

julia> @btime normpdf($(big(x)))
  4.569 μs (21 allocations: 1.07 KiB)

julia> @btime normpdf!($res, $x_arb, tmp=$tmp)
  1.460 μs (1 allocation: 48 bytes)
[0.3292325887263669388296340295015865054219377965899316138747622936422947790099 +/- 5.82e-77]

I’m curious, what kind of algorithm are you using that is struggling to converge with Double64? Could it somehow be reformulated for numerical stability?

@abulak Well, I tried Arb already, and the time is still impossible for my problem. As you said, what I want is DoubleDouble so as to use machine instructions, which is where the speed comes from.

@baggepinnen Well, the function I am minimizing is the loss of a certain Laguerre expansion, whose coefficients are given in closed form from the parameters, but the closed form is highly combinatorial (and costly), producing a lot of overflows. Hence the optimizer cannot ‘see’ anything, and since the problem has a lot of local minima, it does not work.


Well, you tried ArbNumerics.jl, which in particular is terrible at handling conversions and promotions with Arbs and also follows an immutable design. But of course, without seeing the actual function there is no way for me to tell whether it will be better with Arblib.jl or not.

You are right. I’ll try tomorrow. If I understood correctly, I’ll have to rewrite the whole thing to take full advantage of Arblib, taking care of allocation and mutability by hand?

Indeed. If you can, share the function you’re trying to optimize, so that we can both have a look; the general pattern is exemplified in the function above.

Please post an issue on ArbNumerics detailing what you want from conversions and promotions.

It would be helpful to see an example of the computation you are doing. It may be the case that a structural change would give you the convergence you seek without requiring the computation to be performed with 256-bit precision. Or maybe not. Perhaps you can use Double64s to better constrain the interval of convergence, and then use higher precision within that interval. All this is speculative without a better understanding of the task at hand.
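As one example of such a structural change (a standard trick, offered without knowing whether it applies to this particular closed form): if the combinatorial formula is a big sum of huge products, carrying the terms in the log domain and combining them with log-sum-exp avoids most overflows:

```julia
# log(sum(exp.(logs))) computed without overflowing, by factoring out the max
function logsumexp(logs)
    m = maximum(logs)
    isfinite(m) || return m          # handle all -Inf or an Inf term
    m + log(sum(exp(l - m) for l in logs))
end

# Combining terms whose plain values would each overflow Float64:
logsumexp([800.0, 801.0, 799.0])     # finite, although exp(800) overflows
```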


The issue is not that they (conversions/promotions) don’t work: they’re general enough to work very well, but they are too generic, i.e. they lack fast paths:

julia> using ArbNumerics, Arblib, StatsFuns, BenchmarkTools

julia> x = rand(); x_an = ArbNumerics.ArbReal{256}(x); x_arb = Arblib.Arb(x); # global precision=256

julia> Arblib.set!(x::Arblib.ArbLike, ::Base.Irrational{:invsqrt2π}) = 
    (x[] = π; Arblib.mul!(x, x, 2); Arblib.rsqrt!(x, x); x) # needed for normpdf

julia> begin
       @info "normpdf: Float64, ArbReal, Arb"
       @btime normpdf($x)
       @btime normpdf($x_an)
       @btime normpdf($x_arb)
       @info "(x+1000)/2 + π: ArbReal, Arb, Arb-mutability"
       @btime ($x_an + 1000)/2 + π
       @btime ($x_arb + 1000)/2 + π
       a, b = Arb(), x_arb
       π_arb, res = Arb(π), Arb()
       @btime (Arblib.add!($a, $b, 1000); Arblib.div!($a, $a, 2); Arblib.add!($res, $a, $π_arb))
       @info "(x+1000.)/2. + π: ArbReal, Arb"
       @btime ($x_an + 1000.)/2. + π
       @btime ($x_arb + 1000.)/2. + π
       end
[ Info: normpdf: Float64, ArbReal, Arb
  6.903 ns (0 allocations: 0 bytes)
  3.465 μs (26 allocations: 1.35 KiB)
  1.768 μs (10 allocations: 592 bytes)
[ Info: (x+1000)/2 + π: ArbReal, Arb, Arb-mutability
  1.491 μs (22 allocations: 1.05 KiB)
  354.564 ns (8 allocations: 480 bytes)
  117.997 ns (0 allocations: 0 bytes)
[ Info: (x+1000.)/2. + π: ArbReal, Arb
  1.047 μs (14 allocations: 768 bytes)
  362.411 ns (8 allocations: 480 bytes)

@Irnv Those allocations in normpdf will bite you sooner or later, and then you will need to rewrite it in the mutating style.

BTW, I couldn’t reproduce the issue I once had with ArbNumerics and double integrals, since the functionality is no longer there.


The loss function that I am optimizing was previously posted there, but it is now incorporated in this package, with a type-agnostic interface.

This simple example should be enough to see the problem:

import Random, ThorinDistributions
using DoubleFloats
using ArbNumerics

n = 8
d = 2
m = (40,40);

alpha = rand(n)
scales = reshape(rand(n*d),(n,d))

Errors = Dict(Float64 => (0.0,1.0,0.0), Double64 => (0.0,1.0,0.0), BigFloat => (0.0,1.0,0.0), ArbFloat => (0.0, 1.0, 0.0))
for T in (Float64, Double64, BigFloat, ArbFloat)

    alpha = T.(alpha);
    scales = T.(scales);

    dist = ThorinDistributions.MultivariateGammaConvolution(alpha,scales); # renamed from `d` to avoid shadowing the dimension
    samples = zeros(T,(2,10000));

    E = ThorinDistributions.empirical_coefs(samples,m); # empirical coefs
    coefs = ThorinDistributions.get_coefficients(alpha, scales,m); # theoretical coefs
    time = Base.@elapsed(ThorinDistributions.get_coefficients(alpha,scales,m))
    aloc = @allocated ThorinDistributions.get_coefficients(alpha,scales,m)

    # All these should have the same magnitude, very small.
    Errors[T] = (Float64(sum((E .- coefs).^2)),time,aloc) # elementwise squared error
end

julia> # Type => (Error, time, allocations)

julia> display(Errors)
Dict{Type,Tuple{Float64,Float64,Float64}} with 4 entries:
  BigFloat => (6.09923e-5, 1.01388, 2.9704e8)
  Float64  => (1.01502e20, 0.320554, 1.37131e8)
  ArbFloat => (0.000182493, 0.893221, 2.22239e8)
  Double64 => (0.000136315, 0.397989, 2.7385e8)


E and coefs should be very close. Overflow occurs in both the empirical and theoretical coefs, and increasing m, n or d just makes it worse. Note that I did not manage to save the time returned by @btime, so the timings might be off, but I still ran the thing several times to be sure everything was compiled properly (the package does some pre-computations on each new type it encounters).

It is weird that ArbFloat gets a very poor result. I was expecting something more like BigFloat, since the precision should be the same.