DoubleFloats.jl -> Double128 | Quad64?

Agreed, Arblib is very, very young and you have the great opportunity :smiley: to be its first serious user. One thing that is lacking is making it play nicely with Julia's AbstractFloat protocol (I know there is nothing formal like this, but maybe someone has hints?). I opened an issue here: https://github.com/kalmarek/Arblib.jl/issues/101 so please feel free to drop in and report missing bits and pieces :slight_smile:
E.g., you need to define Base.expm1(x::T) where T<:Arblib.ArbOrRef = Arblib.expm1!(T(prec=precision(x)), x). I'm sure there will be much more.
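
To make the pattern concrete, here is a minimal sketch of that kind of glue method, assuming Arblib also wraps the in-place kernel log1p! with the same calling convention as expm1!:

import Arblib
using Arblib: Arb

# same pattern as the expm1 definition above
Base.log1p(x::T) where {T<:Arblib.ArbOrRef} = Arblib.log1p!(T(prec=precision(x)), x)

log1p(Arb(0.5, prec=256))   # dispatches to the Arb kernel at the argument's precision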

btw. instead of printing, it's much better to use @debug statements and enable/disable them by setting ENV["JULIA_DEBUG"] = "ThorinDistributions"
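
A minimal sketch of the idea (the message and variables are hypothetical, not from the package):

# inside ThorinDistributions, instead of print(...):
@debug "building coefficients" m n
# in the calling session, turn those messages on:
ENV["JULIA_DEBUG"] = "ThorinDistributions"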

Hey, you know what? I'll try the example above with Arblib and report issues right now.

I got these results running your example with the additional type Float128 from Quadmath.jl.

256 bits, 40x40
setprecision(BigFloat, 256)
setworkingprecision(ArbFloat, 256); setextrabits(0)

julia> display(Errors)
Dict{Type,Tuple{Float64,Float64,Float64}} with 6 entries:
Float64 => (1.01502e20, 0.277835, 1.37131e8)
ArbFloat => (-1.5526e-6, 1.94999, 2.22239e8)
Double64 => (0.000136315, 0.352428, 2.7385e8)
BigFloat => (0.000182493, 1.53269, 2.9704e8)
Float128 => (6.09923e-5, 0.510003, 2.95367e8)

There was a mistake in the code I posted: the Random.seed!() call needs to be inside the loop, otherwise we are comparing on different data.

I glanced over MultivariateGammaConvolution and the first thing I see is that you use abstract fields:
https://github.com/lrnv/ThorinDistributions.jl/blob/735be65ec76ca98968ac0e7f7bf38805e7a52f89/src/MultivariateGammaConvolution.jl#L21
it’d be much better to parametrize it concretely:

struct MultivariateGammaConvolution{T<:Real, V<:AbstractVector{T}, M<:AbstractMatrix{T}, XXX} <: Distributions.ContinuousMultivariateDistribution
    α::V
    θ::M
    constants::XXX
end

then these functions can be made type-stable:

#### eltype, length, support
Base.eltype(d::MultivariateGammaConvolution{T}) where {T} = T
Base.length(d::MultivariateGammaConvolution) = size(d.θ,2)
Distributions.insupport(d::MultivariateGammaConvolution, x) = all(>(zero(eltype(x))), x)
    # or all(>(zero(first(x))), x) ← this will probably be better when using Arbs that come with their own precision

#### Sampling
struct MGCSPL <: Distributions.Sampleable{Distributions.Multivariate,Distributions.Continuous}
    Γs::Vector{MGSPL} # is MGSPL concrete?
end
Distributions.sampler(d::MultivariateGammaConvolution) = MGCSPL([Distributions.sampler(MultivariateGamma(d.α[i],d.θ[i,:])) for i in 1:length(d.α)])

and so on
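
To round off the sampling part, a minimal sketch of the matching Distributions._rand! method, assuming a draw from the convolution is the sum of one draw from each component sampler (MGCSPL and its field Γs are taken from the snippet above):

import Random, Distributions

function Distributions._rand!(rng::Random.AbstractRNG, s::MGCSPL, x::AbstractVector{<:Real})
    fill!(x, zero(eltype(x)))
    for γ in s.Γs
        x .+= rand(rng, γ)   # add one draw from each component multivariate gamma
    end
    return x
end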

Thanks for taking the time!

This is not the first time I hear that. Could you explain to me / point me to the doc where it is explained why?

Following your advice, I'll probably rewrite the whole package :wink:

Performance Tips

So, use Float128 from Quadmath.jl.

Yeah, no. As I said, 128 bits is not enough. There was a mistake in my MWE: we were comparing on different data. This is fixed in this version:

MWE:
# You should ]add github.com/lrnv/ThorinDistributions.jl
import Random, ThorinDistributions

using DoubleFloats
using ArbNumerics
using Quadmath
setprecision(256)
setprecision(ArbFloat,256)

types = (Float64,
         Double64, 
         Float128, 
         BigFloat, 
         ArbFloat
         )

n = 20
d = 2
m = Tuple(repeat([80],d))
N = 1000

Random.seed!(1234)
α = rand(n)
θ = reshape(10 .* rand(n*d),(n,d))

init = (0.0, 0.0, 0.0)
init_val = (zeros(m),zeros(m))

Values = Dict(T => init_val for T in types)
Errors = Dict(T => init for T in types)
for T in types
    print(T,"\n")

    alpha = T.(α);
    scales = T.(θ);
    dist = ThorinDistributions.MultivariateGammaConvolution(alpha,scales);
    samples = zeros(T,(2,N));
    Random.seed!(123);
    Random.rand!(dist,samples);

    E = ThorinDistributions.empirical_coefs(samples,m); # empirical coefs
    coefs = ThorinDistributions.get_coefficients(alpha, scales,m); # theoretical coefs
    aloc = @allocated ThorinDistributions.get_coefficients(alpha,scales,m);
    time = Base.@elapsed(ThorinDistributions.get_coefficients(alpha,scales,m));
    

    # All these should have the same magnitude, very small.
    Errors[T] = (Float64(sum((E - coefs)^2)),time,aloc);
    Values[T] = (Float64.(E), Float64.(coefs));
    print(Errors[T],"\n")
end

# Type => (Error, time, allocations)
display(Errors)

ArbFloat is not working properly at the moment (it's probably my fault), but I do have the following results (after a few runs to get everything compiled):

  Float128 => (4.51278e74, 1.77423, 467600.0)
  BigFloat => (2.48274e-8, 11.7816, 8.16053e9)
  Double64 => (2.10301e78, 0.470318, 467600.0)
  Float64  => (2.68084e110, 0.121068, 286112.0)

11 seconds to evaluate the cost function is simply unacceptable, I cannot afford this (meaning: this is big enough that I can spend a few weeks trying to solve this issue :rofl: ). Therefore the problem remains. As you said earlier, this may be fixable by an algorithm change, so that is what I'm trying to do right now (without a lot of success…). The source of overflow is in the ThorinDistributions.build_coefficients! function if you wanna take a look.

One thing you could do is use view or @view (see online help) with your precomputed matrices and vectors. Currently, it appears that your code is doing a good deal of copying that can be avoided this way. Create the views where you precompute them, and use the names of those views rather than the names of the underlying arrays in your computational loops.
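
A minimal sketch of the idea (the array and variable names are placeholders, not from ThorinDistributions):

A = BigFloat.(rand(100, 100))   # stand-in for a precomputed matrix
row_copy = A[1, :]              # indexing with a range allocates a new array
row_view = @view A[1, :]        # a view reuses A's storage, no copy
s = sum(row_view)               # use the view's name inside the hot loops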

Okay, seems like a good idea. Anyway, the main bottleneck of my code, when running in BigFloats, is Base.MPFR.*, taking approx 80% of my runtime.

I am trying to change the algorithm to both avoid overflow and reduce the number of multiplications, without success for the moment.

I don’t know how much of that overhead is from quietly copying BigFloats; if it is substantial, using views may help overall.

Looks like most of the overhead is in the BigFloat products; the copying overhead happens inside the BigFloat multiplication. Anyway, a dirty @view fix reduced the runtime by 5-8%, which is still something.

Okay, I might have something. Using MultiFloats.jl, with a personal (dirty) fix for exp / log, I have the following :stuck_out_tongue:

Code that I ran
# ]add github.com/lrnv/ThorinDistributions.jl
# ]add https://github.com/lrnv/MultiFloats.jl
import Random, ThorinDistributions

using DoubleFloats
using ArbNumerics
using Quadmath
using MultiFloats
setprecision(256)
setprecision(ArbFloat,256)

types = (Float64,
         Double64, 
         Float128, 
         BigFloat, 
         Float64x1,
         Float64x2,
         Float64x3,
         Float64x4,
         Float64x5,
         Float64x6,
         Float64x7,
         Float64x8
         )

n = 20
d = 2
m = Tuple(repeat([80],d))
N = 1000

Random.seed!(1234)
α = rand(n)
θ = reshape(10 .* rand(n*d),(n,d))

init = (0.0, 0.0, 0.0)
init_val = (zeros(m),zeros(m))

Values = Dict(T => init_val for T in types)
Errors = Dict(T => init for T in types)
for T in types
    print(T,"\n")

    alpha = T.(α);
    scales = T.(θ);
    dist = ThorinDistributions.MultivariateGammaConvolution(alpha,scales);
    samples = zeros(T,(2,N));
    Random.seed!(123);
    Random.rand!(dist,samples);

    E = ThorinDistributions.empirical_coefs(samples,m); # empirical coefs
    coefs = ThorinDistributions.get_coefficients(alpha, scales,m); # theoretical coefs
    aloc = @allocated ThorinDistributions.get_coefficients(alpha,scales,m);
    time = Base.@elapsed(ThorinDistributions.get_coefficients(alpha,scales,m));
    

    # All these should have the same magnitude, very small.
    Errors[T] = (Float64(sum((E - coefs)^2)),time,aloc);
    Values[T] = (Float64.(E), Float64.(coefs));
    print(Errors[T],"\n")
end

# Type => (Error, time, allocations)
display(Errors)

Which gives me the following results:

  Float64               => (2.68084e110, 0.121672, 286112.0)
  Float128              => (4.51278e74, 1.84456, 467600.0)
  Double64              => (2.10301e78, 0.516504, 467600.0)
  MultiFloat{Float64,1} => (4.1367e109, 0.124149, 296656.0)
  MultiFloat{Float64,2} => (3.84925e78, 0.26139, 486208.0)
  MultiFloat{Float64,3} => (-6.13933e44, 0.66715, 676464.0)
  MultiFloat{Float64,4} => (3.1583e14, 1.33451, 864800.0)
  MultiFloat{Float64,5} => (2.367e-8, 2.39511, 1.05234e6)
  MultiFloat{Float64,6} => (2.36701e-8, 4.46531, 1.25013e6)
  MultiFloat{Float64,7} => (2.36701e-8, 6.50383, 1.44102e6)
  MultiFloat{Float64,8} => (2.36701e-8, 9.84249, 1.63157e6)
  BigFloat              => (2.48274e-8, 12.6092, 8.16053e9)

It seems like Float64x5 might be enough for this case. The dirty fix I used: exp/log were not implemented, so I used a conversion to BigFloat and back to reuse the BigFloat implementation. @JeffreySarnoff Do you know if there is some cleaner algorithm that could be implemented for exp/log in this library?
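
For reference, a minimal sketch of that round-trip fallback (my actual fix lives in the fork linked above; this assumes MultiFloats provides conversions to and from BigFloat):

using MultiFloats

setprecision(BigFloat, 512)   # give BigFloat more bits than the widest MultiFloat used

# dirty fallback: evaluate in BigFloat, convert back
Base.exp(x::MultiFloat{Float64,N}) where {N} = MultiFloat{Float64,N}(exp(BigFloat(x)))
Base.log(x::MultiFloat{Float64,N}) where {N} = MultiFloat{Float64,N}(log(BigFloat(x)))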

Here is a faster way to multiply BigFloats (it avoids the internal allocation).

function mul!(z::BigFloat, x::BigFloat, y::BigFloat)
  ccall(("mpfr_mul",:libmpfr), Int32, (Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Base.MPFR.MPFRRoundingMode), z, x, y, Base.MPFR.ROUNDING_MODE[])
  return z
end

# or

function mul!(z::BigFloat, x::BigFloat, y::BigFloat)
  ccall(("mpfr_mul",:libmpfr), Int32, (Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Base.MPFR.MPFRRoundingMode), z, x, y, Base.MPFR.ROUNDING_MODE[])
  return nothing
end

z=BigFloat(); x = BigFloat(pi); y = sqrt(BigFloat(2));
mul!(z, x, y)
z

(about 1.3x faster on a single multiply)
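
A quick way to check this on your own machine (BenchmarkTools assumed; the exact speedup will vary):

using BenchmarkTools

z = BigFloat(); x = BigFloat(pi); y = sqrt(BigFloat(2));
@btime $x * $y;           # allocates a fresh BigFloat on every call
@btime mul!($z, $x, $y);  # reuses z, no result allocation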

This is the first time I have seen MultiFloats.jl. The algorithms for exp and log are interdependent and, for extended precision, generally difficult because the algebraic equivalences/relationships do not hold in floating-point representations.

Seems great. Is there some equivalent hack for the addmul of 3 or 4 BigFloats at once? The main overhead of my code is in two lines, of the form x += z * y * t * u and x += z * y * t.

Here is the equivalent for +:

function add!(z::BigFloat, x::BigFloat, y::BigFloat)
  ccall(("mpfr_add",:libmpfr), Int32, (Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Base.MPFR.MPFRRoundingMode), z, x, y, Base.MPFR.ROUNDING_MODE[])
  return z
end

This gives you z = fma(x, y, k), i.e. z = x*y + k:

function fma!(z::BigFloat, x::BigFloat, y::BigFloat, k::BigFloat)
  ccall(("mpfr_fma",:libmpfr), Int32, (Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Base.MPFR.MPFRRoundingMode), z, x, y, k, Base.MPFR.ROUNDING_MODE[])
  return z
end

This gives you z = x1*y1 + x2*y2 (works like fma does):

function fmma!(z::BigFloat, x1::BigFloat, y1::BigFloat, x2::BigFloat, y2::BigFloat)
  ccall(("mpfr_fmma",:libmpfr), Int32, (Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Ref{BigFloat}, Base.MPFR.MPFRRoundingMode), z, x1, y1, x2, y2, Base.MPFR.ROUNDING_MODE[])
  return z
end

That’s all there is in the library that works directly.

So for the rest I should compose these, right?
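
E.g., for x += z * y * t, something like this sketch (muladd3! is just an illustrative name; I believe MPFR allows the output to alias an input):

function muladd3!(x::BigFloat, z::BigFloat, y::BigFloat, t::BigFloat, tmp::BigFloat)
  mul!(tmp, z, y)      # tmp = z * y
  fma!(x, tmp, t, x)   # x = tmp * t + x  (x aliases the addend)
  return x
end

tmp = BigFloat()   # allocate once, reuse across loop iterations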