Poor performance of multivariate RNG in 1.0

I am updating some code from 0.5.2 to 1.0 and found large drops in performance, largely due to the RNG.

I generated 10M Float64 draws from N(0,1) and MVN(0, I(10)), using both rand() and rand!().

After running everything a couple of times to rule out compilation (LLVM) warm-up, 1.0 performance was roughly comparable to 0.5 for univariate draws, but considerably worse for the multivariate RNG.

10M MVN draws with rand:
0.5.3: 6.931799 seconds (50.00 M allocations: 2.831 GB, 5.02% gc time)
1.0.0: 26.845489 seconds (80.00 M allocations: 3.725 GiB, 1.24% gc time)

10M MVN draws with rand!:
0.5.3: 1.700891 seconds (8 allocations: 304 bytes)
1.0.0: 2.069793 seconds (11 allocations: 400 bytes)

Example code:

using Distributions
VERSION.major < 1 || using Random, LinearAlgebra;

# draw n samples from distribution d, one allocating rand(d) call per draw
rng_test(n, d) = for i in 1:n; x = rand(d); end

mv_n = 10   # dimension of the multivariate normal
mvn = VERSION.major < 1 ? MvNormal(convert(Matrix{Float64}, diagm(fill(1, mv_n)))) : MvNormal(Matrix{Float64}(I, mv_n, mv_n));
its = 10^7  # number of draws

arr_mv = VERSION.major < 1 ? Array{Float64,2}(length(mvn), its) : Array{Float64,2}(undef, length(mvn), its);
arr_uni = zeros(its);

@time rand!(Normal(), arr_uni);  # univariate, in-place (rand!)
@time rng_test(its, Normal());   # univariate, allocating (rand)

@time rand!(mvn, arr_mv);        # multivariate, in-place (rand!)
@time rng_test(its, mvn);        # multivariate, allocating (rand)
@time rng_test(10^6, mvn);       # multivariate, allocating, 10^6 draws

Consider

  1. using BenchmarkTools.jl instead of @time, to measure only the runtime,

  2. running separate benchmarks for plain rand(Float64, ...), to see whether the issue comes from Distributions or from the RNG implementation (see the sketch below).
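For example, on 1.0 (a minimal sketch; the specific calls benchmarked here are my own choice, not taken from the original post):

using BenchmarkTools, Distributions, Random, LinearAlgebra

mvn = MvNormal(Matrix{Float64}(I, 10, 10))
x = Vector{Float64}(undef, 10)

@btime rand!($x);        # raw RNG: uniform Float64 draws
@btime randn!($x);       # raw RNG: standard normal draws
@btime rand!($mvn, $x);  # Distributions: one MvNormal draw, in place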

I can reproduce these results with BenchmarkTools.jl.

On v0.6.4 (the five timings correspond, in order, to the five benchmarks in the example code above)

  81.089 ms (0 allocations: 0 bytes)
  71.954 ms (0 allocations: 0 bytes)
  914.587 ms (2 allocations: 48 bytes)
  4.679 s (30000000 allocations: 1.94 GiB)
  467.012 ms (3000000 allocations: 198.36 MiB)

On v1.0.0

  64.882 ms (0 allocations: 0 bytes)
  60.341 ms (0 allocations: 0 bytes)
  813.071 ms (7 allocations: 240 bytes)
  11.989 s (80000000 allocations: 3.73 GiB)
  1.202 s (8000000 allocations: 381.47 MiB)

After quite a bit of digging, it appears to boil down to a change in the behaviour of accessing the Cholesky factors (via PDMats.jl's unwhiten!).

On v0.6.4 we have

julia> c = cholfact(eye(10)) ;

julia> @btime $(c)[:UL]
  68.935 ns (2 allocations: 48 bytes)

whereas on v1.0.0 we have

julia> c = cholesky(Matrix{Float64}(I, 10, 10)) ;

julia> @btime $(c).UL
  772.167 ns (5 allocations: 208 bytes)

and when you multiply this by a few million (the factor is accessed for each multivariate sample) it adds up to quite a difference: roughly 700 ns of extra overhead per draw × 10^7 draws ≈ 7 s, which matches the gap between the two allocating multivariate benchmarks above.
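To see how this accumulates, one can time the property access in a tight loop (a minimal sketch; access_ul is a hypothetical helper I made up for illustration, not part of the thread):

using LinearAlgebra, BenchmarkTools

C = cholesky(Matrix{Float64}(I, 10, 10))

# sum one entry so the C.UL access has to run on every iteration
function access_ul(C, n)
    s = 0.0
    for _ in 1:n
        s += C.UL[1, 1]
    end
    return s
end

@btime access_ul($C, 10^7);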

Anyone have any thoughts as to the solution?


OK, so the rabbit warren was somewhat deeper than I thought it was going to be… Interestingly, this all comes down to a performance regression in the method string(Char): the Cholesky getproperty method converts the 'U'/'L' uplo Char to a Symbol on every access, and that conversion goes through string(Char). I’ve just filed an issue - see https://github.com/JuliaLang/julia/issues/28870 for details.
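To look at the regression in isolation, a quick check could be (a sketch; I am not reproducing the timings from the issue here):

using BenchmarkTools

@btime string('U');   # the method that regressed
@btime Symbol('U');   # goes through string(Char)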


One final thing - you can patch this performance issue by including the function redefinition

function Base.getproperty(C::Cholesky, d::Symbol)
    Cfactors = getfield(C, :factors)
    Cuplo    = getfield(C, :uplo)
    info     = getfield(C, :info)
    if d == :U
        return UpperTriangular(Symbol(Cuplo) == d ? Cfactors : copy(Cfactors'))
    elseif d == :L
        return LowerTriangular(Symbol(Cuplo) == d ? Cfactors : copy(Cfactors'))
    elseif d == :UL
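        # compare the Char directly; this avoids the Symbol/string(Char) conversion that hits the regression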
        return (Cuplo == 'U' ? UpperTriangular(Cfactors) : LowerTriangular(Cfactors))
    else
        return getfield(C, d)
    end
end

in your code. This gives the following results.

On v0.6.4

  81.089 ms (0 allocations: 0 bytes)
  71.954 ms (0 allocations: 0 bytes)
  914.587 ms (2 allocations: 48 bytes)
  4.679 s (30000000 allocations: 1.94 GiB)
  467.012 ms (3000000 allocations: 198.36 MiB)

On v1.0.0 (with performance fix)

  65.157 ms (0 allocations: 0 bytes)
  60.566 ms (0 allocations: 0 bytes)
  823.296 ms (3 allocations: 48 bytes)
  2.725 s (40000000 allocations: 1.94 GiB)
  272.875 ms (4000000 allocations: 198.36 MiB)

So v1.0.0 can actually go quite a bit faster!


Very nice investigation, thanks!

Nice job.

This is actually a rather scary performance regression since (as this thread demonstrates so well) it has so much potential to creep into other code in all sorts of places. I hope they prioritize it.


Thanks for working this out and opening the issue!