I am updating some code from 0.5.2 to 1.0 and found large drops in performance, largely due to RNG.
I generated 10M Float64 draws from N(0,1) and MVN(0, I(10)), with both rand() and rand!.
After running everything a couple of times first to rule out JIT compilation overhead, 1.0 performance was about comparable to 0.5 for univariate draws, but considerably worse for multivariate RNG.
10M MVN draws with rand:
0.5.3: 6.931799 seconds (50.00 M allocations: 2.831 GB, 5.02% gc time)
1.0.0: 26.845489 seconds (80.00 M allocations: 3.725 GiB, 1.24% gc time)
10M MVN draws with rand!:
0.5.3: 1.700891 seconds (8 allocations: 304 bytes)
1.0.0: 2.069793 seconds (11 allocations: 400 bytes)
Example code:
using Distributions
if VERSION.major >= 1; using Random, LinearAlgebra; end  # stdlibs split out of Base in 0.7/1.0
rng_test(n, d) = for i in 1:n; x = rand(d); end  # draw from d n times, one sample per call
mv_n=10
mvn = VERSION.major < 1 ? MvNormal(convert(Matrix{Float64},diagm(fill(1,mv_n)))) : MvNormal(Matrix{Float64}(I,mv_n,mv_n));
its=10^7 #number of draws
arr_mv = VERSION.major < 1 ? Array{Float64,2}(length(mvn), its) : Array{Float64,2}(undef, length(mvn), its);
arr_uni=zeros(its);
@time rand!(Normal(), arr_uni);    # univariate, mutating a preallocated array
@time a = rng_test(its, Normal()); # univariate, one rand(d) call per draw
@time rand!(mvn, arr_mv);          # multivariate, mutating a preallocated array
@time b = rng_test(its, mvn);      # multivariate, one rand(d) call per draw
@time rng_test(10^6, mvn);         # multivariate, one rand(d) call per draw, 1e6 draws
I can reproduce these results with BenchmarkTools.jl.
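For reference, a minimal sketch of the @btime calls, assuming the same setup (mvn, arr_uni, arr_mv, its, rng_test) as the code above; the five timings in each block below line up with these five calls, in order.
using BenchmarkTools
@btime rand!(Normal(), $arr_uni);
@btime rng_test($its, Normal());
@btime rand!($mvn, $arr_mv);
@btime rng_test($its, $mvn);
@btime rng_test(10^6, $mvn);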
On v0.6.4
81.089 ms (0 allocations: 0 bytes)
71.954 ms (0 allocations: 0 bytes)
914.587 ms (2 allocations: 48 bytes)
4.679 s (30000000 allocations: 1.94 GiB)
467.012 ms (3000000 allocations: 198.36 MiB)
On v1.0.0
64.882 ms (0 allocations: 0 bytes)
60.341 ms (0 allocations: 0 bytes)
813.071 ms (7 allocations: 240 bytes)
11.989 s (80000000 allocations: 3.73 GiB)
1.202 s (8000000 allocations: 381.47 MiB)
After quite a bit of digging it appears to boil down to a change in behaviour of the Cholesky decomposition (via PDMats.jl's unwhiten!).
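If you want to reproduce the digging, one rough way (a sketch, assuming the mvn and arr_mv defined in the code above) is to profile the mutating call:
using Profile
Profile.clear()
@profile rand!(mvn, arr_mv)
Profile.print()  # look for unwhiten! and the Cholesky getproperty frames near the top
The suspect property access is also easy to time in isolation.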
On v0.6.4 we have
julia> c = cholfact(eye(10)) ;
julia> @btime $(c)[:UL]
68.935 ns (2 allocations: 48 bytes)
whereas on v1.0.0 we have
julia> c = cholesky(Matrix{Float64}(I, 10, 10)) ;
julia> @btime $(c).UL
772.167 ns (5 allocations: 208 bytes)
and when you multiply this by a few million (the factor is accessed once per multivariate sample) it adds up to quite a difference: roughly 700 ns of extra overhead per access × 10^7 draws ≈ 7 s, which is most of the gap between the 4.679 s and 11.989 s timings above.
Anyone have any thoughts as to the solution?
OK, so the rabbit warren was somewhat deeper than I thought it was going to be… Interestingly, this all comes down to a performance regression in the method string(Char). I've just filed an issue - see https://github.com/JuliaLang/julia/issues/28870 for details.
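For what it's worth, you can see the underlying call in isolation; as far as I can tell the .UL access reaches string(Char) via Symbol(::Char). A quick sketch with BenchmarkTools:
using BenchmarkTools
c = 'U'
@btime Symbol($c)  # the Cholesky getproperty branches convert the uplo Char like this
@btime string($c)  # Symbol(::Char) goes through string(::Char), the method in the linked issue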
One final thing - you can patch this performance issue by including the function redefinition
function Base.getproperty(C::Cholesky, d::Symbol)
    Cfactors = getfield(C, :factors)
    Cuplo = getfield(C, :uplo)
    info = getfield(C, :info)
    if d == :U
        return UpperTriangular(Symbol(Cuplo) == d ? Cfactors : copy(Cfactors'))
    elseif d == :L
        return LowerTriangular(Symbol(Cuplo) == d ? Cfactors : copy(Cfactors'))
    elseif d == :UL
        # compare the uplo Char directly instead of going through Symbol(::Char)/string(::Char)
        return (Cuplo == 'U' ? UpperTriangular(Cfactors) : LowerTriangular(Cfactors))
    else
        return getfield(C, d)
    end
end
in your code. This gives the following results.
On v0.6.4
81.089 ms (0 allocations: 0 bytes)
71.954 ms (0 allocations: 0 bytes)
914.587 ms (2 allocations: 48 bytes)
4.679 s (30000000 allocations: 1.94 GiB)
467.012 ms (3000000 allocations: 198.36 MiB)
On v1.0.0 (with performance fix)
65.157 ms (0 allocations: 0 bytes)
60.566 ms (0 allocations: 0 bytes)
823.296 ms (3 allocations: 48 bytes)
2.725 s (40000000 allocations: 1.94 GiB)
272.875 ms (4000000 allocations: 198.36 MiB)
So v1.0.0 can actually go quite a bit faster!
Very nice investigation, thanks!
Nice job.
This is actually a rather scary performance regression since (as this thread demonstrates so well) it has so much potential to creep into other code in all sorts of places. I hope they prioritize it.
Thanks for working this out and opening the issue!