Julia vs. R Speed for sampling the Wishart distribution

Hi:
I was wondering if it is possible to speed up the following. I have found it takes much longer than in R, and the difference is not small: in Julia this takes 11 seconds, while in R it takes only 1.5 seconds.

function timing(samples)
    @timed for i in 1:samples
        inv_wish = rand(InverseWishart(20, eye(20)))
    end
end

timing(50000)

And the R code I am comparing against:

system.time(
    replicate(50000, rWishart(1, df = 20, Sigma = diag(20)))
)

It’s not a hundred percent clear to me what you’re trying to do. Is it:

using Distributions, LinearAlgebra

function timing(samples)
    for i in 1:samples
        # note that eye() has been deprecated in Julia 0.7
        inv_wish = rand(InverseWishart(20, Matrix{Float64}(1.0I, 20, 20)))
    end
end

# for benchmarking purposes it's best practice to use the BenchmarkTools package
using BenchmarkTools
@btime timing(5000)
# returns 928.716 ms (387181 allocations: 165.67 MiB)

Which version of Julia are you using? The fact that you’re using eye() to construct the identity suggests that you might want to consider updating!

3 Likes

Hi:
I am writing a Gibbs sampler and need to draw many samples from the Wishart distribution. The sampler is currently written in R, and for some reason the sampling seemed to be faster in R, although the Wishart function in R is coded in C.

Can you share your R and Julia code so we can compare?

Be sure to quote your code in backticks ``` so it's easier to read.

I wouldn't say "although the Wishart function in R is coded in C"; rather, I would suspect that the only reason R achieves decent speed here is that the function is coded in C (rather than in R!).

If you’re just interested in the speed of drawing a random sample from the Inverse Wishart you can just do

julia> using Distributions, LinearAlgebra, BenchmarkTools

julia> @btime rand(InverseWishart(20, Matrix{Float64}(1.0I, 20, 20)))
116.513 μs (65 allocations: 33.73 KiB)

EDIT: I just noticed I got the benchmark above wrong by a factor of ten, as I misread your code: with 50,000 samples it takes 9.3 s to run rather than 0.9 s (this compares to 2.5 s for running your R code on my machine).

You’re re-initializing the sampler for each sample. Just create it once, and reuse it as needed.

using Distributions, BenchmarkTools, LinearAlgebra

m = InverseWishart(20, Matrix{Float64}(1.0I, 20, 20))
@btime rand($m)   # 26.702 μs (45 allocations: 20.30 KiB)

Multiplied by 50,000 samples, this comes out to ~1.3 seconds, which is right on the mark with C. If R’s just calling a well-optimized C implementation, you’re not likely to see a huge speed-up on a single core in Julia (although this sort of routine would be very easy to parallelize).
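
For instance, a minimal threading sketch (not something benchmarked in this thread; it assumes Julia is started with multiple threads and a Distributions version whose rand accepts an RNG for matrix-variate distributions, and it gives each thread its own RNG so the draws don't race on the global one; the function name is just illustrative):

using Distributions, LinearAlgebra, Random, Base.Threads

# Draw n inverse-Wishart samples in parallel, one RNG per thread.
function threaded_samples(n)
    m    = InverseWishart(20, Matrix{Float64}(1.0I, 20, 20))
    rngs = [MersenneTwister() for _ in 1:nthreads()]
    out  = Vector{Matrix{Float64}}(undef, n)
    @threads for i in 1:n
        out[i] = rand(rngs[threadid()], m)
    end
    return out
end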

9 Likes

Wow, I've got a terribly slow machine apparently: even if I interpolate the sampler into @btime I still get 105 μs (45 allocations: 20.31 KiB)!

I have the code written in R and was planning to write it in Julia. I have written some things in Julia and noticed considerable speed gains, but not when I had to sample from the Wishart distribution for a covariance matrix. So I figured I'd ask whether there is a faster implementation before writing it in Julia.

Indeed, but I would think the problem is that R's rWishart directly returns the sample from the Wishart, whereas this does not; it would still need a rand() call for each iteration.

To follow up: for the Gibbs sampler I cannot use a single pre-initialized identity matrix, but need to draw a sample with a new scale matrix at each iteration, of which there will be thousands.

@stillyslalom are you able to run R code on your machine as well? I’m not sure whether the benchmarks are doing the same thing, but I find a considerable speed difference doing:

  1. In Julia:
julia> using BenchmarkTools, Distributions, LinearAlgebra

julia> m = InverseWishart(20, Matrix{Float64}(1.0I, 20, 20))

julia> @btime for i ∈ 1:50_000 rand($m) end
8.098 s (2928004 allocations: 1001.40 MiB)

julia> versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\ProgramData\LONNG9\atom\app-1.34.0\atom.exe" -a
  JULIA_NUM_THREADS = 2
  2. In R:
> system.time(replicate(50000, rWishart(1, df = 20,Sigma = diag(20))))
   user  system elapsed 
   2.03    0.17    2.22

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

So not quite the 8-fold gap that the OP found, but still a factor of about 3.6. I don't know enough about benchmarking in R to judge whether this is a useful comparison, but it looks like the scripts are roughly doing the same thing.

Maybe some of the benchmarking gurus like @bennedich or @foobar_lv2 or Distributions experts like @simonbyrne can weigh in on whether this is an artefact of measuring the wrong thing or there’s an actual difference in sampling speed!

1 Like

Not so sure. On the surface they are doing roughly the same thing, but the Julia implementation is fixed to an identity scale matrix for all the matrices generated. My R code happens to use the identity as well, but there one could just as easily use any matrix for the scale matrix, whereas here that would require calling InverseWishart() again with a new scale matrix (and in my Gibbs sampler the scale matrix is different at every iteration).
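
For illustration, a minimal sketch of the pattern I mean, with a placeholder scale-matrix update (the update of S below is purely illustrative, not the actual conditional in my model):

using Distributions, LinearAlgebra

# Gibbs-style loop in which the scale matrix changes every iteration, so the
# InverseWishart object has to be rebuilt before each draw.
function gibbs_sketch(iters)
    S = Matrix{Float64}(1.0I, 20, 20)             # initial scale matrix
    draws = Vector{Matrix{Float64}}(undef, iters)
    for i in 1:iters
        Sigma = rand(InverseWishart(20, S))       # covariance draw given the current scale
        S = Matrix(Hermitian((S .+ Sigma) ./ 2))  # placeholder update, kept symmetric positive definite
        draws[i] = Sigma
    end
    return draws
end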

This is still benchmarking a loop in global scope. Just do @btime rand($m) and multiply the resulting time by 50000 — the @btime macro already does a timing loop for you.

That being said, there are certainly C library routines exposed in R that are faster than existing library routines in Julia, simply because it takes a lot of engineering effort to optimize new implementations to compete with existing highly tuned C functions.

2 Likes

Fair - would this be better:

function samples()
    m = InverseWishart(20, Matrix{Float64}(1.0I, 20, 20))
    for i ∈ 1:50_000
        rand(m) 
    end
end

@btime samples()
7.906 s (2932796 allocations: 1001.49 MiB)

First of all, note that InverseWishart sampling is slower than Wishart sampling!
Wisharts can be sampled via Bartlett's decomposition, as I do below. Simply multiplying the Bartlett factor by its transpose generates the Wishart sample; for the InverseWishart, you also need to invert the matrix.
That is my approach below. I also use a different (faster) gamma random sampler than the one provided by Distributions.jl to generate the diagonal elements of the Bartlett decomposition.
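
For reference, the standard form of the Bartlett decomposition in the identity-scale case used below: if W ~ Wishart_p(ν, I), then W = A Aᵀ, where A is lower triangular with A[i,i] equal to the square root of a χ²_{ν−i+1} draw and A[i,j] ~ N(0,1) for i > j, all independent. An identity-scale inverse-Wishart draw is then W⁻¹ = A⁻ᵀ A⁻¹, which is exactly the extra inversion step mentioned above.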

First, baseline time:

julia> using BenchmarkTools, Distributions, LinearAlgebra

julia> @benchmark rand(InverseWishart(20, Matrix{Float64}(I,20,20)))
BenchmarkTools.Trial: 
  memory estimate:  33.50 KiB
  allocs estimate:  48
  --------------
  minimum time:     46.890 μs (0.00% GC)
  median time:      56.950 μs (0.00% GC)
  mean time:        64.514 μs (10.63% GC)
  maximum time:     46.167 ms (99.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

Now, a faster inverse Wishart:

using Random, LinearAlgebra

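# Gamma sampler using the Marsaglia–Tsang rejection method; for α < 1, the
# usual U^(1/α) * Gamma(α + 1) boost is applied.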
@inline randgamma(α) = randgamma(Random.GLOBAL_RNG, α)
@inline function randgamma(rng::AbstractRNG, α::T) where T
    α < one(T) ? exp(-randexp(rng, T)/α) * randgamma_g1(rng, α+one(T)) : randgamma_g1(rng, α)
end
@inline function randgamma_g1(rng::AbstractRNG, α::T) where {T}
    OneThird = one(T)/T(3)
    d = α - OneThird
    @fastmath c = OneThird / sqrt(d)
    @fastmath while true
        x = randn(rng, T)
        v = one(T) + c*x
        v < zero(T) && continue
        v3 = v^3
        dv3 = d*v3
        randexp(rng, T) > T(-0.5)*x^2 - d + dv3 - d*log(v3) && return dv3
    end
end
@inline randchisq(rng::AbstractRNG, ν::T) where {T} = T(2.0) * randgamma(rng, T(0.5)ν)
@inline randchisq(ν::T) where {T} = T(2.0) * randgamma(T(0.5)ν)
@inline randchisq(rng::AbstractRNG, ν::T) where {T <: Integer} = T(2) * randgamma(rng, ν/2)
@inline randchisq(ν::T) where {T <: Integer} = T(2) * randgamma(ν/2)
@inline randchi(rng::AbstractRNG, ν) = @fastmath sqrt(randchisq(rng, ν))
@inline randchi(ν) = @fastmath sqrt(randchisq(ν))



function randinversewishart(ν, N, Σ::UniformScaling{Bool})
    Σ.λ == false && return zeros(N,N)

    # Lower-triangular Bartlett factor of a Wishart(ν, I) draw:
    # χ entries on the diagonal, standard normals below it.
    out = Matrix{Float64}(undef, N, N)
    for nc ∈ 1:N
        out[nc,nc] = randchi(ν - nc + 1)
        for nr ∈ nc+1:N
            out[nr, nc] = randn()
        end
    end
    # potri! computes the inverse from the Cholesky (Bartlett) factor,
    # using only the lower triangle, giving the inverse-Wishart draw.
    Symmetric(LinearAlgebra.LAPACK.potri!('L', out), :L)
end

and a Wishart:

function randwishart(ν, N, Σ::UniformScaling)
    # Lower-triangular Bartlett factor, as above.
    A = zeros(N, N)
    for nc ∈ 1:N
        A[nc,nc] = randchi(ν - nc + 1)
        for nr ∈ nc+1:N
            A[nr, nc] = randn()
        end
    end
    # W = Σ.λ * A*A' via a rank-k update (only the lower triangle is filled).
    W = Matrix{Float64}(undef, N, N)
    LinearAlgebra.BLAS.syrk!('L', 'N', Float64(Σ.λ), A, 0.0, W)
    Symmetric(W, :L)
end

These can be optimized further, but we already get a decent speed boost, without needing anything outside of the standard library:

julia> @benchmark randinversewishart(20, 20, I)
BenchmarkTools.Trial: 
  memory estimate:  4.22 KiB
  allocs estimate:  62
  --------------
  minimum time:     10.810 μs (0.00% GC)
  median time:      15.220 μs (0.00% GC)
  mean time:        21.230 μs (27.15% GC)
  maximum time:     46.320 ms (99.92% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark randwishart(20, 20, I)
BenchmarkTools.Trial: 
  memory estimate:  7.47 KiB
  allocs estimate:  63
  --------------
  minimum time:     5.357 μs (0.00% GC)
  median time:      5.940 μs (0.00% GC)
  mean time:        7.267 μs (16.55% GC)
  maximum time:     7.478 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     6

Or, running the timing functions:

julia> function timing1(samples)
           @time  for i in 1:samples
               inv_wish = rand(InverseWishart(20, Matrix{Float64}(I,20,20)))
           end
       end
timing1 (generic function with 1 method)

julia> timing1(50000)
  3.008724 seconds (3.03 M allocations: 1.607 GiB, 2.48% gc time)

julia> function timing2(samples)
           @time  for i in 1:samples
               inv_wish = randinversewishart(20, 20, I)
           end
       end
timing2 (generic function with 1 method)

julia> timing2(50000)
  0.789239 seconds (3.10 M allocations: 205.994 MiB, 1.43% gc time)

julia> function timing3(samples)
           @time  for i in 1:samples
               inv_wish = randwishart(20, 20, I)
           end
       end
timing3 (generic function with 1 method)

julia> timing3(50000)
  0.316246 seconds (3.15 M allocations: 364.685 MiB, 5.36% gc time)
8 Likes

Just a note that if we’re timing 50,000 invocations in R, we should do the same in Julia. Doing @btime rand($m) and multiplying the result by 50,000 is not fair, since @btime returns the minimum time of many runs (10,000 runs for fast expressions), which can be a lot faster than the average run time:

julia> @btime for i = 1:50_000 rand($m) end
  1.925 s (2929875 allocations: 1001.43 MiB)

julia> @btime rand($m);
  23.484 μs (46 allocations: 20.31 KiB)

julia> "$(50_000 * 23.484 / 1e6) s"
"1.1742 s"

@benchmark shows the difference between fastest and average run times:

julia> @benchmark rand($m)
BenchmarkTools.Trial:
  memory estimate:  20.31 KiB
  allocs estimate:  46
  --------------
  minimum time:     23.794 μs (0.00% GC)
  median time:      32.589 μs (0.00% GC)
  mean time:        41.307 μs (14.05% GC)
  maximum time:     42.114 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     1
1 Like

I believe that @btime returns the average time, not the minimum.

Although your results seem to show something different…

I’m pretty sure it’s minimum:

help?> @btime
  @btime expression [other parameters...]
  ...
  The printed time is the *minimum* elapsed time measured during the benchmark.

And the source, BenchmarkTools.jl/execution.jl at master (JuliaCI/BenchmarkTools.jl on GitHub):

BenchmarkTools.minimum($trial)
1 Like

I’m sure the sampling of those could be improved. PRs welcome!