Inlining of rand -- Vector vs nx1 Matrix

Using the latest Distributions and BenchmarkTools, I observed the following weird/unexpected behavior on Julia 1.6.1 with -O3:

julia> @benchmark rand(Beta(2, 2), rand(Poisson(10)))
BenchmarkTools.Trial: 
  memory estimate:  561 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.013 μs (0.00% GC)
  median time:      1.329 μs (0.00% GC)
  mean time:        1.362 μs (1.42% GC)
  maximum time:     68.390 μs (97.88% GC)
  --------------
  samples:          10000
  evals/sample:     24

julia> @benchmark rand(Beta(2, 2), rand(Poisson(10)), 1)
BenchmarkTools.Trial: 
  memory estimate:  149 bytes
  allocs estimate:  1
  --------------
  minimum time:     952.224 ns (0.00% GC)
  median time:      1.166 μs (0.00% GC)
  mean time:        1.174 μs (0.25% GC)
  maximum time:     16.207 μs (93.05% GC)
  --------------
  samples:          10000
  evals/sample:     58

Generating an Nx1 Matrix instead of a Vector seems to avoid several allocations and runs faster. Looking at the output of @code_typed reveals that the compiler inlines the Gamma-based sampling code only in the second case.
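Roughly how I compared the two (a sketch; the exact typed output depends on your Distributions and Julia versions):

using Distributions

d = Beta(2, 2)
@code_typed rand(d, 10)     # Vector method
@code_typed rand(d, 10, 1)  # Nx1 Matrix method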


Similar differences (the matrix version is about 20% faster and uses fewer allocations) can be observed on 1.4, 1.5, and 1.6, but not on 1.0.
(I never change the default optimization level.)

I think this is a benchmarking issue. Beta(2,2) and Poisson(10) are structs, so you are calling a constructor in each of those lines and benchmarking the constructor cost in addition to the vector vs. matrix question. On my computer (Fedora Linux, COPR build of 1.6.0), these benchmarks look different when I make each sampler a function:

sampler1() = rand(Beta(2,2), rand(Poisson(10)))    # 770.4 ns
sampler2() = rand(Beta(2,2), rand(Poisson(10)), 1) # 804.0 ns

The latter sampler does use fewer allocations, but oddly, if I preallocate the Beta(2,2) and Poisson(10) and pass them in as const globals, the allocation counts become the same as well. So I think the answer here is just that this level of micro-benchmarking is tricky.
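For concreteness, a sketch of the const-globals variant I mean (the names B, P, vecsampler, and matsampler are mine):

using Distributions, BenchmarkTools

const B = Beta(2, 2)
const P = Poisson(10)

vecsampler() = rand(B, rand(P))      # Vector result
matsampler() = rand(B, rand(P), 1)   # Nx1 Matrix result

@benchmark vecsampler()
@benchmark matsampler()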


I used an old-school benchmark with the script below

using Distributions

let
  # draw Vectors of Poisson-distributed length until at least n samples
  # have been accumulated, and return the total count
  function vec(n)
    d1 = Beta(2, 2)
    d2 = Poisson(10)
    le = 0
    while true
      (le > n) && break
      le += length(rand(d1, rand(d2)))
    end
    le
  end

  # same loop, but drawing Nx1 Matrices instead of Vectors
  function mat(n)
    d1 = Beta(2, 2)
    d2 = Poisson(10)
    le = 0
    while true
      (le > n) && break
      le += length(rand(d1, (rand(d2), 1)))
    end
    le
  end

  # warm up both functions so compilation time is not measured
  vec(10)
  mat(10)

  tmat, tvec = 0.0, 0.0
  for k in 1:16
    t = time()
    vec(2^k)
    tvec += time() - t

    t = time()
    mat(2^k)
    tmat += time() - t
  end
  println("mat:", tmat)
  println("vec:", tvec)
end

and got those 20% differences. Is there a constructor issue here?

Yes, if we move d1 and d2 out of the functions and pass them as parameters, the difference disappears…
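Concretely, something like this (vec2 and mat2 are just my names for the parameterized versions of the loops above):

# the distributions are now arguments instead of being constructed inside
function vec2(n, d1, d2)
  le = 0
  while true
    (le > n) && break
    le += length(rand(d1, rand(d2)))
  end
  le
end

function mat2(n, d1, d2)
  le = 0
  while true
    (le > n) && break
    le += length(rand(d1, (rand(d2), 1)))
  end
  le
end

# timed the same way as before, e.g. vec2(2^k, Beta(2, 2), Poisson(10))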

You are totally right, I should have preallocated the samplers for the distributions. The vector case @benchmark rand($(sampler(Beta(2, 2))), rand($(sampler(Poisson(10))))) and the corresponding Nx1 matrix version perform the same. Furthermore, the performance is pretty much identical to the in-place version below, with the samplers preallocated via a macro:

using Distributions, Random

# evaluate the distribution expression at macro-expansion time and splice
# the resulting sampler into the code as a constant
macro sampler(exp)
    sampler(eval(exp))
end

function blub!(xs)
    # draw a Poisson length, resize the buffer, and fill it in place
    N = rand(@sampler Poisson(10))
    resize!(xs, N)
    rand!((@sampler Beta(2, 2)), xs)
end
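
A usage sketch (the buffer name xs is arbitrary):

using BenchmarkTools

xs = Float64[]
blub!(xs)              # xs now holds N Beta(2, 2) samples, with N drawn from Poisson(10)
@benchmark blub!($xs)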

So the conclusion here seems to be that the compiler was only able to optimize the matrix version on its own, but manually preallocating the samplers not only makes the difference go away, it also improves performance.
