Inlining of rand -- Vector vs nx1 Matrix

Using the latest Distributions and BenchmarkTools, I observed the following weird/unexpected behavior on Julia 1.6.1 with -O3:

julia> @benchmark rand(Beta(2, 2), rand(Poisson(10)))
BenchmarkTools.Trial: 
  memory estimate:  561 bytes
  allocs estimate:  6
  --------------
  minimum time:     1.013 μs (0.00% GC)
  median time:      1.329 μs (0.00% GC)
  mean time:        1.362 μs (1.42% GC)
  maximum time:     68.390 μs (97.88% GC)
  --------------
  samples:          10000
  evals/sample:     24

julia> @benchmark rand(Beta(2, 2), rand(Poisson(10)), 1)
BenchmarkTools.Trial: 
  memory estimate:  149 bytes
  allocs estimate:  1
  --------------
  minimum time:     952.224 ns (0.00% GC)
  median time:      1.166 μs (0.00% GC)
  mean time:        1.174 μs (0.25% GC)
  maximum time:     16.207 μs (93.05% GC)
  --------------
  samples:          10000
  evals/sample:     58

Generating an Nx1 Matrix instead of a Vector seems to avoid several allocations and runs faster. Looking at the output of @code_typed reveals that the compiler inlines the Gamma-based sampling code only in the second case.
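Roughly how I compared the two (a sketch; the exact typed output depends on your Distributions and Julia versions):

using Distributions

d = Beta(2, 2)
@code_typed rand(d, 10)     # Vector method
@code_typed rand(d, 10, 1)  # Nx1 Matrix method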


Similar differences (the matrix version is about 20% faster and uses fewer allocations) can be observed on 1.4, 1.5, and 1.6, but not on 1.0.
(I never change the default optimization level.)

I think this is a benchmarking issue. Beta(2,2) and Poisson(10) are structs, so you are calling a constructor in each of those lines and benchmarking the constructor cost in addition to the vector vs. matrix question. On my computer (Fedora Linux, COPR build of 1.6.0), these benchmarks look different when I make each sampler a function:

sampler1() = rand(Beta(2,2), rand(Poisson(10)))    # 770.4 ns
sampler2() = rand(Beta(2,2), rand(Poisson(10)), 1) # 804.0 ns

The latter sampler does use fewer allocations, but oddly, if I preallocate the Beta(2,2) and Poisson(10) and pass them in as const globals, the allocation counts become the same as well. So I think the answer here is just that this level of micro-benchmarking is tricky.
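For concreteness, a sketch of the const-globals variant I mean (the names B, P, vecsampler, and matsampler are mine):

using Distributions, BenchmarkTools

const B = Beta(2, 2)
const P = Poisson(10)

vecsampler() = rand(B, rand(P))      # Vector result
matsampler() = rand(B, rand(P), 1)   # Nx1 Matrix result

@benchmark vecsampler()
@benchmark matsampler()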


I used an old-school benchmark with the script below

using Distributions

let
  # draw Vectors of Poisson-distributed length until at least n samples
  # have been accumulated, and return the total count
  function vec(n)
    d1 = Beta(2, 2)
    d2 = Poisson(10)
    le = 0
    while true
      (le > n) && break
      le += length(rand(d1, rand(d2)))
    end
    le
  end

  # same loop, but drawing Nx1 Matrices instead of Vectors
  function mat(n)
    d1 = Beta(2, 2)
    d2 = Poisson(10)
    le = 0
    while true
      (le > n) && break
      le += length(rand(d1, (rand(d2), 1)))
    end
    le
  end

  # warm up both functions so compilation time is not measured
  vec(10)
  mat(10)

  tmat, tvec = 0.0, 0.0
  for k in 1:16
    t = time()
    vec(2^k)
    tvec += time() - t

    t = time()
    mat(2^k)
    tmat += time() - t
  end
  println("mat:", tmat)
  println("vec:", tvec)
end

and got those 20% differences. Is there a constructor issue here?

Yes, if we move d1 and d2 out of the functions and pass them as parameters, the difference disappears…
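Concretely, something like this (vec2 and mat2 are just my names for the parameterized versions of the loops above):

# the distributions are now arguments instead of being constructed inside
function vec2(n, d1, d2)
  le = 0
  while true
    (le > n) && break
    le += length(rand(d1, rand(d2)))
  end
  le
end

function mat2(n, d1, d2)
  le = 0
  while true
    (le > n) && break
    le += length(rand(d1, (rand(d2), 1)))
  end
  le
end

# timed the same way as before, e.g. vec2(2^k, Beta(2, 2), Poisson(10))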

You are totally right, I should have preallocated the samplers for the distributions. The vector case @benchmark rand($(sampler(Beta(2, 2))), rand($(sampler(Poisson(10))))) and the corresponding Nx1 matrix version perform the same. Furthermore, the performance is pretty much identical to the in-place version below, with the samplers preallocated via a macro:

using Distributions, Random

# evaluate the distribution expression at macro-expansion time and splice
# the resulting sampler into the code as a constant
macro sampler(exp)
    sampler(eval(exp))
end

function blub!(xs)
    # draw a Poisson length, resize the buffer, and fill it in place
    N = rand(@sampler Poisson(10))
    resize!(xs, N)
    rand!((@sampler Beta(2, 2)), xs)
end
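
A usage sketch (the buffer name xs is arbitrary):

using BenchmarkTools

xs = Float64[]
blub!(xs)              # xs now holds N Beta(2, 2) samples, with N drawn from Poisson(10)
@benchmark blub!($xs)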

So the conclusion here seems to be that the compiler was only able to optimize the matrix version on its own, but manually preallocating the samplers not only makes the difference go away, it also improves performance.
