I am trying to draw a random index from a vector of probabilities. Here is my code:
import Random
using Random: AbstractRNG
struct FastCategorical
    _cdf::Vector{Float64}
    function FastCategorical(pdf)
        cdf = cumsum(pdf)
        @assert last(cdf) ≈ 1
        new(cdf)
    end
end
@inline function Random.rand(rng::AbstractRNG, o::FastCategorical)
    p = rand(rng)
    searchsortedfirst(o._cdf, p)
end
using BenchmarkTools
using LinearAlgebra
pdf = normalize!(rand(10^5), 1)
d = FastCategorical(pdf)
rng = Random.GLOBAL_RNG
@btime rand($rng, $d) # 128.708 ns (0 allocations: 0 bytes)
@btime searchsortedfirst($(d._cdf), x) setup=(x = rand()) # 39.083 ns (0 allocations: 0 bytes)
@btime rand($(rng)) # 2.865 ns (0 allocations: 0 bytes)
Observe that rand is much slower than running both of its lines individually.
Apparently this problem has little to do with the structure of your code:
using BenchmarkTools, LinearAlgebra
function test()
    cdf = cumsum(normalize!(rand(10^5), 1))
    println("Summed times")
    @btime searchsortedfirst($cdf, rand())
    println("Single times")
    @btime searchsortedfirst($cdf, x) setup=(x=rand())
    @btime rand()
end
test()
--------
Summed times
187.690 ns (0 allocations: 0 bytes)
Single times
56.369 ns (0 allocations: 0 bytes)
4.125 ns (0 allocations: 0 bytes)
Thanks to @giordano for making the testing rigorous.
julia> function test3()
           cdf = cumsum(normalize!(rand(10^5), 1))
           f = rand()
           @btime searchsortedfirst($cdf, $(rand()))
           @btime searchsortedfirst($cdf, $f)
           @btime searchsortedfirst($cdf, $(pi/10))
           @btime rand()
       end
test3 (generic function with 1 method)
julia> test3()
44.017 ns (0 allocations: 0 bytes)
44.198 ns (0 allocations: 0 bytes)
44.163 ns (0 allocations: 0 bytes)
3.012 ns (0 allocations: 0 bytes)
Cf. the PSA: Microbenchmarks remember branch history.
It is even worse here, since your cdf does not fit into L1 cache. You can probably get a significant speedup by using a cache-oblivious layout of the implicit search tree instead of a linearly ordered vector.
Unfortunately, I am not aware of any Julia packages implementing a search-optimized layout of sorted lists.
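To make the idea concrete, here is a minimal sketch of one such layout: the implicit-tree "Eytzinger" (BFS) order, which keeps the first few levels of the search tree packed together in memory. The names `EytzingerCDF` and `eytzinger_searchsortedfirst` are made up for illustration; this is not from any package.

```julia
# Sketch: store the sorted cdf in BFS ("Eytzinger") order so that a
# binary search touches the hot top levels of the tree in a compact,
# cache-friendlier pattern.
struct EytzingerCDF
    vals::Vector{Float64}  # cdf values, in BFS order of the implicit tree
    rank::Vector{Int}      # original sorted index of each stored value
end

function EytzingerCDF(sorted::Vector{Float64})
    n = length(sorted)
    vals = Vector{Float64}(undef, n)
    rank = Vector{Int}(undef, n)
    i = Ref(1)
    # In-order traversal of the BFS-indexed tree consumes `sorted` in order.
    place!(k) = if k <= n
        place!(2k)
        vals[k] = sorted[i[]]
        rank[k] = i[]
        i[] += 1
        place!(2k + 1)
    end
    place!(1)
    return EytzingerCDF(vals, rank)
end

# Equivalent of searchsortedfirst: smallest sorted index with value >= x,
# or n + 1 if every value is < x.
function eytzinger_searchsortedfirst(e::EytzingerCDF, x)
    n = length(e.vals)
    k = 1
    @inbounds while k <= n
        k = 2k + (e.vals[k] < x)   # branch-free descent: left if >= x, right if < x
    end
    k >>= trailing_ones(k) + 1     # back up past the trailing right turns
    return k == 0 ? n + 1 : e.rank[k]
end
```

The descent itself has no data-dependent branch, so it should be insensitive to the branch-history effect discussed above; whether it beats `searchsortedfirst` in practice depends on the vector size relative to the cache.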
@giordano thanks! My mistake, pi/10 also must be interpolated,
but rand() was intentionally left uninterpolated, so that it is evaluated on every call.
Edited:
@foobar_lv2's explanatory post is right on the money.
I was oversimplifying too much and ended up making the test pointless; I edited it according to your suggestions, thanks.
Sidestepping the whole optimization question: if you want a lot of draws from a discrete/categorical distribution, it is worth building an alias table.
It is implemented in Distributions.jl.
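For reference, here is a minimal hand-rolled sketch of the classic Walker/Vose alias-table construction (O(n) setup, O(1) per draw). This is an illustration of the technique, not the actual Distributions.jl implementation, and the names `AliasTable` and `draw` are mine.

```julia
using Random

# Walker/Vose alias method: O(n) table build, O(1) per draw.
struct AliasTable
    accept::Vector{Float64}  # acceptance probability for each bin
    alias::Vector{Int}       # fallback index when the bin rejects
end

function AliasTable(p::Vector{Float64})
    n = length(p)
    accept = p .* (n / sum(p))   # scale weights so the average bin is 1
    alias = collect(1:n)
    small = [i for i in 1:n if accept[i] < 1.0]
    large = [i for i in 1:n if accept[i] >= 1.0]
    while !isempty(small) && !isempty(large)
        s, l = pop!(small), pop!(large)
        alias[s] = l                  # bin s is topped up from index l
        accept[l] -= 1.0 - accept[s]  # l gave away that much mass
        push!(accept[l] < 1.0 ? small : large, l)
    end
    return AliasTable(accept, alias)
end

# One draw: pick a bin uniformly, then either accept it or take its alias.
function draw(rng::AbstractRNG, t::AliasTable)
    i = rand(rng, 1:length(t.accept))
    return rand(rng) < t.accept[i] ? i : t.alias[i]
end
```

The per-draw cost is two uniform numbers and one comparison, independent of n, which is why this wins over any binary search once you need many samples from the same distribution.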
But why is the @btime searchsortedfirst($cdf, x) setup=(x=rand()) fast? You're telling me that the CPU learns to predict the searchsortedfirst branches from the rand call??
Because you also need to set evals=1. Without it, the setup runs once per sample while the body is evaluated hundreds of times with the same x, so the branch predictor warms up on one fixed search path. This is IMO a design flaw in the BenchmarkTools API, but currently that’s how it works.
julia> @benchmark searchsortedfirst($cdf, x) setup=(x=rand())
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 66.402 ns (0.00% GC)
median time: 71.518 ns (0.00% GC)
mean time: 71.189 ns (0.00% GC)
maximum time: 152.288 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 978
julia> @benchmark searchsortedfirst($cdf, x) setup=(x=rand()) evals=1
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 149.000 ns (0.00% GC)
median time: 283.000 ns (0.00% GC)
mean time: 297.651 ns (0.00% GC)
maximum time: 11.499 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia> g(a)=searchsorted(a, rand())
g (generic function with 2 methods)
julia> @benchmark g($cdf)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 238.247 ns (0.00% GC)
median time: 245.880 ns (0.00% GC)
mean time: 249.613 ns (0.00% GC)
maximum time: 5.811 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 425