AliasTables.jl

Lilith · April 12, 2024, 3:58pm

Heyo!

I’m announcing a state-of-the-art, high-performance implementation of the alias method for non-uniform categorical random sampling over 1:n, with O(1) sampling time and O(n) setup time.

Docs: https://aliastables.lilithhafner.com
Source: GitHub - LilithHafner/AliasTables.jl: An efficient sampler for discrete random variables

Enjoy

Happy to hear feedback and pointers to other implementations of this algorithm

Here’s a performance teaser

julia> using AliasTables, Chairmarks

julia> at = AliasTable(rand(17));

julia> @b rand($at)
2.878 ns

julia> @b rand(UInt64)
2.723 ns

julia> @b rand(1:17)
3.414 ns
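
For readers who haven't met the alias method mentioned above, here is a minimal textbook sketch (Vose's variant) using only the Random standard library. It is an illustration, not the AliasTables.jl implementation: `SimpleAliasTable` and `draw` are hypothetical names made up for this example, and it uses plain floating-point probabilities for simplicity.

```julia
using Random

# Textbook alias table (hypothetical type, not part of AliasTables.jl).
# Column i keeps outcome i with probability prob[i], otherwise yields alias[i].
struct SimpleAliasTable
    prob::Vector{Float64}
    alias::Vector{Int}
end

# O(n) setup: pair each under-full column with an over-full one.
function SimpleAliasTable(weights::AbstractVector{<:Real})
    n = length(weights)
    n > 0 || throw(ArgumentError("weights must be non-empty"))
    all(>=(0), weights) || throw(ArgumentError("found negative weight"))
    total = sum(weights)
    total > 0 || throw(ArgumentError("all weights are zero"))

    scaled = Float64[w * n / total for w in weights]  # mean is 1
    prob = ones(Float64, n)
    alias = collect(1:n)
    small = [i for i in 1:n if scaled[i] < 1]
    large = [i for i in 1:n if scaled[i] >= 1]

    while !isempty(small) && !isempty(large)
        s = pop!(small)
        l = pop!(large)
        prob[s] = scaled[s]          # column s keeps s with this probability
        alias[s] = l                 # and is topped up by outcome l
        scaled[l] -= 1 - scaled[s]   # l donated the missing mass
        push!(scaled[l] < 1 ? small : large, l)
    end
    # Columns left over (including any left by rounding) keep prob = 1.
    return SimpleAliasTable(prob, alias)
end

# O(1) sampling: pick a column uniformly, then keep it or take its alias.
function draw(rng::AbstractRNG, t::SimpleAliasTable)
    i = rand(rng, 1:length(t.prob))
    return rand(rng) < t.prob[i] ? i : t.alias[i]
end

t = SimpleAliasTable([1, 2, 1])
samples = [draw(Random.default_rng(), t) for _ in 1:10]
```

Each draw costs one uniform index and one comparison no matter how large n is, which is where the O(1) sampling time comes from.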

TheCedarPrince · April 12, 2024, 4:32pm

Oh my gosh, new @Lilith package just dropped!

Out of curiosity, how does this compare to the sampling methods over in Statistics or StatsBase? I have no idea how they compare or how I would make a comparison so apologies for the vague question. I tend to just use the sample method from StatsBase so was curious.

Always excited to see whenever you have a new experimental package up!

Cheers,

~ tcp

vancleve · April 12, 2024, 8:14pm

It's a very nice improvement!

github.com/JuliaStats/StatsBase.jl

Use a faster and safer implementation of `alias_sample!`

JuliaStats:master ← LilithHafner:lh/alias-tables

opened 02:18PM - 08 Apr 24 UTC

LilithHafner

+76 -73

Safer:

Before

```julia
julia> StatsBase.alias_sample!(rand(10), weights(randn(10)), rand(10))
10-element Vector{Float64}:
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.5676653052762575
 0.1984287484280587
 0.5676653052762575
 0.1984287484280587
 0.8567391687334422

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10))
ERROR: BoundsError: attempt to access 10-element Vector{Float64} at index [281471800382896] # This came from reading undef memory
Stacktrace:
 [1] throw_boundserror(A::Vector{Float64}, I::Tuple{Int64})
   @ Base ./essentials.jl:14
 [2] getindex
   @ ./essentials.jl:891 [inlined]
 [3] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Int64, Int64, Vector{Int64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/packages/StatsBase/ebrT3/src/sampling.jl:729
 [4] top-level scope
   @ REPL[10]:1

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10)) # Got "lucky" this time
10-element Vector{Float64}:
 0.07577419536126007
 0.9233876530591941
 0.1530016664475169
 0.07577419536126007
 0.07577419536126007
 0.3159766423652197
 0.1530016664475169
 0.6780730450968911
 0.01012788415877619
 0.6780730450968911
```

After

```julia
julia> StatsBase.alias_sample!(rand(10), weights(randn(10)), rand(10))
ERROR: ArgumentError: found negative weight -0.1164833812103052
Stacktrace:
 [1] checked_sum
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:437 [inlined]
 [2] AliasTables.AliasTable{UInt64, Int64}(weights::Weights{Float64, Float64, Vector{Float64}}; _normalize::Bool)
   @ AliasTables ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:81
 [3] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:78 [inlined]
 [4] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:76 [inlined]
 [5] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Float64, Float64, Vector{Float64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/dev/StatsBase/src/sampling.jl:719
 [6] top-level scope
   @ REPL[33]:1

julia> StatsBase.alias_sample!(rand(10), weights(fill(0, 10)), rand(10))
ERROR: ArgumentError: all weights are zero
Stacktrace:
 [1] checked_sum
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:418 [inlined]
 [2] AliasTables.AliasTable{UInt64, Int64}(weights::Weights{Int64, Int64, Vector{Int64}}; _normalize::Bool)
   @ AliasTables ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:81
 [3] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:78 [inlined]
 [4] AliasTable
   @ ~/.julia/packages/AliasTables/yt2Qj/src/AliasTables.jl:76 [inlined]
 [5] alias_sample!(rng::TaskLocalRNG, a::Vector{Float64}, wv::Weights{Int64, Int64, Vector{Int64}}, x::Vector{Float64})
   @ StatsBase ~/.julia/dev/StatsBase/src/sampling.jl:719
 [6] top-level scope
   @ REPL[38]:1
```

Faster (benchmarks from https://github.com/JuliaStats/StatsBase.jl/issues/695#issuecomment-853816909):

Before

```julia
julia> using Chairmarks

julia> @b sample(1:5030, StatsBase.Weights(rand(Float32,5030)), 141230, replace=true)
1.558 ms (19 allocs: 1.251 MiB)

julia> @b sample(1:5030, StatsBase.Weights(rand(Float64,5030)), 141230, replace=true)
1.575 ms (19 allocs: 1.270 MiB)
```

After

```julia
julia> @b sample(1:5030, StatsBase.Weights(rand(Float32,5030)), 141230, replace=true)
294.460 μs (12 allocs: 1.260 MiB)

julia> @b sample(1:5030, StatsBase.Weights(rand(Float64,5030)), 141230, replace=true)
296.418 μs (12 allocs: 1.280 MiB)
```

Closes #630 (AliasTables.jl [uses `@inbounds` in sampling](https://github.com/LilithHafner/AliasTables.jl/blob/cb7b2d64bae60b92931cd1ddadfcfd20a8d0ba91/src/AliasTables.jl#L255) and contains a [correctness proof](https://github.com/LilithHafner/AliasTables.jl/blob/cb7b2d64bae60b92931cd1ddadfcfd20a8d0ba91/src/AliasTables.jl#L260-L300) that relies only on local information and basic properties of unsigned integers)

Closes #916 by making `alias_sample!` much faster than even using `ifelse` would. See https://aliastables.lilithhafner.com/dev/#Implementation-details for how.

See also: https://github.com/JuliaStats/Distributions.jl/pull/1848

cc @devmotion

Thanks @Lilith for the hard work!

Lilith · April 12, 2024, 9:59pm

The PR @vancleve linked to has more details, but the TL;DR is that it’s almost always faster, so StatsBase.jl now uses it, and Distributions.jl likely will soon (see “Use a faster implementation of AliasTables” by LilithHafner, Pull Request #1848, JuliaStats/Distributions.jl).

Some reasons to use AliasTables.jl directly are reducing load time and dependencies, or implementing very high performance samplers for downstream types that take advantage of implementation details of the underlying RNG (e.g. Xoshiro’s bulk generation).
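
As a rough sketch of what direct use and a side-by-side comparison might look like, assuming AliasTables.jl, StatsBase.jl and Chairmarks.jl are all installed: the weights below are arbitrary example data, and timings depend on machine and package versions, so none are shown.

```julia
using AliasTables, StatsBase, Chairmarks, Random

w = rand(17)               # arbitrary example weights
at = AliasTable(w)         # O(n) setup, done once and then reused

x = rand(at)               # one sample from 1:17
xs = rand(at, 5)           # a vector of 5 samples
y = rand(Xoshiro(42), at)  # same, with an explicit RNG

# Per-sample cost with the table reused, versus StatsBase's
# single weighted draw over the same weights.
wv = Weights(w)
@b rand($at)
@b sample($wv)
```

The two benchmarks are not a like-for-like measurement, since the alias table's setup cost is paid up front, so they are best read as a starting point for your own workload.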
