[ANN] DynamicSampling.jl

Hi all,

I’d like to share with you DynamicSampling.jl, a package I built which implements samplers that can be used to sample from a dynamic discrete distribution supporting removal, addition and sampling of elements in constant time. The discrete distribution can be constructed incrementally by adding pairs of indices and weights, e.g.

julia> using DynamicSampling, Random

julia> rng = Xoshiro(42);

julia> sampler = DynamicSampler(rng);

julia> # the sampler contains indices
         for i in 1:10
             push!(sampler, i, Float64(i)) # add elements one by one
         end

julia> rand(sampler)
7

julia> delete!(sampler, 8) # deletes 8 from the sampler

It is already used in a project I’m working on and it results in huge speedups in respect of using static sampling techniques which require re-initialization if the distribution changes over time. Actually it is very fast also in “worst-case scenarios”: in a comparison between the static methods exported by StatsBase.jl and DynamicSampling.jl, the sampler is even better sometimes than the static methods when compared with them, e.g. consider these two implementation of sample with replacement and sample without replacement:

julia> using DynamicSampling

julia> function static_sample_with_replacement(rng, inds, weights, n)
           sp = DynamicSampler(rng)
           append!(sp, inds, weights)
           return rand(sp, n)
       end

julia> function static_sample_without_replacement(rng, inds, weights, n)
           sp = DynamicSampler(rng)
           append!(sp, inds, weights)
           s = Vector{Int}(undef, n)
           for i in eachindex(s)
               idx = rand(sp)
               delete!(sp, idx)
               s[i] = idx
           end
           return s
       end

I run a comparison between the equivalent methods in StatsBase.jl for extracting small and big samples and this is what I got

More info on the benchmark are provided in the ReadMe if interested. The comparison in the case of sampling with replacement hits the AliasTables.jl package, and it is interesting that DynamicSampling.jl is competittive with it in my opinion given that the alias method is generally considered the fastest in the static case (I tag @Lilith who could maybe be interested in this comparison given that they wrote that fantastic library)

7 Likes

Nice. Would this work for randomly sampling vertices or edges from graphs as you edit them?

You use a very neat algorithm! I like the concept. Currently, it isn’t implemented with constant time worst cases (e.g., e.g.), but I think it may be possible to make it amortized constant time in the worst case for all three of insertion, deletion, and sampling which would be very cool. I’m not aware of any implementation that achieves all three.

Re: DynamicSampling being competitive with AliasTables, it’s worth mentioning that AliasTables (which do not support insertion or deletion) can be sampled from significantly faster than DynamicSamplers and are only slightly slower to construct:

julia> @b append!(DynamicSampler(), 1:1000, rand(1000)) rand(_, 1000)
13.083 μs (7 allocs: 8.094 KiB)

julia> @b AliasTable(rand(1000)) rand(_, 1000)
1.076 μs (3 allocs: 7.875 KiB)

julia> @b rand(1000) append!(DynamicSampler(), 1:1000, _)
7.354 μs (51 allocs: 20.000 KiB)

julia> @b rand(1000) AliasTable
9.681 μs (2 allocs: 16.031 KiB)

Imperfections aside, I want to be clear that I admire this algorithmic approach.

1 Like

@Tortar, you may find this article interesting: https://www.researchgate.net/publication/250428390_Optimal_Algorithms_for_Generating_Discrete_Random_Variables_With