[ANN] DynamicSampling.jl

Hi all,

I’d like to share with you DynamicSampling.jl, a package I built that implements samplers for dynamic discrete distributions, supporting addition, removal and sampling of elements in constant time. The distribution can be constructed incrementally by adding pairs of indices and weights, e.g.

julia> using DynamicSampling, Random

julia> rng = Xoshiro(42);

julia> sampler = DynamicSampler(rng);

julia> # add indices 1:10 with weights equal to their values
         for i in 1:10
             push!(sampler, i, Float64(i)) # add elements one by one
         end

julia> rand(sampler)
7

julia> delete!(sampler, 8) # deletes 8 from the sampler

It is already used in a project I’m working on, where it results in huge speedups compared to static sampling techniques, which require re-initialization whenever the distribution changes over time. It is also very fast in “worst-case scenarios”: in a comparison between the static methods exported by StatsBase.jl and DynamicSampling.jl, the sampler sometimes even beats the static methods, e.g. consider these two implementations of sampling with and without replacement:

julia> using DynamicSampling

julia> function static_sample_with_replacement(rng, inds, weights, n)
           sp = DynamicSampler(rng)
           append!(sp, inds, weights) # build the sampler once from all indices and weights
           return rand(sp, n)         # draw n indices with replacement
       end

julia> function static_sample_without_replacement(rng, inds, weights, n)
           sp = DynamicSampler(rng)
           append!(sp, inds, weights)
           s = Vector{Int}(undef, n)
           for i in eachindex(s)
               idx = rand(sp)
               delete!(sp, idx) # remove the drawn index so it can't be sampled again
               s[i] = idx
           end
           return s
       end
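
For reference, the static equivalents in StatsBase.jl roughly correspond to its weighted sample methods, something like this (just a sketch of the corresponding calls, not the exact benchmark code):

julia> using StatsBase, Random

julia> rng = Xoshiro(42); inds = 1:10^4; weights = rand(10^4);

julia> sample(rng, inds, Weights(weights), 100; replace=true); # with replacement

julia> sample(rng, inds, Weights(weights), 100; replace=false); # without replacement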

I ran a comparison against the equivalent methods in StatsBase.jl for extracting both small and large samples, and this is what I got:

More info on the benchmark is provided in the README if you’re interested. In the sampling-with-replacement case the comparison also involves the AliasTables.jl package, and in my opinion it is interesting that DynamicSampling.jl is competitive with it, given that the alias method is generally considered the fastest in the static case (I’m tagging @Lilith, who might be interested in this comparison given that they wrote that fantastic library).

9 Likes

Nice. Would this work for randomly sampling vertices or edges from graphs as you edit them?

1 Like

You use a very neat algorithm! I like the concept. Currently it isn’t implemented with constant-time worst cases, but I think it may be possible to make it amortized constant time in the worst case for all three of insertion, deletion, and sampling, which would be very cool. I’m not aware of any implementation that achieves all three.

Re: DynamicSampling being competitive with AliasTables, it’s worth mentioning that AliasTables (which do not support insertion or deletion) can be sampled from significantly faster than DynamicSamplers and are only slightly slower to construct:

julia> using Chairmarks, AliasTables, DynamicSampling

julia> @b append!(DynamicSampler(), 1:1000, rand(1000)) rand(_, 1000)
13.083 μs (7 allocs: 8.094 KiB)

julia> @b AliasTable(rand(1000)) rand(_, 1000)
1.076 μs (3 allocs: 7.875 KiB)

julia> @b rand(1000) append!(DynamicSampler(), 1:1000, _)
7.354 μs (51 allocs: 20.000 KiB)

julia> @b rand(1000) AliasTable
9.681 μs (2 allocs: 16.031 KiB)

Imperfections aside, I want to be clear that I admire this algorithmic approach.

3 Likes

@Tortar, you may find this article interesting: https://www.researchgate.net/publication/250428390_Optimal_Algorithms_for_Generating_Discrete_Random_Variables_With

1 Like

I think the answer is yes: indeed you can remove/add an index representing an edge/vertex at any time. I have also implemented setindex!(sampler, index, new_weight) if you want to mutate the weight of an already present index in constant time (internally it just removes the index and adds it again with the new weight). Though you need to define a mapping from indices to vertices/edges, which could be somewhat tricky.
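
For instance, here is a minimal sketch of what such a mapping could look like for edges (the id_to_edge/edge_to_id dictionaries are just illustrative bookkeeping, not part of the package):

julia> using DynamicSampling

julia> sampler = DynamicSampler();

julia> edges = [(1, 2), (2, 3), (1, 3)]; # edges of a small graph

julia> id_to_edge = Dict{Int,Tuple{Int,Int}}(); edge_to_id = Dict{Tuple{Int,Int},Int}();

julia> for (i, e) in enumerate(edges)
           id_to_edge[i] = e       # index -> edge
           edge_to_id[e] = i       # edge -> index
           push!(sampler, i, 1.0)  # one index per edge, uniform weights here
       end

julia> id_to_edge[rand(sampler)]; # sample a random edge

julia> delete!(sampler, edge_to_id[(2, 3)]); # the edge (2, 3) was removed from the graph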

1 Like

Thanks a lot @Lilith for the compliments, and for helping me improve the package by opening those issues! Yes, you are right: it isn’t currently constant time for all inputs (even if I think the probability of encountering one of those cases in practice is low). I’m reading the article at the moment; it seems interesting, thank you a lot for that as well.

Regarding performance, I found AliasTables and DynamicSampling approximately on par for distributions with larger domains (e.g. 10^7 indices in your case), but the more you reduce the number of indices, the more AliasTables.jl shines :slight_smile:
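
If you want to check that scaling yourself, the snippet above can be adapted to a larger domain, e.g. something like this (timings omitted since they depend on the machine):

julia> using Chairmarks, AliasTables, DynamicSampling

julia> n = 10^7; w = rand(n);

julia> @b append!(DynamicSampler(), 1:n, w) rand(_, 1000)

julia> @b AliasTable(w) rand(_, 1000)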