[ANN] StreamSampling.jl - Sampling methods for data streams

Hi all!

I’m happy to announce the new package IteratorSampling.jl, which can help to sample from arbitrary iterables in a single pass through the data. It is mainly based on the theory of reservoir sampling methods.

Since reservoir sampling doesn’t need to collect the iterable in memory, it is faster than using StatsBase.sample in some cases as you can see in the (contrived) example in the ReadMe of the package; this means it can also be useful in some dynamic simulation scenarios where one can’t assume much about the population to sample and so scanning all the iterable is needed.

I plan to add some more features in the future such as weighted sampling methods and the possibility to resume the sampling preserving the unbiasedness of the sample. If you have suggestions on some more features, they are really welcomed!

For those interested, the package is already available in the general registry :slight_smile:

24 Likes

Worth noting that OnlineStats.jl also has reservoir sampling.

5 Likes

Thanks @cydoris! didn’t find it when I searched in the Julia ecosystem!

I did a little benchmark to compare the implementation performance:

julia> using IteratorSampling, OnlineStats, BenchmarkTools

julia> iter = 1:10^7; # length is known

julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
  29.043 ms (3 allocations: 78.28 KiB)

julia> @btime itsample($iter, 10^4);
  2.760 ms (14 allocations: 381.29 KiB)

julia> iter = Iterators.filter(x -> x != 10, 1:10^7); # length is not known

julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
  33.474 ms (3 allocations: 78.28 KiB)

julia> @btime itsample($iter, 10^4);
  7.931 ms (2 allocations: 78.17 KiB)

the speed difference is due to the usage of a different algorithm when the length is known and the use of an optimized implementation for reservoir sampling (Vitter’s algorithm L) when it is not, see the docs API · IteratorSampling.jl for a little more details.

I noticed also that the implementation there covers only unweighted sampling without replacement, in IteratorSampling you have a “general” function itsample([rng], iter, n::Int; replace = false, ordered = false) which mimics StatsBase.sample and (hopefully in the future) weighted sampling available and some more features useful in stream sampling.

I will ask the authors if they would like to have IteratorSampling.jl as a dependency in the future :slight_smile:

4 Likes

A thing I have repeatedly needed and never found a good library for is multi-threading support, i.e. the items are produced / handled in multiple threads (which can be created and destroyed during the lifetime of the reservoir).

One would need support for splitting off from a reservoir, and support for explicitly rejoining a split-off reservoir, i.e. no synchronization at all on consuming a single sample. The meat of code for that feature would be the code for joining reservoirs.

An example use would be allocation profiling: Typical allocations only pay a thread/core-local increment and well-predicted branch, and spinning up or down threads involves locks anyways. This is cheap enough to be basically free.

1 Like

I’m happy to announce version 0.3 of the package!

Now the package has all the basic building blocks I wanted to add when I conceived it.

Apart from the already available itsample, it now allows to control the sampling process from the “outside” by instantiating a ReservoirSample which then can be updated with the update! function:

julia> using StreamSampling

julia> rs = ReservoirSample(Int, 5);

julia> for x in 1:100
           update!(rs, x)
       end

julia> value(rs)
5-element Vector{Int64}:
  7
  9
 20
 49
 74

Also, new weighted sampling algorithms were added both for sampling with and without replacement, which means that all classical sampling procedures are now implemented for arbitrary data streams.

It would be even possible to integrate itsample in StatsBase for discoverability purposes, the StatsBase.sample function could be then used on any kind of iterable, not only on AbstractArray, but this would mean that the user needs to be a bit more careful because collecting the iterators is usually more performant if one needs multiple samples from it. If anyone has thoughts on this please let me know!

I also made a little benchmark comparison with StatsBase.sample on a simple iterator:

(where “collection-based with setup” means that collecting the iterator in memory is considered part of the benchmark)


Importantly, I renamed the package since I thought that StreamSampling.jl was more appropriate. I can’t anymore edit comments referring to the previous name, hopefully it shouldn’t cause too much confusion.

7 Likes

Some updates for v0.5.0:

Let me know if you have some suggestions for improvements, I was mostly focused on reservoir sampling techniques, but I think I could maybe enlarge a bit more the scope of the library.

Interestingly it is actually one of the few open source libraries which tries to implement these techniques, I only found GitHub - bigmlcom/sampling: Random Sampling in Clojure as a similar attempt.

9 Likes

I released a new version which tweaks a bit the API (v0.6), at the same time I implemented two new (but old :smiley:) algorithms for stream sampling without a reservoir, so that they require O(1) memory for unweighted sampling with and without replacement. They are algorithm D by Vitter and algorithm 4 by Bentley, all brought back from research in the eighties.

1 Like