I’m happy to announce the new package IteratorSampling.jl, which can help to sample from arbitrary iterables in a single pass through the data. It is mainly based on the theory of reservoir sampling methods.
Since reservoir sampling doesn’t need to collect the iterable in memory, it is faster than using StatsBase.sample in some cases as you can see in the (contrived) example in the ReadMe of the package; this means it can also be useful in some dynamic simulation scenarios where one can’t assume much about the population to sample and so scanning all the iterable is needed.
I plan to add some more features in the future such as weighted sampling methods and the possibility to resume the sampling preserving the unbiasedness of the sample. If you have suggestions on some more features, they are really welcomed!
For those interested, the package is already available in the general registry
Thanks @cydoris! didn’t find it when I searched in the Julia ecosystem!
I did a little benchmark to compare the implementation performance:
julia> using IteratorSampling, OnlineStats, BenchmarkTools
julia> iter = 1:10^7; # length is known
julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
29.043 ms (3 allocations: 78.28 KiB)
julia> @btime itsample($iter, 10^4);
2.760 ms (14 allocations: 381.29 KiB)
julia> iter = Iterators.filter(x -> x != 10, 1:10^7); # length is not known
julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
33.474 ms (3 allocations: 78.28 KiB)
julia> @btime itsample($iter, 10^4);
7.931 ms (2 allocations: 78.17 KiB)
the speed difference is due to the usage of a different algorithm when the length is known and the use of an optimized implementation for reservoir sampling (Vitter’s algorithm L) when it is not, see the docs API · IteratorSampling.jl for a little more details.
I noticed also that the implementation there covers only unweighted sampling without replacement, in IteratorSampling you have a “general” function itsample([rng], iter, n::Int; replace = false, ordered = false) which mimics StatsBase.sample and (hopefully in the future) weighted sampling available and some more features useful in stream sampling.
I will ask the authors if they would like to have IteratorSampling.jl as a dependency in the future
A thing I have repeatedly needed and never found a good library for is multi-threading support, i.e. the items are produced / handled in multiple threads (which can be created and destroyed during the lifetime of the reservoir).
One would need support for splitting off from a reservoir, and support for explicitly rejoining a split-off reservoir, i.e. no synchronization at all on consuming a single sample. The meat of code for that feature would be the code for joining reservoirs.
An example use would be allocation profiling: Typical allocations only pay a thread/core-local increment and well-predicted branch, and spinning up or down threads involves locks anyways. This is cheap enough to be basically free.
Now the package has all the basic building blocks I wanted to add when I conceived it.
Apart from the already available itsample, it now allows to control the sampling process from the “outside” by instantiating a ReservoirSample which then can be updated with the update! function:
julia> using StreamSampling
julia> rs = ReservoirSample(Int, 5);
julia> for x in 1:100
update!(rs, x)
end
julia> value(rs)
5-element Vector{Int64}:
7
9
20
49
74
Also, new weighted sampling algorithms were added both for sampling with and without replacement, which means that all classical sampling procedures are now implemented for arbitrary data streams.
It would be even possible to integrate itsample in StatsBase for discoverability purposes, the StatsBase.sample function could be then used on any kind of iterable, not only on AbstractArray, but this would mean that the user needs to be a bit more careful because collecting the iterators is usually more performant if one needs multiple samples from it. If anyone has thoughts on this please let me know!
I also made a little benchmark comparison with StatsBase.sample on a simple iterator:
(where “collection-based with setup” means that collecting the iterator in memory is considered part of the benchmark)
Importantly, I renamed the package since I thought that StreamSampling.jl was more appropriate. I can’t anymore edit comments referring to the previous name, hopefully it shouldn’t cause too much confusion.
The library now supports multithreaded merging of reservoirs as @foobar_lv2 suggested with merge/merge! and it allows to creates a single sample from multiple iterables in parallel with the itsample function. It works only with (weighted) sampling with replacement though.
The API is now in line with the one of OnlineStatsBase.jl and the internals are much more polished.
Let me know if you have some suggestions for improvements, I was mostly focused on reservoir sampling techniques, but I think I could maybe enlarge a bit more the scope of the library.
I released a new version which tweaks a bit the API (v0.6), at the same time I implemented two new (but old ) algorithms for stream sampling without a reservoir, so that they require O(1) memory for unweighted sampling with and without replacement. They are algorithm D by Vitter and algorithm 4 by Bentley, all brought back from research in the eighties.