[ANN] StreamSampling.jl - Sampling methods for data streams

Tortar · January 13, 2024, 7:40pm

Hi all!

I’m happy to announce the new package IteratorSampling.jl, which can help to sample from arbitrary iterables in a single pass through the data. It is mainly based on the theory of reservoir sampling methods.

Since reservoir sampling doesn’t need to collect the iterable in memory, it is faster than StatsBase.sample in some cases, as you can see in the (contrived) example in the ReadMe of the package. This means it can also be useful in dynamic simulation scenarios where one can’t assume much about the population to sample, so scanning the whole iterable is needed.
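
For readers new to the idea, here is a minimal sketch of the textbook single-pass scheme (often called Algorithm R). It is only an illustration of the principle, not the package's implementation, and the name `reservoir_sample_sketch` is made up for this example.

```julia
using Random

# Algorithm R: keep the first n items, then let the i-th item (i > n)
# replace a random slot with probability n / i.
function reservoir_sample_sketch(rng::AbstractRNG, iter, n::Int)
    reservoir = Vector{eltype(iter)}()
    sizehint!(reservoir, n)
    for (i, x) in enumerate(iter)
        if i <= n
            push!(reservoir, x)
        else
            j = rand(rng, 1:i)      # uniform position among the i items seen so far
            if j <= n
                reservoir[j] = x    # happens with probability n / i
            end
        end
    end
    return reservoir
end

# The iterable is never collected and its length is never queried.
sample_of_odds = reservoir_sample_sketch(Xoshiro(42), Iterators.filter(isodd, 1:1000), 5)
```

If the stream holds fewer than `n` items, the sketch simply returns all of them.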

I plan to add some more features in the future, such as weighted sampling methods and the possibility to resume the sampling while preserving the unbiasedness of the sample. If you have suggestions for more features, they are very welcome!

For those interested, the package is already available in the General registry.

cjdoris · January 15, 2024, 9:08am

Worth noting that OnlineStats.jl also has reservoir sampling.

Tortar · January 15, 2024, 11:12am

Thanks @cjdoris! I didn’t find it when I searched the Julia ecosystem!

I did a little benchmark to compare the implementation performance:

julia> using IteratorSampling, OnlineStats, BenchmarkTools

julia> iter = 1:10^7; # length is known

julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
  29.043 ms (3 allocations: 78.28 KiB)

julia> @btime itsample($iter, 10^4);
  2.760 ms (14 allocations: 381.29 KiB)

julia> iter = Iterators.filter(x -> x != 10, 1:10^7); # length is not known

julia> @btime fit!(ReservoirSample(10^4, Int), $iter);
  33.474 ms (3 allocations: 78.28 KiB)

julia> @btime itsample($iter, 10^4);
  7.931 ms (2 allocations: 78.17 KiB)

The speed difference is due to the use of a different algorithm when the length is known, and of an optimized implementation of reservoir sampling (Vitter’s algorithm L) when it is not; see the docs, API · IteratorSampling.jl, for a few more details.
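
To give an idea of why a skip-based scheme is cheaper than drawing one random number per item, here is a rough sketch of the scheme commonly called Algorithm L. It is written from the usual textbook description, not taken from the package, and `algorithm_l_sketch` is a hypothetical name.

```julia
using Random

# Skip-based reservoir sampling: after the reservoir is full, draw how many
# items to skip before the next replacement instead of testing every item.
function algorithm_l_sketch(rng::AbstractRNG, iter, n::Int)
    n >= 1 || throw(ArgumentError("n must be positive"))
    reservoir = Vector{eltype(iter)}()
    sizehint!(reservoir, n)
    w = 1.0
    to_skip = 0                                  # items to pass over before the next replacement
    for x in iter
        if length(reservoir) < n
            push!(reservoir, x)
            if length(reservoir) == n
                w = exp(-randexp(rng) / n)       # same distribution as rand()^(1 / n)
                to_skip = floor(Int, -randexp(rng) / log1p(-w))
            end
        elseif to_skip > 0
            to_skip -= 1
        else
            reservoir[rand(rng, 1:n)] = x        # overwrite a uniformly chosen slot
            w *= exp(-randexp(rng) / n)
            to_skip = floor(Int, -randexp(rng) / log1p(-w))
        end
    end
    return reservoir
end
```

The number of random draws grows roughly with n * log(N / n) rather than with N, which is where most of the saving comes from on long streams.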

I also noticed that the implementation there covers only unweighted sampling without replacement. In IteratorSampling you have a “general” function itsample([rng], iter, n::Int; replace = false, ordered = false) which mimics StatsBase.sample, and (hopefully in the future) there will be weighted sampling and some more features useful in stream sampling.

I will ask the authors if they would like to have IteratorSampling.jl as a dependency in the future.

foobar_lv2 · January 15, 2024, 2:59pm

A thing I have repeatedly needed and never found a good library for is multi-threading support, i.e. the items are produced / handled in multiple threads (which can be created and destroyed during the lifetime of the reservoir).

One would need support for splitting off from a reservoir, and support for explicitly rejoining a split-off reservoir, i.e. no synchronization at all on consuming a single sample. The meat of code for that feature would be the code for joining reservoirs.
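
As a rough illustration of what joining could look like for unweighted sampling without replacement, here is a sketch. It assumes each thread also tracks how many items it has seen (`n1`, `n2`) and that each reservoir holds `min(k, n)` items; the function name is hypothetical and this is not an API of any package mentioned in this thread.

```julia
using Random

# Merge two uniform samples without replacement, taken from disjoint streams
# of n1 and n2 items, into one uniform sample of at most k items.
function merge_reservoirs_sketch(rng::AbstractRNG, a::Vector{T}, n1::Int,
                                 b::Vector{T}, n2::Int, k::Int) where {T}
    # Shuffled copies: pop! then removes a uniformly random remaining item.
    a, b = shuffle(rng, a), shuffle(rng, b)
    merged = T[]
    while length(merged) < k && n1 + n2 > 0
        # The next item comes from stream 1 with probability n1 / (n1 + n2).
        if rand(rng, 1:(n1 + n2)) <= n1
            push!(merged, pop!(a))
            n1 -= 1
        else
            push!(merged, pop!(b))
            n2 -= 1
        end
    end
    return merged
end
```

No synchronization is needed while the threads are sampling; the only shared step is this join, which costs O(k).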

An example use would be allocation profiling: Typical allocations only pay a thread/core-local increment and well-predicted branch, and spinning up or down threads involves locks anyways. This is cheap enough to be basically free.

Tortar · April 20, 2024, 9:22pm

I’m happy to announce version 0.3 of the package!

Now the package has all the basic building blocks I wanted to add when I conceived it.

Apart from the already available itsample, it now allows controlling the sampling process from the “outside” by instantiating a ReservoirSample, which can then be updated with the update! function:

julia> using StreamSampling

julia> rs = ReservoirSample(Int, 5);

julia> for x in 1:100
           update!(rs, x)
       end

julia> value(rs)
5-element Vector{Int64}:
  7
  9
 20
 49
 74

Also, new weighted sampling algorithms were added both for sampling with and without replacement, which means that all classical sampling procedures are now implemented for arbitrary data streams.
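
For intuition, one classical way to do weighted sampling without replacement on a stream is the key-based scheme of Efraimidis and Spirakis. The sketch below illustrates that idea only; it is not necessarily the algorithm the package uses, and `weighted_reservoir_sketch` and its `weight` argument are names invented for this example.

```julia
using Random

# Each item gets the key u^(1 / w), with u uniform in (0, 1) and w its weight;
# the n items with the largest keys form the sample.
function weighted_reservoir_sketch(rng::AbstractRNG, iter, weight, n::Int)
    n >= 1 || throw(ArgumentError("n must be positive"))
    items = Vector{eltype(iter)}()
    priorities = Float64[]
    for x in iter
        key = exp(-randexp(rng) / weight(x))   # same distribution as rand()^(1 / w), w > 0
        if length(items) < n
            push!(items, x)
            push!(priorities, key)
        else
            i = argmin(priorities)             # linear scan; a heap would make this O(log n)
            if key > priorities[i]
                items[i] = x
                priorities[i] = key
            end
        end
    end
    return items
end

# Larger numbers are more likely to be picked here, since weight(x) = x.
heavy = weighted_reservoir_sketch(Xoshiro(42), 1:1000, x -> Float64(x), 10)
```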

It would even be possible to integrate itsample into StatsBase for discoverability purposes: the StatsBase.sample function could then be used on any kind of iterable, not only on AbstractArray. This would mean, however, that the user needs to be a bit more careful, because collecting the iterator is usually more performant if one needs multiple samples from it. If anyone has thoughts on this, please let me know!

I also made a little benchmark comparison with StatsBase.sample on a simple iterator:

(where “collection-based with setup” means that collecting the iterator in memory is considered part of the benchmark)

Importantly, I renamed the package since I thought that StreamSampling.jl was more appropriate. I can no longer edit the comments referring to the previous name; hopefully it won’t cause too much confusion.

Tortar · October 7, 2024, 3:38pm

Some updates for v0.5.0:

The library now supports multithreaded merging of reservoirs, as @foobar_lv2 suggested, with merge/merge!, and it allows creating a single sample from multiple iterables in parallel with the itsample function. It works only with (weighted) sampling with replacement, though.
A small illustrative example is now in the docs (An Illustrative Example · StreamSampling.jl), and more benchmarks of the sampling procedures are at StreamSampling.jl/benchmark at main · JuliaDynamics/StreamSampling.jl · GitHub.
The API is now in line with that of OnlineStatsBase.jl and the internals are much more polished.
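
To see why merging is simple for sampling with replacement, here is a small sketch of the unweighted case. It assumes both samples have the same size and that the number of items each stream contained (`n1`, `n2`) is known; the function name is hypothetical and this is not the package's merge implementation.

```julia
using Random

# a[i] and b[i] are independent uniform draws (with replacement) from two
# disjoint streams that contained n1 and n2 items respectively.
function merge_with_replacement_sketch(rng::AbstractRNG, a::Vector{T}, n1::Int,
                                       b::Vector{T}, n2::Int) where {T}
    length(a) == length(b) || throw(ArgumentError("samples must have equal size"))
    # Slot i keeps a[i] with probability n1 / (n1 + n2), otherwise takes b[i].
    return [rand(rng, 1:(n1 + n2)) <= n1 ? a[i] : b[i] for i in eachindex(a)]
end
```

Each slot can be decided independently of the others, so the join needs no coordination beyond the two counts. In the weighted case, the total weights seen by each stream would play the role of `n1` and `n2`.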

Let me know if you have suggestions for improvements. I was mostly focused on reservoir sampling techniques, but I think I could maybe broaden the scope of the library a bit more.

Interestingly, it is actually one of the few open source libraries that try to implement these techniques; I only found GitHub - bigmlcom/sampling: Random Sampling in Clojure as a similar attempt.

Tortar · October 11, 2024, 3:38pm

I released a new version (v0.6) which tweaks the API a bit. At the same time I implemented two new (but old) algorithms for stream sampling without a reservoir, so that they require O(1) memory for unweighted sampling with and without replacement. They are algorithm D by Vitter and algorithm 4 by Bentley, both brought back from research in the eighties.
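
To illustrate what "without a reservoir" means, here is a sketch of the much simpler selection sampling scheme (Knuth's Algorithm S). It is a stand-in for the idea, not algorithm D itself: it also needs the stream length `N` up front and uses O(1) memory, but draws one random number per item, whereas algorithm D skips ahead. The function name and its callback argument `f` are invented for this example.

```julia
using Random

# Selection sampling: visit each item once and select it with probability
# (items still needed) / (items not yet visited, including the current one).
function selection_sample_sketch(f, rng::AbstractRNG, iter, N::Int, n::Int)
    needed = min(n, N)
    for (seen, x) in enumerate(iter)
        needed == 0 && break
        if rand(rng, 1:(N - seen + 1)) <= needed
            f(x)            # hand the selected item to the caller right away
            needed -= 1
        end
    end
    return nothing
end

# Selected items arrive in stream order; nothing is buffered by the sampler.
picked = Int[]
selection_sample_sketch(Xoshiro(42), 1:10^6, 10^6, 5) do x
    push!(picked, x)
end
```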

Tortar · August 14, 2025, 11:32am

StreamSampling.jl v0.7

New major version landed with some important updates:

The merging API is almost complete: it is now possible to merge multiple samplers together to obtain a single summary of multiple streams for almost all algorithms. This is very helpful for parallel sampling.
A new example about sampling from on-disk data. See StreamSampling.jl/stable/example/#Sampling-from-Data-on-Disk. We achieve with ease a sampling performance of 500MB/s with 4 threads with HDF5.jl and 2GB/s with Arrow.jl! The most interesting aspect is that the sampling algorithm becomes the bottleneck in the sampling process, more than loading data from disk (at least with an SSD).
Many general performance and interface improvements.

jling · August 15, 2025, 2:56am

Given this is a sequential read from disk and Arrow in theory is very performant, I wonder what this represents in terms of the % of max disk I/O?

Tortar · August 15, 2025, 1:26pm

My SSD should have a peak sequential read performance of 3.18GB/s based on specs. I have now tried sampling a 100GB Arrow file with 6 threads and it took around 40 seconds, so 2.5GB/s (which is also similar to the read speed I see in the System Monitor during sampling). So the % of max disk I/O is around 80%.
