Iterators over observations/data

Is there a reason why fit(Histogram, obs::AbstractArray, bins, ...) in StatsBase does not accept an iterator over the observations? Possible sources of observations are streams, CSV files, databases, etc. Simply binning them should not require indexing with getindex(obs, ...). In my case a 2D histogram is made from ~250 MB of observations in an SQLite database, so this is not a show stopper at present. Typical data sets for histograms fit easily into arrays in memory, but that is not always the case.

I have tried to look at the binning part of the code, but it is a bit beyond me. Are there alternative packages? In MATLAB, accumarray can do the job in a vectorized way. It has been adapted to Julia in VectorizedRoutines.

Generally an iterator interface to data would avoid allocations and be more “Julian”.
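To make the idea concrete, here is a minimal sketch of binning straight from an iterator, with no indexing into the observations. It is not how StatsBase does it; hist2d is a hypothetical helper name, and the bins are assumed to be left-closed with out-of-range observations dropped.

```julia
# Hypothetical helper (not part of StatsBase): count observations from any
# iterator of (x, y) pairs into a 2D grid, one observation at a time.
# `xedges` and `yedges` must be sorted; bins are left-closed, [a, b).
function hist2d(obs, xedges::AbstractVector, yedges::AbstractVector)
    counts = zeros(Int, length(xedges) - 1, length(yedges) - 1)
    for (x, y) in obs
        i = searchsortedlast(xedges, x)   # index of the last edge <= x
        j = searchsortedlast(yedges, y)   # index of the last edge <= y
        # Assumption: observations outside the edges are silently skipped.
        if 1 <= i <= size(counts, 1) && 1 <= j <= size(counts, 2)
            counts[i, j] += 1
        end
    end
    return counts
end

# A lazy zip stands in here for a database cursor or a file reader.
obs = zip(0.05:0.1:0.95, 0.95:-0.1:0.05)
counts = hist2d(obs, 0.0:0.5:1.0, 0.0:0.5:1.0)
```

Only the counts matrix is allocated, so memory use depends on the number of bins and not on the number of observations.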

OnlineStats.jl was made for exactly this kind of streaming statistics.
