Binning Large Collections of Timestamps

Hey, I have a large amount of timestamps (in the form of an array of Float64 seconds), and I need to bin them (into ‘counts’, so something like [1, 10, 40, 122] binned to 60 seconds would give counts of [3, 0, 1] with time bins of 0:1:2) before taking a Fourier transform to do some timing analysis.

Previously I just used the Histogram fit to bin my data, however now that I’ve started working with bigger datasets (500 million+ timestamps) the Histogram function is a bit too slow. Looking around I haven’t found many alternatives.

Does anybody know of a package that does binning for large amounts of data, or if there’s some efficient way to parallelise/improve the process? Right now I’m thinking of reshaping the single column of timestamps into as many columns as there are processors, then performing the histogram fit on each of those in parallel, and summing the resulting histogram weights. But I imagine there’s a better solution floating around somewhere else.

1 Like

OnlineStats.jl can do histogram fitting out-of-memory: https://joshday.github.io/OnlineStats.jl/stable/api.html#OnlineStats.Hist

using OnlineStats
x = rand(1_000_000)
edges = 0:0.01:1
h = Hist(edges)
fit!(h, x)

You can also merge histograms, so you can chunk your dataset and process it in parallel:

x2 = rand(1_000_000)
h2 = Hist(edges)
fit!(h2, x2)
hfinal = merge(h1, h2)
3 Likes

Oh wow, that’s fantastic! Thanks for the help, now instead of waiting minutes to hours for a server to finish just the binning, my laptop can do it in a few seconds. That’s actually amazing.

3 Likes

I know, the first time I used OnlineStats with a big dataset it finished so fast I had to double-check to convince myself it had actually calculated anything :laughing:

2 Likes

I suspect StatsBase.countmap (method 1) or method 2 below would be quite fast for this, see code below. Assuming you don’t need to read the data small chunks at a time in which case OnlineStats.jl would be a better choice.

using StatsBase
timestamps = rand(Float64,Int64(5e7))*6000.+1

# method 1 using StatsBase.countmap
@time countmap(floor.((timestamps .- 1) ./ 60).+1)
@time countmap(floor.((timestamps .- 1) ./ 60).+1)

# method 2
count_bin(timestamps) = begin
  max_timestampe_range = Int64(floor((maximum(timestamps)-1)/60))+1

  hist_arr = [0 for i = 1:max_timestampe_range]

 @inbounds for j in Int64.(floor.((timestamps .- 1) ./ 60).+1)
    hist_arr[j] += 1
  end
  hist_arr
end

@time count_bin(timestamps)
@time count_bin(timestamps)
1 Like

Awesome, thanks for the tip, I was sure there’d be a simple way to do this in Base (or, well, things that were in Base) that just didn’t occur to me.

You might also be interested in https://github.com/joshday/AverageShiftedHistograms.jl, which can fit histograms much faster than StatsBase, can be updated online just like in OnlineStats, and also does kernel density estimation on the histogram to smooth the result.

julia> using AverageShiftedHistograms, StatsBase, BenchmarkTools

julia> y = randn(10^7);

julia> @btime fit(Histogram, y, -5:.1:5; closed=:left);
  385.248 ms (7 allocations: 1.17 KiB)

julia> @btime ash(y, rng=-5:.1:5);
  27.625 ms (6 allocations: 2.03 KiB)

Edit: The downside of AverageShiftedHistograms compared to OnlineStats is that you need either 1) know the range of your data at the start, or 2) start with a representative sample. OnlineStats can adaptively change the bins to accept data outside the current range whereas AverageShiftedHistograms cannot.

2 Likes