Hey, I have a large number of timestamps (in the form of an array of Float64 seconds), and I need to bin them into counts before taking a Fourier transform to do some timing analysis. For example, [1, 10, 40, 122] binned with a 60-second bin width would give counts of [3, 0, 1] for bins starting at 0, 60, and 120 seconds.
Previously I just used fit(Histogram, ...) from StatsBase to bin my data, but now that I've started working with bigger datasets (500 million+ timestamps) it's become too slow. Looking around, I haven't found many alternatives.
Does anybody know of a package that does binning for large amounts of data, or an efficient way to parallelise or otherwise speed up the process? Right now I'm thinking of splitting the single column of timestamps into as many chunks as there are processors, binning each chunk in parallel, and summing the resulting histogram weights, roughly as in the sketch below. But I imagine there's a better solution floating around somewhere.
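Something like this rough, untested sketch is what I have in mind (the name parallel_bin is just made up; it assumes 60 s bins starting at t = 0 and julia started with multiple threads):

using Base.Threads

# split the timestamps into one chunk per thread, bin each chunk
# independently, then sum the per-chunk counts
function parallel_bin(timestamps; binwidth = 60.0)
    nbins = Int(fld(maximum(timestamps), binwidth)) + 1
    chunks = collect(Iterators.partition(timestamps, cld(length(timestamps), nthreads())))
    partials = [zeros(Int, nbins) for _ in chunks]  # one count vector per chunk
    @threads for i in eachindex(chunks)
        h = partials[i]
        for t in chunks[i]
            h[Int(fld(t, binwidth)) + 1] += 1       # 1-based bin index for t
        end
    end
    reduce(+, partials)                             # merge the partial histograms
end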
Oh wow, that's fantastic! Thanks for the help. Now, instead of waiting minutes to hours for a server to finish just the binning, my laptop can do it in a few seconds. That's actually amazing.
I know, the first time I used OnlineStats with a big dataset it finished so fast I had to double-check to convince myself it had actually calculated anything.
I suspect StatsBase.countmap (method 1) or the manual binning loop (method 2) in the code below would be quite fast for this. This assumes you don't need to read the data in small chunks at a time; if you do, OnlineStats.jl would be the better choice (see the sketch after the code).
using StatsBase

timestamps = rand(Float64, 5 * 10^7) .* 6000 .+ 1  # 50 million timestamps in [1, 6001)

# method 1: StatsBase.countmap on the bin index of each timestamp
# (run twice so the second timing excludes compilation)
@time countmap(floor.((timestamps .- 1) ./ 60) .+ 1)
@time countmap(floor.((timestamps .- 1) ./ 60) .+ 1)

# method 2: preallocate a count vector and increment bins in a loop
function count_bin(timestamps)
    nbins = Int(floor((maximum(timestamps) - 1) / 60)) + 1
    hist_arr = zeros(Int, nbins)
    @inbounds for t in timestamps
        hist_arr[Int(floor((t - 1) / 60)) + 1] += 1
    end
    hist_arr
end
@time count_bin(timestamps)
@time count_bin(timestamps)
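If you do have to stream the data in chunks, something along these lines with OnlineStats.jl should do it (a sketch; I'm going from memory on the Hist/fit!/counts API, and the bin edges do have to be known up front):

using OnlineStats

edges = 0:60:6060                 # fixed bin edges covering the data range
o = Hist(edges)
for chunk in Iterators.partition(timestamps, 10^6)  # stand-in for reading chunks from disk
    fit!(o, chunk)                # update the histogram in place
end
o.counts                          # the binned counts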
You might also be interested in https://github.com/joshday/AverageShiftedHistograms.jl, which can fit histograms much faster than StatsBase, can be updated online just like in OnlineStats, and also does kernel density estimation on the histogram to smooth the result.
julia> using AverageShiftedHistograms, StatsBase, BenchmarkTools
julia> y = randn(10^7);
julia> @btime fit(Histogram, y, -5:.1:5; closed=:left);
385.248 ms (7 allocations: 1.17 KiB)
julia> @btime ash(y, rng=-5:.1:5);
27.625 ms (6 allocations: 2.03 KiB)
Edit: The downside of AverageShiftedHistograms compared to OnlineStats is that you need to either 1) know the range of your data at the start, or 2) start with a representative sample. OnlineStats can adaptively change the bins to accept data outside the current range, whereas AverageShiftedHistograms cannot.
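For completeness, I believe the adaptive version in OnlineStats is ExpandingHist (again going from memory on the API), which keeps widening its bin edges to cover out-of-range data:

using OnlineStats

o = ExpandingHist(500)            # 500 bins, range not needed in advance
fit!(o, randn(10^6))
fit!(o, 100 .* randn(10^4))       # out-of-range data just expands the edges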