Hey, I have a large number of timestamps (in the form of an array of Float64 seconds), and I need to bin them into counts before taking a Fourier transform to do some timing analysis. For example, [1, 10, 40, 122] binned with a 60-second bin width would give counts of [3, 0, 1] for bins starting at 0, 60, and 120 seconds.
Previously I just used fit(Histogram, ...) from StatsBase to bin my data, but now that I've started working with bigger datasets (500 million+ timestamps) it's become too slow. Looking around, I haven't found many alternatives.
Does anybody know of a package that does binning for large amounts of data, or an efficient way to parallelise or otherwise speed up the process? Right now I'm thinking of splitting the single column of timestamps into as many chunks as there are processors, binning each chunk in parallel, and summing the resulting histogram weights, roughly as in the sketch below. But I imagine there's a better solution floating around somewhere.
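Something like this rough, untested sketch is what I have in mind (the name parallel_bin is just made up; it assumes 60 s bins starting at t = 0 and julia started with multiple threads):

using Base.Threads

# split the timestamps into one chunk per thread, bin each chunk
# independently, then sum the per-chunk counts
function parallel_bin(timestamps; binwidth = 60.0)
    nbins = Int(fld(maximum(timestamps), binwidth)) + 1
    chunks = collect(Iterators.partition(timestamps, cld(length(timestamps), nthreads())))
    partials = [zeros(Int, nbins) for _ in chunks]  # one count vector per chunk
    @threads for i in eachindex(chunks)
        h = partials[i]
        for t in chunks[i]
            h[Int(fld(t, binwidth)) + 1] += 1       # 1-based bin index for t
        end
    end
    reduce(+, partials)                             # merge the partial histograms
end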
Oh wow, that's fantastic! Thanks for the help. Now, instead of waiting minutes to hours for a server to finish just the binning, my laptop can do it in a few seconds. That's actually amazing.
I know, the first time I used OnlineStats with a big dataset it finished so fast I had to double-check to convince myself it had actually calculated anything.
I suspect StatsBase.countmap (method 1) or the manual binning loop (method 2) in the code below would be quite fast for this. This assumes you don't need to read the data in small chunks at a time; if you do, OnlineStats.jl would be the better choice (see the sketch after the code).
using StatsBase

timestamps = rand(Float64, 5 * 10^7) .* 6000 .+ 1  # 50 million timestamps in [1, 6001)

# method 1: StatsBase.countmap on the bin index of each timestamp
# (run twice so the second timing excludes compilation)
@time countmap(floor.((timestamps .- 1) ./ 60) .+ 1)
@time countmap(floor.((timestamps .- 1) ./ 60) .+ 1)

# method 2: preallocate a count vector and increment bins in a loop
function count_bin(timestamps)
    nbins = Int(floor((maximum(timestamps) - 1) / 60)) + 1
    hist_arr = zeros(Int, nbins)
    @inbounds for t in timestamps
        hist_arr[Int(floor((t - 1) / 60)) + 1] += 1
    end
    hist_arr
end
@time count_bin(timestamps)
@time count_bin(timestamps)
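If you do have to stream the data in chunks, something along these lines with OnlineStats.jl should do it (a sketch; I'm going from memory on the Hist/fit!/counts API, and the bin edges do have to be known up front):

using OnlineStats

edges = 0:60:6060                 # fixed bin edges covering the data range
o = Hist(edges)
for chunk in Iterators.partition(timestamps, 10^6)  # stand-in for reading chunks from disk
    fit!(o, chunk)                # update the histogram in place
end
o.counts                          # the binned counts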
You might also be interested in https://github.com/joshday/AverageShiftedHistograms.jl, which can fit histograms much faster than StatsBase, can be updated online just like in OnlineStats, and also does kernel density estimation on the histogram to smooth the result.
julia> using AverageShiftedHistograms, StatsBase, BenchmarkTools
julia> y = randn(10^7);
julia> @btime fit(Histogram, y, -5:.1:5; closed=:left);
385.248 ms (7 allocations: 1.17 KiB)
julia> @btime ash(y, rng=-5:.1:5);
27.625 ms (6 allocations: 2.03 KiB)
Edit: The downside of AverageShiftedHistograms compared to OnlineStats is that you need to either 1) know the range of your data at the start, or 2) start with a representative sample. OnlineStats can adaptively change the bins to accept data outside the current range, whereas AverageShiftedHistograms cannot.
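For completeness, I believe the adaptive version in OnlineStats is ExpandingHist (again going from memory on the API), which keeps widening its bin edges to cover out-of-range data:

using OnlineStats

o = ExpandingHist(500)            # 500 bins, range not needed in advance
fit!(o, randn(10^6))
fit!(o, 100 .* randn(10^4))       # out-of-range data just expands the edges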