How to bin data properly for further study, by example to plot it estimated PDf and CDF?

I didn’t found a package to bin data such that I can use the binned data for further analysis, just I found some bin functions inside plotting functions, by example to plot histograms or so, but I want just the binned data, not plot it.

There is some package for this? If not, how to accomplish this task efficiently for large amounts of data (arrays of 20 million or so of length)?

P.S.: Im not so sure if this is the correct place to put this question. If not my apologies.

2 Likes
using StatsBase
x = randn(100)
h = fit(Histogram, x)
h.edges
h.weights

20M data points are not that many, but the package OnlineStats also has functionality for this if you prefer.

2 Likes

Thank you very much, Im seeing the documentation of fit(Histogram,...) of StatsBase right now, however it is not so clear how to transform the frequencies to fit a probability density function (that is, it is supposed that the data that is binned are random samples from an unknown continuous random variable).

Can you help me there?

You can normalize it to a density (see the StatsBase docs), but you might just want to fit a kernel density estimate directly.

The exact answer mostly depends on the methodology you prefer. If you are after nonparametric estimation, choose a method for that, which may or may not involve binning first.

You can use the KernalDensity package

julia> using KernelDensity, Plots

julia> d = rand(1000)
1000-element Array{Float64,1}:
 0.5923175294995007 
 0.6219508791482606 
 0.11909739570472966
 0.2696762736531302 
 0.6065472676384998 
 0.8147677674690268 
 0.09467855373568423
 0.7902296886734237 
 0.14590266743335856
 0.738212509820112  
 ⋮                  
 0.6640510128100363 
 0.02775649643695477
 0.7622063565473378 
 0.2182739231701778 
 0.6774344291684018 
 0.49430891353306783
 0.12232537847668223
 0.8532230145650415 
 0.9824081809204155 

julia> k = kde(d)
UnivariateKDE{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}(-0.2584127475678981:0.0007416269643390421:1.2596976484341211, [7.07382e-5, 7.10556e-5, 7.15259e-5, 7.215e-5, 7.2929e-5, 7.3864e-5, 7.49568e-5, 7.6209e-5, 7.76228e-5, 7.92004e-5  …  7.59824e-5, 7.47586e-5, 7.36936e-5, 7.27858e-5, 7.20335e-5, 7.14357e-5, 7.09913e-5, 7.06998e-5, 7.05605e-5, 7.05733e-5])

julia> pdf(k, 0.5)
1.0437705563742128

julia> plot(0:0.01:1, pdf.(Ref(k), 0:0.01:1))
2 Likes

Sorry, but I dont find something about normalization in the StatsBase docs.

Thank you very much, I will test this package. Just one more thing: it could be possible also estimate the CDF with this package? Im reading it tiny documentation but it doesn’t say something about the CDF. Thank you in advance.

julia> r = 0:0.01:1
0.0:0.01:1.0
julia> x = pdf.(Ref(k), r);

julia> cdfx = cumsum(x)/sum(x);

julia> plot(r, cdfx)
1 Like

ecdf from StatsBase would be another option, I think.

2 Likes

Just one more question to understand your code: why you had written pdf.(Ref(k),r) instead of just pdf.(k,r)? It is for something related to efficiency? I was reading the documentation of Ref but it is not so clear what it does (Im not a programmer, just casual programmer).

It is documented in

https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting-1

I agree that the documentation could be improved, eg in the docstring of Ref, so I made

https://github.com/JuliaLang/julia/pull/32645

2 Likes

However I still doesn’t understand what Ref does in the code of @aaowens, we can just use instead pdf.(k,r). You said that Ref in broadcasting make the reference be treated as an scalar, but I dont know what it really mean for the case of the code above.

I see the problem. You can also use pdf(k, r) in this case because a vectorized method is defined for pdf. In fact, you should do that instead because it is a lot faster in this case.

By default these vectorized methods aren’t defined and we just broadcast over the scalar version. In some cases (like here) there is a vectorized method defined which may be faster. I think there’s some fixed cost initial work it’s doing before evaluating the pdf, but the broadcast version pays that many times.

2 Likes