How to bin data properly for further study, for example to plot its estimated PDF and CDF?

Masacroso · July 18, 2019, 5:34pm

I didn’t find a package to bin data so that I can use the binned data for further analysis. I only found binning functions inside plotting functions, for example to plot histograms, but I want just the binned data, not a plot of it.

Is there a package for this? If not, how can this task be done efficiently for large amounts of data (arrays of length 20 million or so)?

P.S.: I’m not sure whether this is the correct place for this question. If not, my apologies.

Tamas_Papp · July 18, 2019, 5:58pm

using StatsBase
x = randn(100)
h = fit(Histogram, x)
h.edges
h.weights

20M data points are not that many, but the package OnlineStats also has functionality for this if you prefer.

Masacroso · July 18, 2019, 6:12pm

Thank you very much. I’m looking at the documentation of fit(Histogram, ...) in StatsBase right now, but it is not clear how to transform the frequencies into an estimate of a probability density function (that is, the binned data are assumed to be random samples from an unknown continuous random variable).

Can you help me with that?

Tamas_Papp · July 18, 2019, 6:16pm

You can normalize it to a density (see the StatsBase docs), but you might just want to fit a kernel density estimate directly.

The exact answer mostly depends on the methodology you prefer. If you are after nonparametric estimation, choose a method for that, which may or may not involve binning first.
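As a rough sketch of the normalization route: StatsBase extends LinearAlgebra's normalize for Histogram objects, and mode = :pdf rescales the weights so that the histogram integrates to 1. The bin edges and sample size below are arbitrary choices for illustration, not something prescribed by this thread.

```julia
using StatsBase, LinearAlgebra   # normalize comes from LinearAlgebra; StatsBase adds a Histogram method

x = randn(1_000_000)                    # stand-in for the real data
edges = -6:0.05:6                       # hypothetical bin edges, chosen for illustration
h = fit(Histogram, x, edges)            # raw counts per bin
hpdf = normalize(h, mode = :pdf)        # rescaled so the histogram integrates to 1

binwidth = step(edges)
total = sum(hpdf.weights) * binwidth    # should be approximately 1

# Binned CDF estimate at the right edge of each bin (samples outside the edges are ignored)
cdf_binned = cumsum(h.weights) ./ sum(h.weights)
```

The density values in hpdf.weights and the cumulative values in cdf_binned can then be used directly in further analysis without plotting anything.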

aaowens · July 18, 2019, 6:55pm

You can use the KernelDensity package:

julia> using KernelDensity, Plots

julia> d = rand(1000)
1000-element Array{Float64,1}:
 0.5923175294995007 
 0.6219508791482606 
 0.11909739570472966
 0.2696762736531302 
 0.6065472676384998 
 0.8147677674690268 
 0.09467855373568423
 0.7902296886734237 
 0.14590266743335856
 0.738212509820112  
 ⋮                  
 0.6640510128100363 
 0.02775649643695477
 0.7622063565473378 
 0.2182739231701778 
 0.6774344291684018 
 0.49430891353306783
 0.12232537847668223
 0.8532230145650415 
 0.9824081809204155 

julia> k = kde(d)
UnivariateKDE{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}(-0.2584127475678981:0.0007416269643390421:1.2596976484341211, [7.07382e-5, 7.10556e-5, 7.15259e-5, 7.215e-5, 7.2929e-5, 7.3864e-5, 7.49568e-5, 7.6209e-5, 7.76228e-5, 7.92004e-5  …  7.59824e-5, 7.47586e-5, 7.36936e-5, 7.27858e-5, 7.20335e-5, 7.14357e-5, 7.09913e-5, 7.06998e-5, 7.05605e-5, 7.05733e-5])

julia> pdf(k, 0.5)
1.0437705563742128

julia> plot(0:0.01:1, pdf.(Ref(k), 0:0.01:1))

Masacroso · July 18, 2019, 7:13pm

Sorry, but I can’t find anything about normalization in the StatsBase docs.

Masacroso · July 18, 2019, 7:27pm

Thank you very much, I will test this package. Just one more thing: is it also possible to estimate the CDF with this package? I’m reading its short documentation, but it doesn’t say anything about the CDF. Thank you in advance.

aaowens · July 18, 2019, 8:18pm

julia> r = 0:0.01:1
0.0:0.01:1.0
julia> x = pdf.(Ref(k), r);

julia> cdfx = cumsum(x)/sum(x);

julia> plot(r, cdfx)
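Note that dividing by sum(x) forces the curve to reach 1 at the end of r, so any density mass the KDE places outside r is ignored. A possible alternative, sketched here under the assumption that the returned object exposes its grid and values as the fields x and density (the range and the vector shown in the printed output above), is to integrate over the KDE's own grid:

```julia
using KernelDensity

d = rand(1000)                               # same kind of sample as above
k = kde(d)

# k.x is the evaluation grid (a range) and k.density the estimated density on that grid.
# A Riemann sum over the full grid approximates the CDF at each grid point.
cdf_grid = cumsum(k.density) .* step(k.x)
```

Here cdf_grid[i] approximates the CDF at k.x[i], and the last entry should come out close to 1 without any explicit renormalization.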

Karajan · July 19, 2019, 6:21am

ecdf from StatsBase would be another option, I think.
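A minimal sketch of that option, with randn data standing in for the real sample: ecdf returns a callable object that gives the fraction of samples less than or equal to its argument, with no binning or smoothing involved.

```julia
using StatsBase

x = randn(1000)            # stand-in for the real data
F = ecdf(x)                # empirical CDF as a callable object

p0 = F(0.0)                # fraction of samples ≤ 0.0
pmax = F(maximum(x))       # every sample is ≤ the maximum, so this is 1.0
pmin = F(minimum(x) - 1)   # no sample lies below the minimum, so this is 0.0
```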

Masacroso · July 21, 2019, 3:48am

Just one more question to understand your code: why did you write pdf.(Ref(k), r) instead of just pdf.(k, r)? Is it related to efficiency? I was reading the documentation of Ref, but it is not clear to me what it does (I’m not a programmer, just a casual one).

Tamas_Papp · July 21, 2019, 5:53am

It is documented in

https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting-1

I agree that the documentation could be improved, e.g. in the docstring of Ref, so I made

https://github.com/JuliaLang/julia/pull/32645

Masacroso · July 21, 2019, 8:05am

However, I still don’t understand what Ref does in @aaowens’s code; it seems we could just use pdf.(k, r) instead. You said that Ref in broadcasting makes the referenced value be treated as a scalar, but I don’t know what that really means for the code above.

aaowens · July 21, 2019, 3:02pm

I see the problem. You can also use pdf(k, r) in this case because a vectorized method is defined for pdf. In fact, you should do that instead because it is a lot faster in this case.

By default these vectorized methods aren’t defined and we just broadcast over the scalar version. In some cases (like here) there is a vectorized method defined, which may be faster. I think it does some fixed-cost initial work before evaluating the pdf, and the broadcast version pays that cost once per element.
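To illustrate what Ref does in a broadcast call, here is a small example using only Base, independent of KernelDensity. The names haystack and needles are made up for the illustration.

```julia
# Base-only illustration of Ref in broadcasting; no packages needed.
haystack = [2, 3, 5]
needles = 1:4

# Ref(haystack) makes broadcasting treat the whole vector as a single scalar-like
# argument, so each needle is tested against the full haystack.
hits = in.(needles, Ref(haystack))   # one Bool per needle

# Without Ref, broadcasting would try to pair needles[i] with haystack[i] and
# would fail here because the lengths (4 and 3) differ.
```

In pdf.(Ref(k), r), Ref(k) plays the same role: k is held fixed as one object while the broadcast iterates over the points in r.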
