# How to bin data properly for further analysis, for example to plot its estimated PDF and CDF?

I couldn't find a package to bin data such that I can use the binned data for further analysis. I only found binning functions inside plotting functions, for example for plotting histograms, but I want just the binned data, without plotting it.

Is there a package for this? If not, how can I accomplish this task efficiently for large amounts of data (arrays of around 20 million elements)?

P.S.: I'm not sure whether this is the correct place for this question. If not, my apologies.

```julia
using StatsBase

x = randn(100)
h = fit(Histogram, x)  # bin the data
h.edges    # the bin edges
h.weights  # the count in each bin
```

20M data points are not that many, but the package OnlineStats also has functionality for this if you prefer.


Thank you very much. I'm reading the documentation of `fit(Histogram, ...)` in StatsBase right now; however, it is not clear how to transform the frequencies into an estimated probability density function (that is, the binned data are assumed to be random samples from an unknown continuous random variable).

Can you help me there?

You can normalize it to a density (see the StatsBase docs), but you might just want to fit a kernel density estimate directly.
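For reference, StatsBase extends `LinearAlgebra.normalize` with a method for `Histogram` that does this conversion. A minimal sketch (variable names are just for illustration):

```julia
using StatsBase, LinearAlgebra

x = randn(10_000)
h = fit(Histogram, x)

# mode=:pdf rescales the bin weights so the histogram integrates to 1,
# i.e. the weights become density estimates
hpdf = normalize(h, mode=:pdf)

edges = collect(hpdf.edges[1])
mids = (edges[1:end-1] .+ edges[2:end]) ./ 2  # bin midpoints
density = hpdf.weights                        # estimated density in each bin
```

`mode=:probability` is also available if you want the weights to sum to 1 instead of integrating to 1.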

The exact answer mostly depends on the methodology you prefer. If you are after nonparametric estimation, choose a method for that, which may or may not involve binning first.

You can use the KernelDensity package:

```julia
julia> using KernelDensity, Plots

julia> d = rand(1000)
1000-element Array{Float64,1}:
0.5923175294995007
0.6219508791482606
0.11909739570472966
0.2696762736531302
0.6065472676384998
0.8147677674690268
0.09467855373568423
0.7902296886734237
0.14590266743335856
0.738212509820112
⋮
0.6640510128100363
0.02775649643695477
0.7622063565473378
0.2182739231701778
0.6774344291684018
0.49430891353306783
0.12232537847668223
0.8532230145650415
0.9824081809204155

julia> k = kde(d)
UnivariateKDE{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}(-0.2584127475678981:0.0007416269643390421:1.2596976484341211, [7.07382e-5, 7.10556e-5, 7.15259e-5, 7.215e-5, 7.2929e-5, 7.3864e-5, 7.49568e-5, 7.6209e-5, 7.76228e-5, 7.92004e-5  …  7.59824e-5, 7.47586e-5, 7.36936e-5, 7.27858e-5, 7.20335e-5, 7.14357e-5, 7.09913e-5, 7.06998e-5, 7.05605e-5, 7.05733e-5])

julia> pdf(k, 0.5)
1.0437705563742128

julia> plot(0:0.01:1, pdf.(Ref(k), 0:0.01:1))
```

Sorry, but I can't find anything about normalization in the StatsBase docs.

Thank you very much, I will test this package. Just one more thing: would it also be possible to estimate the CDF with this package? I'm reading its (tiny) documentation, but it doesn't say anything about the CDF. Thank you in advance.

```julia
julia> r = 0:0.01:1
0.0:0.01:1.0

julia> x = pdf.(Ref(k), r);

julia> cdfx = cumsum(x) / sum(x);  # normalized cumulative sum approximates the CDF

julia> plot(r, cdfx)
```

ecdf from `StatsBase` would be another option, I think.
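A minimal sketch of that approach: `ecdf` returns a callable empirical CDF that you can evaluate at any point or broadcast over a grid.

```julia
using StatsBase

x = randn(10_000)
F = ecdf(x)  # empirical CDF as a callable object

F(0.0)                # fraction of samples ≤ 0.0 (≈ 0.5 for standard normal data)
vals = F.(-3:0.1:3)   # evaluate on a grid, e.g. for plotting
```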


Just one more question to understand your code: why did you write `pdf.(Ref(k), r)` instead of just `pdf.(k, r)`? Is it something related to efficiency? I was reading the documentation of `Ref`, but it is not clear to me what it does (I'm not a programmer, just a casual one).

It is documented in

I agree that the documentation could be improved, e.g. in the docstring of `Ref`, so I made
However, I still don't understand what `Ref` does in the code of @aaowens; can't we just use `pdf.(k, r)` instead? You said that `Ref` in broadcasting makes the reference be treated as a scalar, but I don't know what that really means for the code above.
I see the problem. You can also use `pdf(k, r)` in this case because a vectorized method is defined for `pdf`. In fact, you should do that instead because it is a lot faster in this case.
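To illustrate the two points above, here is a small sketch comparing the call styles (the grid `r` is just an example):

```julia
using KernelDensity

k = kde(randn(1_000))
r = -3:0.1:3

# `Ref(k)` wraps the KDE object so broadcasting treats it as a scalar
# and only iterates over `r`, calling the scalar method pdf(k, x) per point
y1 = pdf.(Ref(k), r)

# KernelDensity also defines a vectorized method that evaluates the
# whole grid at once, which is the faster option here
y2 = pdf(k, r)
```

Both give the same values; the vectorized method avoids repeating per-call setup for every grid point.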