I couldn't find a package to bin data so that I can use the binned data for further analysis; I only found binning buried inside plotting functions, for example to plot histograms, but I want just the binned data, not a plot.
Is there a package for this? If not, how can I accomplish this task efficiently for large amounts of data (arrays of around 20 million elements)?
P.S.: I'm not sure whether this is the correct place to post this question. If not, my apologies.
using StatsBase
x = randn(100)
h = fit(Histogram, x)  # bin the data, no plotting involved
h.edges                # the bin edges
h.weights              # the count in each bin
20M data points are not that many, but the package OnlineStats also has functionality for this if you prefer.
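For instance, a rough sketch (I haven't double-checked the Hist constructor against the current OnlineStats release, so treat it as an assumption):
using OnlineStats
# Accumulate a histogram with fixed bin edges one observation at a time,
# which also works for streams too large to hold in memory.
o = fit!(Hist(-5:0.1:5), randn(10^6))
value(o)  # the accumulated histogram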
Thank you very much. I'm looking at the documentation of fit(Histogram, ...) from StatsBase right now; however, it is not clear how to transform the frequencies into a probability density function (the binned data are supposed to be random samples from an unknown continuous random variable).
Can you help me there?
You can normalize it to a density (see the StatsBase docs), but you might just want to fit a kernel density estimate directly.
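Something like this should work for the normalization (a sketch; check the docstring of normalize for a Histogram in the StatsBase docs):
using StatsBase, LinearAlgebra
x = randn(10^4)
h = fit(Histogram, x)
hpdf = normalize(h, mode = :pdf)  # rescale the weights so the histogram integrates to 1
hpdf.weights                      # density estimate in each bin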
The exact answer mostly depends on the methodology you prefer. If you are after nonparametric estimation, choose a method for that, which may or may not involve binning first.
You can use the KernelDensity package:
julia> using KernelDensity, Plots
julia> d = rand(1000)
1000-element Array{Float64,1}:
0.5923175294995007
0.6219508791482606
0.11909739570472966
⋮
0.8532230145650415
0.9824081809204155
julia> k = kde(d)
UnivariateKDE{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}(-0.2584127475678981:0.0007416269643390421:1.2596976484341211, [7.07382e-5, 7.10556e-5, 7.15259e-5, 7.215e-5, 7.2929e-5, 7.3864e-5, 7.49568e-5, 7.6209e-5, 7.76228e-5, 7.92004e-5 … 7.59824e-5, 7.47586e-5, 7.36936e-5, 7.27858e-5, 7.20335e-5, 7.14357e-5, 7.09913e-5, 7.06998e-5, 7.05605e-5, 7.05733e-5])
julia> pdf(k, 0.5)
1.0437705563742128
julia> plot(0:0.01:1, pdf.(Ref(k), 0:0.01:1))
Sorry, but I can't find anything about normalization in the StatsBase docs.
Thank you very much, I will test this package. Just one more thing: would it also be possible to estimate the CDF with this package? I'm reading its (admittedly tiny) documentation, but it doesn't say anything about the CDF. Thank you in advance.
julia> r = 0:0.01:1
0.0:0.01:1.0
julia> x = pdf.(Ref(k), r);      # density evaluated on the grid
julia> cdfx = cumsum(x)/sum(x);  # normalized running sum approximates the CDF
julia> plot(r, cdfx)
ecdf from StatsBase would be another option, I think.
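For example (reusing the d and r from above):
using StatsBase
F = ecdf(d)     # empirical CDF, returned as a callable object
F(0.5)          # estimated P(X ≤ 0.5)
plot(r, F.(r))  # evaluate on a grid, e.g. to compare with the KDE-based CDF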
Just one more question to understand your code: why did you write pdf.(Ref(k), r) instead of just pdf.(k, r)? Is it something related to efficiency? I was reading the documentation of Ref, but it is not clear to me what it does (I'm not a programmer, just a casual one).
It is documented in
https://docs.julialang.org/en/v1/manual/arrays/#Broadcasting-1
I agree that the documentation could be improved, e.g. in the docstring of Ref, so I made
https://github.com/JuliaLang/julia/pull/32645
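A tiny example of the scalar behavior:
# Ref([2, 4]) makes broadcast treat the array as a single element,
# so each element of 1:5 is tested against the whole collection:
in.(1:5, Ref([2, 4]))  # [false, true, false, true, false]
# Without Ref, broadcast would try to match 1:5 against the two
# elements of [2, 4] and throw a DimensionMismatch.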
However, I still don't understand what Ref does in the code of @aaowens; couldn't we just use pdf.(k, r) instead? You said that Ref in broadcasting makes the wrapped argument be treated as a scalar, but I don't know what that really means for the code above.
I see the problem. You can also use pdf(k, r) in this case, because a vectorized method is defined for pdf. In fact, you should do that instead, because it is a lot faster here.
By default these vectorized methods aren't defined and we just broadcast over the scalar version. In some cases (like here) a vectorized method is defined which may be faster: I think there is some fixed setup cost before evaluating the pdf, and the broadcast version pays that cost many times.
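For example (the exact timings will vary):
using KernelDensity
k = kde(randn(1000))
r = 0:0.001:1
pdf.(Ref(k), r)  # broadcasts the scalar method, paying the setup cost at every point
pdf(k, r)        # one call to the vectorized method, paying it once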