I would like to use kernel density estimation to estimate the cdf from data. One way to get the cdf is to numerically integrate the estimated pdf with quadgk, but the drawback is it is somewhat inefficient. Is there a more efficient method?
Thank you for your reply. I was considering ecdf in StatsBase.jl. The main downside is that it can be noticeably jagged, particularly in the tails. I don’t know if there is some smoothing that would be reasonable to apply. The main constraint would be that the smoothed curve is monotonically increasing.
Usually one estimates a distribution (with parametric, semi-parametric, or non-parametric methods), which then has a CDF.
Why do you need the CDF specifically? If you want to sample from the estimated distribution, there is usually a more direct way. E.g., for KDE, there is a simple two-step method using the kernel.
What I am ultimately trying to do is estimate the cumulative hazard function of a model without a closed-form cdf. The cumulative hazard function is -log(1 - F(x)), where F is the cdf. What I have found with the empirical cdf is that it can be somewhat unstable in the tails, where F(x) is close to 1 and small errors in F are magnified by the log. I could use something like linear interpolation, as nsajko suggested, or perhaps run more Monte Carlo simulations of the model to improve stability.
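To make the tail problem concrete, here is a minimal sketch of the ecdf-based approach. The exponential draws are only a stand-in for the model's Monte Carlo output, and `cumhaz` is a name made up for this example.

```julia
using StatsBase   # provides ecdf
using Random      # provides randexp and seed!

Random.seed!(1)
samples = randexp(1_000)      # placeholder for simulated data; Exp(1) has H(x) = x

F = ecdf(samples)             # callable empirical cdf
cumhaz(x) = -log1p(-F(x))     # H(x) = -log(1 - F(x))

cumhaz(1.0)                   # should land near the true value of 1
cumhaz(maximum(samples))      # Inf, because the ecdf is exactly 1 at the largest sample
```

The last line is the extreme case of the instability: near the largest observations 1 - F(x) is a small multiple of 1/n, so the estimate jumps sharply between neighbouring points.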
But I have been interested in exploring options with KDE. Can you tell me more about the two step method?
That said, if your tails have few observations (as tails usually do) but are relevant for your results, I would recommend a parametric approach to impose some structure on them. You can make it a mixture to fit the data better: e.g., start with a simple function, estimate, simulate to see where the discrepancies are, then extend until it fits.
Much appreciated. Thank you both for your recommendations. I will compare the lecture notes to the implementation in the Python library statsmodels, but it looks like that just uses numerical integration.
Here is one more option in case someone is interested. In this R library, a three-step process is used: estimate the kernel density, accumulate the density on a grid with a selected step size to approximate the cdf, then apply linear interpolation between grid points.
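The three steps could be sketched in Julia roughly as follows. This is not the R library's code, just my reading of the description; `kde_cdf` is a made-up name, and the sketch relies on `kde` from KernelDensity.jl returning an evenly spaced grid `k.x` with matching values `k.density`.

```julia
using KernelDensity

x = randn(200)                      # toy data
k = kde(x)                          # step 1: density evaluated on the grid k.x

# step 2: cumulative Riemann sum of the density, rescaled so it ends at 1
F = cumsum(k.density) .* step(k.x)
F ./= F[end]

# step 3: linear interpolation between grid points, clamped outside the grid
function kde_cdf(t)
    t <= first(k.x) && return 0.0
    t >= last(k.x) && return 1.0
    i = searchsortedlast(k.x, t)    # grid index with k.x[i] <= t < k.x[i+1]
    w = (t - k.x[i]) / step(k.x)    # fractional position inside the grid cell
    return (1 - w) * F[i] + w * F[i + 1]
end

kde_cdf(0.0)                        # roughly 0.5 for standard normal data
```

The result is monotone as long as the gridded density is non-negative, and the grid resolution plays the role of the step size.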
The KDE is basically a convolution of the kernel with the discrete distribution of the data. You can randomly select data points and then add a random value from the kernel to bulk up your dataset, then use the ECDF on the denser dataset to get a smoother ECDF.
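using Distributions # for Normal

# data is a small toy dataset; bulkdata collects the smoothed resamples
bulkedup = let data = rand(37), bulkdata = Float64[]
    for i in 1:10000
        # pick a random data point and add a random draw from the kernel
        push!(bulkdata, rand(data) + rand(Normal(0, 1)))
    end
    bulkdata
end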
Something like that. I'm using a Normal(0,1) kernel here, but use whatever kernel (and bandwidth) you like.
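For completeness, a sketch of the last step, feeding the bulked-up sample to `ecdf`. Taking the kernel standard deviation from `KernelDensity.default_bandwidth` is my own assumption rather than part of the suggestion above; any other bandwidth choice would slot in the same way.

```julia
using Distributions, KernelDensity, StatsBase

data = rand(37)                               # small toy dataset
h = KernelDensity.default_bandwidth(data)     # rule-of-thumb bandwidth (assumed choice)

# resample a data point, then jitter it with a draw from the kernel
bulk = [rand(data) + rand(Normal(0, h)) for _ in 1:10_000]

F_smooth = ecdf(bulk)                         # still a step function, but with much finer steps
F_smooth(0.5)
```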
After some digging, I came across the Nelson-Aalen estimator for cumulative hazard functions. My original motivation was to use the cdf as a simple way to compute the cumulative hazard function indirectly. This might be useful to someone in the future.
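For reference, a minimal sketch of the estimator for the simplest case where every event time is observed (no censoring). The function name `nelson_aalen` is made up here; this is not taken from any package.

```julia
# Nelson-Aalen estimate: H(t) = sum over distinct event times t_j <= t of d_j / n_j,
# where d_j is the number of events at t_j and n_j the number still at risk just before t_j.
function nelson_aalen(times::AbstractVector{<:Real})
    ts = sort(unique(times))
    H = Float64[]
    acc = 0.0
    at_risk = length(times)
    for t in ts
        d = count(==(t), times)   # events at this time (quadratic overall, fine for a sketch)
        acc += d / at_risk
        push!(H, acc)
        at_risk -= d
    end
    return ts, H
end

ts, H = nelson_aalen([2.0, 1.0, 3.0, 1.0])
# ts == [1.0, 2.0, 3.0]
# H  == [2/4, 2/4 + 1/2, 2/4 + 1/2 + 1/1] == [0.5, 1.0, 2.0]
```

Unlike -log(1 - F(x)) computed from the empirical cdf, this stays finite at the largest observation.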
In case of normal kernels, the main parameter that is fitted by KDE is actually the bandwidth b, i.e., standard deviation of the kernels. The actual density is then a mixture distribution of the form
p_{KDE}(x) = \sum_i \frac{1}{N} \mathcal{N}(x | x_i, b)
Thus, the following might work:
using Distributions
using KernelDensity
using Plots
x = rand(Normal(0, 1), 100)
b = KernelDensity.default_bandwidth(x)
fit1 = kde(x)
fit2 = MixtureModel(Normal.(x, b))
scatter(pdf(fit1, x), pdf.(fit2, x)) # these should be basically identical (points on the diagonal)
x = sort(x) # sort so that the cdf is plotted as a curve from left to right
plot(x, cdf.(fit2, x)) # Mixture model has cdf method ... note that broadcasting
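Tying this back to the original goal: with normal kernels the mixture cdf is just the average of the component normal cdfs, F(x) = (1/N) Σ_i Φ((x - x_i)/b), so no numerical integration is needed, and the cumulative hazard follows directly. A sketch under the same setup as above; `mix` and `cumhaz` are names made up for this example.

```julia
using Distributions, KernelDensity

x = rand(Normal(0, 1), 100)
b = KernelDensity.default_bandwidth(x)
mix = MixtureModel(Normal.(x, b))        # equal-weight mixture, one normal per data point

# the mixture cdf agrees with averaging the component cdfs by hand
t = 0.3
cdf(mix, t) ≈ sum(cdf.(Normal.(x, b), t)) / length(x)   # true

# cumulative hazard H(t) = -log(1 - F(t)), written with the complementary cdf
cumhaz(t) = -log(ccdf(mix, t))
cumhaz.(range(-2, 2; length=5))
```

Because the kernels have unbounded support, 1 - F(t) stays strictly positive in exact arithmetic, so the hazard is smooth and finite at the largest data point. Far beyond the data it is driven entirely by the kernel tails, so the earlier caveat about imposing structure on the tails still applies.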