I would like to use kernel density estimation to estimate the cdf from data. One way to get the cdf is to numerically integrate the estimated pdf with quadgk, but the drawback is that this is somewhat inefficient. Is there a more efficient method?

Thank you for your reply. I was considering ecdf in StatsBase.jl. The main downside is that it can be noticeably jagged, particularly in the tails. I don’t know if there is some smoothing that would be reasonable to apply. The main constraint would be that the smoothed curve is monotonically increasing.

Usually one estimates a distribution (with parametric, semi-parametric, or non-parametric methods), which then has a CDF.

Why do you need the CDF specifically? If you want to sample from the estimated distribution, there is usually a more direct way. E.g. for KDE, there is a simple two-step method using the kernel.
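A minimal sketch of such a two-step sampler, assuming a Gaussian kernel (the bandwidth `h` is an arbitrary illustrative choice):

```julia
using Distributions

data = randn(100)   # stand-in for the observed dataset
h = 0.3             # assumed bandwidth, chosen for illustration

# Step 1: pick a data point uniformly at random.
# Step 2: add noise drawn from the kernel centered at that point.
sample_kde() = rand(data) + h * randn()

draws = [sample_kde() for _ in 1:10_000]
```

This draws exactly from the KDE's mixture representation without ever evaluating the density.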

What I am ultimately trying to do is estimate the cumulative hazard function of a model without a closed-form cdf. The cumulative hazard function is -log(1 - F(x)), where F is the cdf. What I have found with the empirical cdf is that it can be somewhat unstable in the tails. I could use something like the linear interpolation nsajko suggested, or perhaps run more Monte Carlo simulations of the model to improve stability.
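That relationship can be sketched directly from the empirical cdf (a minimal example using `ecdf` from StatsBase; the blow-up as F(t) → 1 is exactly the tail instability mentioned):

```julia
using StatsBase

x = randn(1_000)          # stand-in for Monte Carlo draws from the model
F = ecdf(x)               # empirical CDF
H(t) = -log(1 - F(t))     # cumulative hazard; diverges as F(t) → 1 in the tail
```

Note that `H` is infinite at and beyond the sample maximum, which is why the tail needs smoothing or more structure.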

But I have been interested in exploring options with KDE. Can you tell me more about the two step method?

That said, if your tails have few observations (as tails usually do) but are relevant for your results, I would recommend a parametric approach to impose some structure on them. You can make it a mixture to fit the data better: e.g. start with a simple function, estimate, simulate to see where the discrepancies are, then extend until it fits.
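A minimal sketch of the "start with a simple function" step, assuming a Gamma family as the illustrative candidate (the true generating distribution here is made up for the example):

```julia
using Distributions

data = rand(Gamma(2.0, 1.5), 1_000)   # stand-in for simulated model output
d = fit(Gamma, data)                  # MLE fit of the candidate family

# Cumulative hazard straight from the fitted distribution; using ccdf
# avoids the roundoff of computing 1 - cdf near the tail.
H(t) = -log(ccdf(d, t))
```

Comparing `cdf.(d, grid)` against the ECDF on a grid (or a Q-Q plot) would show where the discrepancies are, per the simulate-and-extend loop described above.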

Much appreciated. Thank you both for your recommendations. I will compare the lecture notes to the implementation in the Python library statsmodels, but it looks like that just uses numerical integration.

Here is one more option in case someone is interested. In this R library, a three-step process is used: estimate the kernel density, numerically integrate it to approximate the cdf on a grid with a selected step size, then apply linear interpolation between grid points.
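A rough Julia translation of that three-step recipe, assuming KernelDensity.jl for the density and `linear_interpolation` from Interpolations.jl (the Riemann sum is a deliberately crude stand-in for the integration step):

```julia
using KernelDensity, Interpolations

x = randn(500)

k = kde(x)                              # step 1: kernel density on a grid
F = cumsum(k.density) .* step(k.x)      # step 2: Riemann-sum cdf on that grid
F ./= F[end]                            # normalize so the cdf ends at 1
cdf_itp = linear_interpolation(k.x, F)  # step 3: interpolate between grid points
```

Because the density is nonnegative, the cumulative sum is automatically nondecreasing, so the interpolated cdf satisfies the monotonicity constraint mentioned earlier in the thread.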

The KDE is basically a convolution of the kernel with the discrete (empirical) distribution of the data. You can randomly select data points and add a random draw from the kernel to bulk up your dataset, then apply the ECDF to the denser dataset to get a smoother ECDF.

using Distributions  # for Normal

bulkedup = let data = rand(37)  # small dataset
    bulkdata = Float64[]
    for i in 1:10000
        # pick a random data point, then add noise drawn from the kernel
        push!(bulkdata, rand(data) + rand(Normal(0, 1)))
    end
    bulkdata
end

Something like that. I'm using a Normal(0, 1) kernel here, but use whatever kernel (and bandwidth) you like.

After some digging, I came across the Nelson-Aalen estimator for cumulative hazard functions. My original motivation was to use the cdf as a simple way to compute the cumulative hazard function indirectly. This might be useful to someone in the future.
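For reference, a sketch of the Nelson-Aalen estimator for fully observed (uncensored) event times: at each ordered event time the hazard increment is d_i / n_i, where d_i is the number of events at that time and n_i the number still at risk. This simplified version treats ties as separate events:

```julia
# Nelson-Aalen cumulative hazard for uncensored data (a sketch).
function nelson_aalen(times::AbstractVector{<:Real})
    t = sort(times)
    n = length(t)
    increments = 1 ./ (n:-1:1)   # one event each; risk set shrinks by one
    return t, cumsum(increments)
end
```

Handling right-censoring would only require dropping the increment for censored observations while still shrinking the risk set.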

In the case of normal kernels, the main parameter fitted by KDE is actually the bandwidth b, i.e., the standard deviation of the kernels. The actual density is then a mixture distribution of the form

p_{KDE}(x) = \sum_i \frac{1}{N} \mathcal{N}(x | x_i, b)

Thus, the following might work:

using Distributions
using KernelDensity
using Plots
x = rand(Normal(0, 1), 100)
b = KernelDensity.default_bandwidth(x)
fit1 = kde(x)
fit2 = MixtureModel(Normal.(x, b))
scatter(pdf(fit1, x), pdf.(fit2, x)) # these should be basically identical
x = sort(x)
plot(x, cdf.(fit2, x)) # MixtureModel has a cdf method; note the broadcasting over x
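Tying this back to the original goal: since the mixture behaves like any other Distributions.jl distribution, the cumulative hazard follows directly (a sketch with an arbitrary stand-in bandwidth; `ccdf` avoids computing 1 - cdf near the tail):

```julia
using Distributions

x = rand(Normal(0, 1), 100)
b = 0.2                             # stand-in bandwidth for illustration
fit2 = MixtureModel(Normal.(x, b))  # KDE as an explicit normal mixture
H(t) = -log(ccdf(fit2, t))          # cumulative hazard from the mixture
```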