Get smooth CDF from OnlineStats's Quantile


I am tracking statistics of some simulations using the Quantile object from the amazing OnlineStats.jl.

I would also like to plot the CDF of the variable observed by the Quantile, but doing so directly with code like this

using OnlineStats
q = fit!(Quantile(), randn(10^4))
# Get the CDF points
y = range(0,1; length = 101)
x = map(z -> value(q, z), y)

will create a CDF which is not smooth.

I could also fit! my data through an OrderStats(100) object, but I have the impression that I should be able to reconstruct a smooth CDF from a Quantile (considering it holds a histogram with 500 bins) without the need to keep a separate OrderStats just for plotting purposes.

I have seen that I can get something smoother by wrapping the Quantile’s histogram in the Ash object, also from OnlineStats, to get a smoothed PDF and then integrating that. Is there a better method to get a smoothed CDF from my Quantile object?

I am tagging @joshday as he is the main author of OnlineStats so he might already know the best approach :slight_smile:

Quantiles are estimated from a histogram, which will have jumps. I think the right way to plot it would be Plots.plot(x, y, seriestype=:step). If you need a smoother estimate, you’ll need to add more bins in the histogram, e.g. Quantile(b = 5000) (default is 500).
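For reference, the two suggestions in the reply above can be put side by side in a small sketch. It only reuses calls that already appear in this thread (fit!, Quantile(b = ...), value(q, z)); the variable names are made up for illustration, and :steppost is used as an explicit step series type in Plots.

```julia
using OnlineStats
using Plots

data = randn(10^4)
y = range(0, 1; length = 101)

# Default Quantile (500 bins) versus a finer histogram (5000 bins)
q_default = fit!(Quantile(), data)
q_fine = fit!(Quantile(b = 5000), data)

# Quantile values at each probability level, as in the original question
x_default = map(z -> value(q_default, z), y)
x_fine = map(z -> value(q_fine, z), y)

# Draw both as step functions so the jumps of the histogram are shown honestly
plt = plot(x_default, y; seriestype = :steppost, label = "b = 500")
plot!(plt, x_fine, y; seriestype = :steppost, label = "b = 5000")
```

The finer histogram costs more memory per Quantile object, which matters if many of them are kept around.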

If you wanted it to be actually smooth, I’d use a kernel density first and then integrate it to get the CDF.
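A minimal sketch of that idea in plain Julia, with no packages beyond the standard library. Everything here is an assumption made for illustration: the helper names kde_pdf and cdf_from_pdf are hypothetical, the histogram is built by hand as a stand-in for the one held by a Quantile, and the Gaussian kernel and the bandwidth h are arbitrary choices.

```julia
using Random

# Hypothetical helper: Gaussian kernel density estimate from binned data.
# `centers` are bin midpoints, `counts` the bin counts, `h` the bandwidth.
function kde_pdf(centers, counts, h, grid)
    w = counts ./ sum(counts)                 # bin weights summing to 1
    gauss(u) = exp(-u^2 / 2) / sqrt(2π)       # standard normal density
    return [sum(w .* gauss.((g .- centers) ./ h)) / h for g in grid]
end

# Hypothetical helper: cumulative trapezoidal integration of a sampled density.
function cdf_from_pdf(grid, pdf)
    cdf = zeros(length(grid))
    for i in 2:length(grid)
        cdf[i] = cdf[i-1] + (grid[i] - grid[i-1]) * (pdf[i] + pdf[i-1]) / 2
    end
    return cdf
end

# Stand-in histogram: 50 equal-width bins over the data range
Random.seed!(1)
data = randn(10^4)
nbins = 50
lo, hi = extrema(data)
width = (hi - lo) / nbins
counts = zeros(Int, nbins)
for x in data
    i = clamp(floor(Int, (x - lo) / width) + 1, 1, nbins)
    counts[i] += 1
end
centers = lo .+ width .* ((1:nbins) .- 0.5)

# Smooth the binned data, then integrate the density to get the CDF
h = 2 * width                                  # assumed bandwidth
grid = range(lo - 4h, hi + 4h; length = 400)
pdf = kde_pdf(centers, counts, h, grid)
cdf = cdf_from_pdf(grid, pdf)
```

Plotting cdf against grid gives a smooth curve. A larger h smooths more but also spreads the estimated distribution out, so the bandwidth is a tradeoff rather than a free choice.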

Thanks @tbeason, I actually found the Ash function from this other, somewhat related post:

I understand that the AverageShiftedHistogram does something similar to the kernel density you mentioned, so using it and integrating its output was indeed the first thing I tried, as mentioned in the original post.

Do you think this approach is actually different from what you are suggesting?

I was actually trying out the different options in a Pluto notebook:

Notebook Code
# ╔═╡ 21f03ec8-4e76-4e7c-8cac-3e67d9e79e44
begin
	using OnlineStats
	using PlutoPlotly
	using Statistics
end

# ╔═╡ dff0f3f2-6706-4f60-a607-1fcff9b3b314
begin
	db2lin(x) = 10.0^(x/10)
	q = Quantile()
	o = OrderStats(100)
	a = db2lin.(randn(10^5) .* 2) # Let's create some lognormal variable
	# a = randn(10^4)
	fit!(q, a)
	fit!(o, a)
end

# ╔═╡ d78b749b-a739-4fcf-a5c2-1c7d41fc11f0
begin
	d1 = let
		# We just build an Average Shifted Histogram for the smoothed pdf of the histogram used by Quantile
		ash = Ash(, 1)
		x, y = value(ash) # This extracts x and y for the smoothed pdf
		y = cumsum(y)
		# eltype to convert the step (which is TwicePrecision) to the actual type of the elements of y
		y *= eltype(y)(x.step) # We have to normalize by the step on x to have the sum to 1
		scatter(;x,y, name = "ASH")
	end
	d2 = let
		# Quantile values at 101 evenly spaced probability levels
		y = range(0, 1; length = 101)
		x = map(y) do y
			value(q, y)
		end
		scatter(;x, y, name = "Quantile", line_dash = :dash)
	end
	d3 = let
		# Same probability levels, evaluated on the OrderStats object
		y = range(0, 1; length = 101)
		x = map(y) do y
			quantile(o, y)
		end
		scatter(;x, y, name = "OrderStats", line_dash = :dot)
	end
	plot([d1, d2, d3], Layout(;
		template = "none",
		uirevision = 1,
		xaxis = attr(;
			title = "X"
		),
		yaxis = attr(;
			title = "P{x <= X}"
		)
	))
end

I see that Ash works for smoothing, but the curve from OrderStats is “closer” to the Quantile curve (see the zoomed-in example of the notebook's CDF plot below):

Thanks a lot for the answer @joshday
I get your point, but I’d like to avoid increasing the number of bins in the Quantile objects: I potentially have thousands of them, and for everything except the “smoothness” of the ECDF plot I am more than fine with 500 bins.

I was just trying to find the best way to smooth the CDF extracted from the Quantile in post-processing, just before plotting.

Effectively, no. That is basically what I was suggesting. I do not know why it appears to have some bias. I could speculate (I think the smoothing likely inflates the tails of the density), but if it is important to you then you need to decide how to approach the tradeoff.

Thanks for confirming this. Indeed, I probably do not need more precision than what the Ash method gives me. I posted this question mostly to get input or suggestions on whether there were even better approaches :slight_smile:
