Get smooth CDF from OnlineStats's Quantile

disberd · May 6, 2024, 8:50am

Hello,

I am tracking statistics of some simulations using the Quantile object from the amazing OnlineStats.jl.

I am also interested in plotting the CDF of the variable observed by the Quantile, but doing so directly with code like this

using OnlineStats
q = fit!(Quantile(), randn(10^4))
# Get the CDF points
y = range(0,1; length = 101)
x = map(z -> value(q, z), y)

will create a CDF which is not smooth.

I could also fit! my data through an OrderStats(100) object, but I have the impression that I should be able to reconstruct a smooth CDF from a Quantile (considering it holds a histogram with 500 bins) without needing to keep a separate OrderStats just for plotting purposes.

I have seen that I can get something smoother by wrapping the Quantile’s histogram in the Ash object, also from OnlineStats, to get a smoothed PDF and then integrating that. Is there a better method to get a smoothed CDF from my Quantile object?

I am tagging @joshday as he is the main author of OnlineStats, so he might already know the best approach.

joshday · May 6, 2024, 11:35am

Quantiles are estimated from a histogram, which will have jumps. I think the right way to plot it would be Plots.plot(x, y, seriestype=:step). If you need a smoother estimate, you’ll need to add more bins in the histogram, e.g. Quantile(b = 5000) (default is 500).
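A toy illustration of why quantiles read off a histogram come out as a staircase. This is a simplified lookup written for this example, not the OnlineStats implementation; `hist_quantile` and the bin values are hypothetical.

```julia
# Toy histogram: bin centers and counts (hypothetical numbers).
centers = [-1.5, -0.5, 0.5, 1.5]
counts  = [10, 40, 40, 10]

# Quantile lookup without interpolation: return the first bin center
# whose cumulative fraction reaches p.
function hist_quantile(centers, counts, p)
    cum = cumsum(counts) ./ sum(counts)
    i = findfirst(>=(p), cum)
    return centers[i]
end

ps = range(0, 1; length = 11)
xs = [hist_quantile(centers, counts, p) for p in ps]
# Many different p map to the same bin center, so plotting (xs, ps)
# gives flat runs and jumps. More bins make the steps smaller.
```

With only four bins, the eleven probabilities can land on at most four distinct x values, which is the same effect as 101 probabilities against 500 bins, just exaggerated.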

tbeason · May 6, 2024, 1:03pm

If you wanted it to be actually smooth, I’d use a kernel density first and then integrate it to get the CDF.
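A minimal sketch of that idea using only Base and the Statistics stdlib, assuming the raw samples are available, a Gaussian kernel, and a rule-of-thumb bandwidth. The helper names `gaussian_kde` and `cumtrapz` are made up for this example and are not part of OnlineStats.

```julia
using Statistics  # stdlib; only `std` is needed

# Gaussian kernel density estimate evaluated on the grid `xs`.
# `h` is the bandwidth, a tuning choice made by the user.
function gaussian_kde(data, xs, h)
    n = length(data)
    return [sum(exp(-0.5 * ((x - d) / h)^2) for d in data) / (n * h * sqrt(2π)) for x in xs]
end

# Cumulative trapezoidal integration of the density gives a CDF on the same grid.
function cumtrapz(xs, ys)
    out = zeros(length(xs))
    for i in 2:length(xs)
        out[i] = out[i-1] + (xs[i] - xs[i-1]) * (ys[i] + ys[i-1]) / 2
    end
    return out
end

data = randn(2_000)
h = 1.06 * std(data) * length(data)^(-1/5)  # rule-of-thumb bandwidth (an assumption)
xs = range(minimum(data) - 4h, maximum(data) + 4h; length = 400)
pdf_hat = gaussian_kde(data, xs, h)
cdf_hat = cumtrapz(xs, pdf_hat)  # smooth, nondecreasing, ends close to 1
```

The grid is padded by 4 bandwidths on each side so that almost all of the kernel mass is included in the integral. A larger `h` gives a smoother curve at the cost of more distortion.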

disberd · May 6, 2024, 1:27pm

Thanks @tbeason, I actually found the Ash function from another, somewhat related post.

I understand that the average shifted histogram (Ash) does something similar to the kernel density you mentioned, so using it and integrating its output was indeed the first thing I tried, as mentioned in the original post.

Do you think this approach is actually different from what you are suggesting?

I was actually trying out the different options in a Pluto notebook:

Notebook Code

# ╔═╡ 21f03ec8-4e76-4e7c-8cac-3e67d9e79e44
begin
	using OnlineStats
	using PlutoPlotly
	using Statistics
end

# ╔═╡ dff0f3f2-6706-4f60-a607-1fcff9b3b314
begin
	db2lin(x) = 10.0^(x/10)
	q = Quantile()
	o = OrderStats(100)
	a = db2lin.(randn(10^5) .* 2) # Let's create some lognormal variable
	# a = randn(10^4)
	fit!(q, a)
	fit!(o, a)
end

# ╔═╡ d78b749b-a739-4fcf-a5c2-1c7d41fc11f0
let
	d1 = let
		# We just build an Average Shifted Histogram for the smoothed pdf of the histogram used by Quantile
		ash = Ash(q.eh, 1)
		x, y = value(ash) # This extracts x and y for the smoothed pdf
		y = cumsum(y)
		# Multiply by the grid spacing so the cumulative sum approximates the integral and ends near 1
		y .*= step(x) # step(x) returns a plain number, so no manual conversion from TwicePrecision is needed
		scatter(;x,y, name = "ASH")
	end
	d2 = let
		y = range(0, 1; length = 101)
		x = map(y) do y
			value(q, y)
		end
		scatter(;x, y, name = "Quantile", line_dash = :dash)
	end
	d3 = let
		y = range(0, 1; length = 101)
		x = map(y) do y
			quantile(o, y)
		end
		scatter(;x, y, name = "OrderStats", line_dash = :dot)
	end
	plot([d1, d2, d3], Layout(;
		template = "none",
		uirevision = 1,
		xaxis = attr(;
			title = "X"
		),
		yaxis = attr(;
			title = "P{x <= X}"
		)
	))
end

I see that Ash works for smoothing, but the curve from OrderStats is “closer” to the Quantile curve (see the zoomed-in example of the notebook's CDF plot below):

disberd · May 6, 2024, 1:30pm

Thanks a lot for the answer, @joshday.
I get your point, but I’d like to avoid increasing the number of bins in the Quantile objects: I potentially have thousands of them, and for everything except the “smoothness” of the ECDF plot I am more than fine with 500 bins.

I was just trying to find the best approach to smooth the CDF extracted from the Quantile in post-processing, just before plotting.

tbeason · May 6, 2024, 1:38pm

Effectively, no. That is basically what I was suggesting. I do not know why it appears to have some bias. I could speculate (I think the smoothing likely inflates the tails of the density), but if it is important to you then you need to decide how to approach the tradeoff.
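To illustrate the general effect behind that speculation (not a diagnosis of the plot above): sampling from a kernel-smoothed density is the same as picking a data point at random and adding kernel noise, so the smoothed distribution is more spread out than the data. A minimal sketch, assuming a Gaussian kernel and a hypothetical bandwidth `h = 0.5`; Ash uses a different kernel, so the numbers will not match it.

```julia
using Statistics, Random

Random.seed!(1)
data = randn(10_000)
h = 0.5  # hypothetical bandwidth, chosen large to make the effect visible

# A draw from the Gaussian KDE of `data`: a random data point plus N(0, h^2) noise.
smoothed = rand(data, 100_000) .+ h .* randn(100_000)

# The variance of the smoothed draws is roughly var(data) + h^2,
# so both tails are pushed outward relative to the raw data.
(var(data), var(smoothed))
```

A smaller bandwidth shrinks this extra spread but also gives less smoothing, which is the tradeoff mentioned above.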

disberd · May 6, 2024, 1:41pm

Thanks for confirming this. Indeed, I probably do not need more precision than what the Ash method gives me. I posted this question mostly to get input on whether there were even better approaches.
