I am tracking statistics of some simulations using the Quantile object from the amazing OnlineStats.jl
I am also interested in being able to plot the CDF of the variable being observed with the Quantile, but trying to do so directly with code like this
using OnlineStats
q = fit!(Quantile(), randn(10^4))
# Get the CDF points
y = range(0,1; length = 101)
x = map(z -> value(q, z), y)
will create a CDF which is not smooth.
I could also fit! my data through an OrderStats(100) object, but I have the impression that I should be able to reconstruct a smooth CDF from the Quantile (considering it holds a histogram with 500 bins) without the need of keeping a separate OrderStats just for plotting purposes.
I have seen that I can get something smoother by wrapping the Quantile's histogram in the Ash object (also from OnlineStats) to get a smoothed PDF and integrating that, but is there maybe a better method to get a smoothed CDF from my Quantile object?
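To make it concrete, this is roughly what I mean by integrating the ASH output (just a sketch; I access the internal histogram via q.eh and rely on value(ash) returning the grid and the smoothed pdf values):
using OnlineStats
q = fit!(Quantile(), randn(10^4))
# Wrap the internal histogram of the Quantile in an Average Shifted Histogram
ash = Ash(q.eh, 1)
x, y = value(ash) # grid points and smoothed pdf values
# Integrate the smoothed pdf (cumulative sum times the grid step) to approximate the CDF
cdf = cumsum(y) .* step(x)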
I am tagging @joshday as he is the main author of OnlineStats so he might already know the best approach
Quantiles are estimated from a histogram, which will have jumps. I think the right way to plot it would be Plots.plot(x, y, seriestype=:step). If you need a smoother estimate, you’ll need to add more bins in the histogram, e.g. Quantile(b = 5000) (default is 500).
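Something along these lines (untested sketch, assuming Plots.jl; the value(q, p) call is the same one from your snippet):
using OnlineStats, Plots
data = randn(10^4)
# Default 500-bin Quantile vs. a finer-binned one
q500 = fit!(Quantile(), data)
q5000 = fit!(Quantile(b = 5000), data)
p = range(0, 1; length = 101)
# Plot the estimated CDF as a step function so the histogram jumps stay visible
plot(map(z -> value(q500, z), p), p; seriestype = :step, label = "b = 500")
plot!(map(z -> value(q5000, z), p), p; seriestype = :step, label = "b = 5000")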
Thanks @tbeason, I actually found the Ash function from this other somewhat related post:
I understand that the AverageShiftedHistogram does something similar to the kernel density you mentioned, so using it and integrating its output was indeed the first thing I tried, as mentioned in the original post.
Do you think this approach is actually different from what you are suggesting?
I was actually trying out the different options in a Pluto notebook:
Notebook Code
# ╔═╡ 21f03ec8-4e76-4e7c-8cac-3e67d9e79e44
begin
    using OnlineStats
    using PlutoPlotly
    using Statistics
end
# ╔═╡ dff0f3f2-6706-4f60-a607-1fcff9b3b314
begin
    db2lin(x) = 10.0^(x/10)
    q = Quantile()
    o = OrderStats(100)
    a = db2lin.(randn(10^5) .* 2) # Let's create some lognormal variable
    # a = randn(10^4)
    fit!(q, a)
    fit!(o, a)
end
# ╔═╡ d78b749b-a739-4fcf-a5c2-1c7d41fc11f0
let
    d1 = let
        # We just build an Average Shifted Histogram for the smoothed pdf of the histogram used by Quantile
        ash = Ash(q.eh, 1)
        x, y = value(ash) # This extracts x and y for the smoothed pdf
        y = cumsum(y)
        # eltype to convert the step (which is TwicePrecision) to the actual type of the elements of y
        y *= eltype(y)(x.step) # We have to normalize by the step on x to have the sum to 1
        scatter(; x, y, name = "ASH")
    end
    d2 = let
        y = range(0, 1; length = 101)
        x = map(y) do y
            value(q, y)
        end
        scatter(; x, y, name = "Quantile", line_dash = :dash)
    end
    d3 = let
        y = range(0, 1; length = 101)
        x = map(y) do y
            quantile(o, y)
        end
        scatter(; x, y, name = "OrderStats", line_dash = :dot)
    end
    plot([d1, d2, d3], Layout(;
        template = "none",
        uirevision = 1,
        xaxis = attr(;
            title = "X"
        ),
        yaxis = attr(;
            title = "P{x <= X}"
        )
    ))
end
I see that Ash works for smoothing, but the actual curve from OrderStats is "closer" to the Quantile curve (see the zoomed-in example of the notebook's CDF plot below):
Thanks a lot for the answer @joshday
I get your point, but I'd like to avoid increasing the number of bins in the Quantile objects, as I potentially have thousands of them, and for everything except the "smoothness" of the ECDF plot I am more than fine with 500 bins.
I was just trying to find the best approach to smooth out the CDF extracted from the Quantile in post-processing, just before plotting.
Effectively, no. That is basically what I was suggesting. I do not know why it appears to have some bias. I could speculate (I think the smoothing likely inflates the tails of the density), but if it is important to you then you need to decide how to approach the tradeoff.
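If you want to check the tail effect numerically, a quick (untested) sketch reusing the objects from your notebook could look like this:
using OnlineStats, Statistics
db2lin(x) = 10.0^(x/10)
a = db2lin.(randn(10^5) .* 2) # same lognormal-ish variable as in the notebook
q = fit!(Quantile(), a)
o = fit!(OrderStats(100), a)
# Smoothed CDF from the ASH, as in the notebook
ash = Ash(q.eh, 1)
x, y = value(ash)
cdf = cumsum(y) .* step(x)
# Probability mass the smoothed CDF places above the 99th OrderStats percentile (should be ~0.01)
x99 = quantile(o, 0.99)
println(1 - cdf[findlast(<=(x99), x)])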
Thanks for confirming this. Indeed, I probably do not need more precision than what the Ash method gives me; I mostly posted this question to get input/suggestions on whether there were even better approaches.