Best way to discretize continuous variable

danielw2904 · May 31, 2021, 8:08am

I have a continuous variable that I would like to discretize based on the sample quantiles (e.g. top 1%, >1% to top 10%, etc.). My intuition was to use the following:

using StatsBase, CategoricalArrays # categorical comes from CategoricalArrays
x = sort(100 .* rand(100)); # sort just to see equivalent values in the same place
xf = ecdf(quantile(x, [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]))
xc = categorical(round.(xf.(x), digits = 5))

I round the values to hopefully get rid of floating-point inaccuracy, but this seems a bit hacky. Is there a better way?

Tamas_Papp · May 31, 2021, 9:31am

I am not sure what you mean here; they are floats, so in practice they will not coincide.

In any case, I would use something like

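using StatsBase, CategoricalArrays
x = sort(100 .* rand(100))
# include the 0.0 and 1.0 quantiles so the breaks span the whole sample
breaks = quantile(x, [0.0, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99, 1.0])
# extend = true adjusts the outer breaks so that every value of x lands in a bin
xc = cut(x, breaks; extend = true)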
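
For readers who want to see what this binning does without CategoricalArrays, here is a minimal sketch using only Base and the Statistics standard library. It is not the code from the reply above: it swaps `cut` for `searchsortedlast`, and the names `probs`, `edges`, `bins` and `counts` are chosen here for illustration. Each value gets an integer bin index, so no rounding of floats is needed.

```julia
using Statistics  # standard library; provides `quantile`

x = sort(100 .* rand(100))
probs = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]  # interior cut points only
edges = quantile(x, probs)

# searchsortedlast returns how many edges are <= the value (0 if none),
# so adding 1 gives a bin index from 1 to length(edges) + 1.
bins = searchsortedlast.(Ref(edges), x) .+ 1

# Tally of observations per bin, for inspection.
counts = [count(==(k), bins) for k in 1:length(edges) + 1]
```

Bin 1 holds the values below the 1% quantile and bin 8 the values at or above the 99% quantile. If a categorical vector is wanted, `bins` could then be passed to `categorical` from CategoricalArrays.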

danielw2904 · May 31, 2021, 1:15pm

I meant that in the discretized form the values will be equivalent even if they are not equal to begin with.

Thank you, that works great!
