# Best way to discretize a continuous variable

I have a continuous variable that I would like to discretize based on the sample quantiles (e.g. the top 1%, the range from the top 1% to the top 10%, etc.). My first attempt was the following:

``````
using StatsBase, CategoricalArrays
x = sort(100 .* rand(100)) # sort just to see equivalent values in the same place
xf = ecdf(quantile(x, [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99])) # ECDF over the quantile cut points
xc = categorical(round.(xf.(x), digits = 5)) # round so near-equal floats share one level
``````

I round the values to guard against floating-point inaccuracy, but this seems a bit hacky. Is there a better way?

I am not sure what you mean here: they are floats, so in practice they will not coincide.

In any case, I would use something like:

``````
using StatsBase, CategoricalArrays
x = sort(100 .* rand(100))
breaks = quantile(x, [0.0, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99, 1.0])
xc = cut(x, breaks; extend = true)
``````
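If readable level names are wanted, `cut` also accepts a `labels` keyword. Here is a minimal sketch, assuming one label per interval. The label strings and the evenly spaced toy data are made up for illustration, and `levelcode` is used only to count how many observations land in each bin.

```julia
using Statistics, CategoricalArrays

# Toy data (an assumption for illustration): 1.0, 2.0, ..., 1000.0
x = collect(1.0:1000.0)
probs = [0.0, 0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99, 1.0]
breaks = quantile(x, probs)

# Hypothetical names, one per interval (length(breaks) - 1 of them)
bin_labels = ["bottom 1%", "1-10%", "10-25%", "25-50%",
              "50-75%", "75-90%", "90-99%", "top 1%"]
xc = cut(x, breaks; labels = bin_labels, extend = true)

# Integer position of each observation's level, then the size of each bin
codes = levelcode.(xc)
bin_counts = [count(==(i), codes) for i in 1:length(bin_labels)]
```

With this toy input the first and last bins each hold 10 of the 1000 values. Since the breaks already start at the minimum and end at the maximum, `extend = true` mainly serves to close the last interval on the right so that the maximum is included.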

I meant that in the discretized form the values will be equivalent, even if they were not equal to begin with.
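That behaviour is easy to check directly. A small sketch with made-up numbers (not from the code above): two unequal floats that fall in the same interval compare equal once they have been cut.

```julia
using Statistics, CategoricalArrays

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]  # illustrative data
breaks = quantile(x, [0.0, 0.5, 1.0])                     # min, median, max
xc = cut(x, breaks; extend = true)

same_raw = x[1] == x[2]       # false: the raw floats differ
same_bin = xc[1] == xc[2]     # true: both lie below the median
other_bin = xc[1] == xc[end]  # false: the maximum is in the upper interval
```

No rounding is needed here, because the comparison is between category levels rather than between floating-point values.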

Thank you, that works great!