Is there a function that extracts the index from an element of a CategoricalArray created by the function `cut`?

The function cut in the CategoricalArrays package returns a CategoricalArray object, as in this example:

x = rand(10)
ct = cut(x, 3)

"Q3: [0.735346, 0.972399]"
 "Q2: [0.335188, 0.735346)"
 "Q1: [0.00586079, 0.335188)"
 "Q3: [0.735346, 0.972399]"
 "Q3: [0.735346, 0.972399]"
 "Q1: [0.00586079, 0.335188)"
 "Q3: [0.735346, 0.972399]"
 "Q1: [0.00586079, 0.335188)"
 "Q2: [0.335188, 0.735346)"
 "Q2: [0.335188, 0.735346)"

One can convert the elements of the CategoricalArray object into strings:

string(ct[1])
"Q3: [0.735346, 0.972399]"

It is easy to write a function that would extract the index from such a string:

foo(ct[1])
3

or alternatively return the interval:

foo(ct[1])
(0.735346, 0.972399) # a tuple, not a string

Given the latter function, another function that checks whether a value lies in the interval could easily be written:

bar(0.8, ct)
true

bar(2.0, ct)
false
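
A minimal sketch of the string-parsing approach described above, assuming the labels always have the form "Qn: [lo, hi)" or "Qn: [lo, hi]" as in the printed output. The names `label_index`, `label_interval`, and `in_label` are hypothetical stand-ins for `foo` and `bar`; they take a label string such as `string(ct[1])` and are not part of CategoricalArrays.

```julia
# Pattern for labels such as "Q3: [0.735346, 0.972399]" (assumed format).
const LABEL_RE = r"^Q(\d+): \[([^,]+), ([^\])]+)([\])])$"

# Return the group index encoded in the label, e.g. 3 for "Q3: ...".
function label_index(label::AbstractString)
    m = match(LABEL_RE, label)
    m === nothing && throw(ArgumentError("unrecognized label: $label"))
    return parse(Int, m.captures[1])
end

# Return (lower, upper, upper_is_closed) parsed from the label.
function label_interval(label::AbstractString)
    m = match(LABEL_RE, label)
    m === nothing && throw(ArgumentError("unrecognized label: $label"))
    lo = parse(Float64, m.captures[2])
    hi = parse(Float64, m.captures[3])
    return (lo, hi, m.captures[4] == "]")
end

# Check whether `value` lies in the interval described by the label.
function in_label(value::Real, label::AbstractString)
    lo, hi, closed = label_interval(label)
    return lo <= value && (closed ? value <= hi : value < hi)
end
```

Because the bounds printed in the labels are rounded, a value very close to a boundary may be classified differently here than by cut itself.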

I hope such functions exist and are faster than the functions I would write. However, I could not find references to either one. Do such functions exist?

I think you want levelcode
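
For reference, a minimal sketch of what levelcode gives, assuming CategoricalArrays is installed. It returns the integer position of a value's level within `levels(ct)`, which covers the "index of an element" part of the question.

```julia
using CategoricalArrays

x = rand(10)
ct = cut(x, 3)            # three quantile groups, as in the question

code = levelcode(ct[1])   # integer index of the first element's level
codes = levelcode.(ct)    # broadcast over the array to get every code
labels = levels(ct)       # level labels, ordered so that labels[code] is the level of ct[1]
```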

Thanks. But as far as I can gather, levelcode only accepts a CategoricalValue object (or Missing) as its argument. I need to find the interval in which an arbitrary Float64 lies.

I realized that my original question was not clear. My apologies. I have rewritten the question and hope that makes things clearer.

I think it’s still not clear, because you seem to want that information from the ct array elements. I think, instead, you mean to be passing x[1] to your proposed functions. Anyway, the bounds of the categories do not seem to be preserved in the CategoricalArray. But this is the line where they are calculated: https://github.com/JuliaData/CategoricalArrays.jl/blob/fad8dc4d7fa5e6bf9ea30cd098e0e4cc99ff5eeb/src/extras.jl#L241. You could call quantile yourself and use its results.
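
A minimal sketch of that suggestion, using only the Statistics standard library. It assumes the group boundaries are the minimum, the interior quantiles, and the maximum of x, which mirrors the labels shown in the question but is not guaranteed to match cut exactly. `quantile_breaks` and `find_bin` are hypothetical helper names.

```julia
using Statistics

# Boundaries for `ngroups` quantile groups: minimum, interior quantiles, maximum.
function quantile_breaks(x::AbstractVector{<:Real}, ngroups::Integer)
    inner = quantile(x, (1:ngroups-1) ./ ngroups)
    return [minimum(x); inner; maximum(x)]
end

# Index of the interval containing `value`, or 0 if it lies outside all of them.
# Intervals are [b1, b2), [b2, b3), ..., with the last one closed on the right.
function find_bin(value::Real, breaks::AbstractVector{<:Real})
    (value < first(breaks) || value > last(breaks)) && return 0
    # searchsortedlast is a binary search for the last index i with breaks[i] <= value
    return min(searchsortedlast(breaks, value), length(breaks) - 1)
end

x = rand(10)
breaks = quantile_breaks(x, 3)
find_bin(0.8, breaks)   # 1, 2, or 3 if 0.8 is within the range of x, otherwise 0
find_bin(2.0, breaks)   # 0, since rand never returns a value above 1
```

The lookup costs one binary search per value, so the sort inside quantile remains the dominant cost.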

You are right that I could call quantile, but then it would be called twice: I also need to call cut, and cut calls quantile. I wanted to avoid that for efficiency reasons, since calling quantile is expensive.

Since cut calls quantile, there could be a version of cut that also returns the return value of quantile. Alternatively, there could be a function that does what I was asking for: extract the return value of quantile from the return value of cut.

Still another alternative would be to have a version of cut that would have a parameter breaks (replacing ngroups). Then one would first call quantile and its return value would be used as the argument for the breaks parameter when calling this version of cut. Maybe the easiest way would be to write such a version of cut. Something along these lines:


function mycut(v::Vector{Float64}, breaks::Vector{Float64})::Vector{Int64}
    # Strictly increasing means sorted with no repeated values.
    @assert issorted(breaks) && allunique(breaks) "Argument `breaks` should be strictly monotonically increasing"

    @assert breaks[1] <= minimum(v) "The first element of `breaks` should be less than or equal to the smallest element of `v`."
    @assert breaks[end] >= maximum(v) "The last element of `breaks` should be greater than or equal to the greatest element of `v`."

    # 0 marks an element that has not been assigned to a bin.
    y = fill(0, length(v))

    lenb = length(breaks)

    for i in eachindex(v)
        # Bin b - 1 is the half-open interval [breaks[b - 1], breaks[b]).
        for b in 2:lenb
            if v[i] < breaks[b]
                y[i] = b - 1
                break
            end
        end
        # A value equal to breaks[end] matches no bin and keeps the code 0.
    end

    return y
end

It would be used in this way:

x = rand(20);
mycut(x, [minimum(x), 0.2, 0.4, 0.6, 0.8, maximum(x)])
20-element Vector{Int64}:
 3
 2
 1
 1
 2
 4
 4
 2
 4
 4
 0
 1
 2
 2
 2
 3
 3
 5
 3
 4

There are already methods for cut in CategoricalArrays that accept breaks. If you need both range checking on the breaks and the categorical array, you could create a higher level type that knows both and provides the combined functionality.
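
A minimal sketch of such a higher-level type, assuming CategoricalArrays is installed and that its `cut(x, breaks; extend = true)` method closes the last interval on the right so that the maximum is included. The type `BinnedData` and the functions `bin_quantiles` and `bin_of` are hypothetical names, not part of the package. The breaks are computed once and reused both for cut and for later lookups.

```julia
using CategoricalArrays
using Statistics

# Keeps the breaks together with the categorical array built from them.
struct BinnedData{C<:AbstractVector}
    breaks::Vector{Float64}
    values::C
end

# Compute the breaks once (minimum, interior quantiles, maximum), then pass them to cut.
function bin_quantiles(x::AbstractVector{<:Real}, ngroups::Integer)
    inner = quantile(x, (1:ngroups-1) ./ ngroups)
    breaks = Float64[minimum(x); inner; maximum(x)]
    return BinnedData(breaks, cut(x, breaks; extend = true))
end

# Interval index of an arbitrary value, or 0 if it lies outside the breaks.
function bin_of(b::BinnedData, value::Real)
    (value < first(b.breaks) || value > last(b.breaks)) && return 0
    return min(searchsortedlast(b.breaks, value), length(b.breaks) - 1)
end

x = rand(20)
binned = bin_quantiles(x, 3)
binned.values          # the CategoricalArray, as returned by cut
bin_of(binned, 0.8)    # interval index for an arbitrary Float64
```

This sketch assumes the quantile breaks are distinct; data with many ties would need extra handling.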