The function cut in the CategoricalArrays package returns a CategoricalArray object, as in this example:
x = rand(10)
ct = cut(x, 3)
"Q3: [0.735346, 0.972399]"
"Q2: [0.335188, 0.735346)"
"Q1: [0.00586079, 0.335188)"
"Q3: [0.735346, 0.972399]"
"Q3: [0.735346, 0.972399]"
"Q1: [0.00586079, 0.335188)"
"Q3: [0.735346, 0.972399]"
"Q1: [0.00586079, 0.335188)"
"Q2: [0.335188, 0.735346)"
"Q2: [0.335188, 0.735346)"
One can convert the elements of the CategoricalArray object into strings:
string(ct[1])
"Q3: [0.735346, 0.972399]"
It is easy to write a function that would extract the index from such a string:
foo(ct[1])
3
or alternatively return the interval:
foo(ct[1])
(0.735346, 0.972399) # a tuple, not a string
Given the latter function, another function that checks whether a value lies in the interval could easily be written:
bar(0.8, ct)
true
bar(2.0, ct)
false
I hope such functions exist and are faster than the functions I would write. However, I could not find references to either one. Do such functions exist?
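For reference, here is a minimal sketch of the string-parsing approach described above. The name parse_label and the regular expression are hypothetical, and the sketch assumes labels of exactly the form shown in the example ("Q3: [0.735346, 0.972399]"). Note that the bounds printed in a label are rounded, so the parsed interval is only approximate.

```julia
# Hypothetical helper: parse a label such as "Q3: [0.735346, 0.972399]"
# into the group index and the (rounded) interval bounds.
const LABEL_RE = r"^Q(\d+): \[([^,]+), ([^\]\)]+)[\]\)]$"

function parse_label(label::AbstractString)
    m = match(LABEL_RE, label)
    m === nothing && throw(ArgumentError("unrecognized label: $label"))
    index = parse(Int, m.captures[1])
    lower = parse(Float64, m.captures[2])
    upper = parse(Float64, m.captures[3])
    return (index = index, lower = lower, upper = upper)
end

# Check a value against the parsed bounds (both ends treated as closed here).
in_label(v::Real, label::AbstractString) =
    (p = parse_label(label); p.lower <= v <= p.upper)
```

With the array from the example, this would be called as parse_label(string(ct[1])).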
I think you want levelcode.
Thanks. But as far as I can gather, levelcode only accepts a CategoricalValue object (or Missing) as its argument. I need to find the interval in which an arbitrary Float64 lies.
I realized that my original question was not clear. My apologies. I have rewritten the question; I hope that makes things clearer.
I think it’s still not clear, because you seem to want that information from the elements of the ct array. I think you mean, instead, to pass x[1] to your proposed functions. Anyway, the bounds of the categories do not seem to be preserved in the CategoricalArray. But this is the line where they are calculated: https://github.com/JuliaData/CategoricalArrays.jl/blob/fad8dc4d7fa5e6bf9ea30cd098e0e4cc99ff5eeb/src/extras.jl#L241. You could call quantile yourself and use its results.
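A minimal sketch of that suggestion, using only the Statistics standard library. The helper name find_interval is hypothetical, and the choice of probabilities (1:ngroups-1) ./ ngroups is an assumption about how the quantile breaks are chosen, so check it against the linked source line before relying on it.

```julia
using Statistics

# Compute the interval boundaries once and keep them.
x = rand(10)
ngroups = 3
inner = quantile(x, (1:ngroups-1) ./ ngroups)   # assumed break positions
breaks = [minimum(x); inner; maximum(x)]

# Hypothetical helper: return (index, lower, upper) of the interval that
# contains v, or nothing if v lies outside the breaks. Intervals are
# left-closed and right-open, except the last one, which is closed.
function find_interval(breaks::AbstractVector{<:Real}, v::Real)
    (isnan(v) || v < first(breaks) || v > last(breaks)) && return nothing
    i = min(searchsortedlast(breaks, v), length(breaks) - 1)
    return (i, breaks[i], breaks[i + 1])
end

find_interval(breaks, x[1])   # interval containing the first element
find_interval(breaks, 2.0)    # nothing, since rand values lie in [0, 1)
```

Because breaks is sorted, searchsortedlast does a binary search, so each lookup costs O(log n) in the number of breaks.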
You are right that I could call quantile, but I think that would mean it would be called twice: I also need to call cut, and cut calls quantile. I wanted to avoid that for efficiency reasons, since calling quantile is expensive.
Since cut calls quantile, there could be a version of cut that would also return the return value of quantile. Alternatively, there could exist a function that would do what I was asking for: extract the return value of quantile from the return value of cut.
Still another alternative would be to have a version of cut that would have a parameter breaks (replacing ngroups). Then one would first call quantile, and its return value would be used as the argument for the breaks parameter when calling this version of cut. Maybe the easiest way would be to write such a version of cut. Something along these lines:
function mycut(v::Vector{Float64}, breaks::Vector{Float64})::Vector{Int64}
    # `breaks` must be strictly increasing and must span the range of `v`.
    @assert all(diff(breaks) .> 0) "Argument `breaks` should be strictly monotonically increasing"
    @assert breaks[1] <= minimum(v) "The first element of `breaks` should be less than or equal to the smallest element of `v`."
    @assert breaks[end] >= maximum(v) "The last element of `breaks` should be greater than or equal to the greatest element of `v`."
    y = fill(0, length(v))
    lenb = length(breaks)
    for i in eachindex(v)
        # Assign the index of the first interval [breaks[b-1], breaks[b]) containing v[i].
        # A value equal to breaks[end] matches no interval and keeps the code 0.
        for b in 2:lenb
            if v[i] < breaks[b]
                y[i] = b - 1
                break
            end
        end
    end
    return y
end
It would be used in this way:
x = rand(20);
mycut(x, [minimum(x), 0.2, 0.4, 0.6, 0.8, maximum(x)])
20-element Vector{Int64}:
3
2
1
1
2
4
4
2
4
4
0
1
2
2
2
3
3
5
3
4
There are already methods for cut in CategoricalArrays that accept breaks. If you need both range checking on the breaks and the categorical array, you could create a higher level type that knows both and provides the combined functionality.
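A rough sketch of such a wrapper type, so that quantile is called only once. The names BinnedVector, bin_index and contains_value are hypothetical. The sketch assumes that cut(x, breaks; extend=true) adds the outer breaks itself and closes the last interval on the right, and that the data have no ties at the quantiles; verify both against the CategoricalArrays version you use.

```julia
using CategoricalArrays
using Statistics

# Hypothetical wrapper that keeps the breaks next to the categorical array.
struct BinnedVector{C<:AbstractVector}
    breaks::Vector{Float64}   # full set of boundaries, including min and max
    values::C                 # result of cut
end

function BinnedVector(x::AbstractVector{<:Real}, ngroups::Integer)
    inner = quantile(x, (1:ngroups-1) ./ ngroups)   # quantile is called once
    breaks = Float64[minimum(x); inner; maximum(x)]
    # extend=true lets cut add the outer boundaries on its own
    return BinnedVector(breaks, cut(x, inner; extend=true))
end

# Range check for an arbitrary value.
contains_value(b::BinnedVector, v::Real) = first(b.breaks) <= v <= last(b.breaks)

# Index of the interval containing v, or nothing if v is out of range.
function bin_index(b::BinnedVector, v::Real)
    contains_value(b, v) || return nothing
    return min(searchsortedlast(b.breaks, v), length(b.breaks) - 1)
end

x = rand(20)
b = BinnedVector(x, 3)
bin_index(b, x[1])   # should agree with levelcode(b.values[1])
bin_index(b, 2.0)    # nothing
```

Storing the breaks alongside the array costs little memory and avoids parsing the rounded bounds back out of the level labels.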