I’m assuming there is a convenient syntax for performant, vectorized string operations on categorical arrays?
Suppose we want to subset a categorical array to the values whose labels satisfy some condition. Consider the examples:
using CategoricalArrays
strarr = ["bob", "job", "toby", "tim", "jon"]
catarr = CategoricalArray(strarr)
contains.(strarr, "o")
endswith.(strarr, "ob")
strarr .== "bob"
strarr .∈ [["bob", "tim"]]
# contains.(catarr, "o") # does not work; the intent is clear, but it would be computationally inefficient even if it did work.
# endswith.(catarr, "ob") # does not work; same commentary as above.
catarr .== "bob" # works, but is probably not efficient.
catarr .∈ [["bob", "tim"]] # works, but is probably not efficient.
In all of these cases, what you want vectorization to do is something like
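# Extend Base.contains explicitly; a bare `function contains` would clash with Base.
function Base.contains(ca::CategoricalArray, x)
    # Indices of the levels whose label satisfies the condition.
    refs_contain = [i for (i, v) in enumerate(ca.pool.levels) if contains(v, x)]
    # Keep the elements whose reference points at one of those levels.
    return ca.refs .∈ [refs_contain]
end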
Strictly speaking, the semantics of the above function are wrong: contains should not be defined on a whole array like this. It should be used with the dot syntax.
However, the dot syntax applies the function to every categorical value. If the function is slow, it would be better to evaluate it once per level, on the much smaller array of levels, and then pick out the elements whose references point at the matching levels.
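To make that concrete, here is a minimal sketch of the generic version I have in mind. The name `levelmask` is my own and is not part of CategoricalArrays. I am assuming a reasonably recent package version that provides `levelcode` and `unwrap`, and an array without missing values.

```julia
using CategoricalArrays

# Hypothetical helper, not part of CategoricalArrays: evaluate `pred` once per
# level, then map the per-level answers back onto the elements.
function levelmask(pred, ca::CategoricalArray)
    # `unwrap` returns plain values unchanged and strips a CategoricalValue
    # wrapper if there is one, so `pred` always sees the raw label.
    keep = [pred(CategoricalArrays.unwrap(l)) for l in levels(ca)]
    # `levelcode(v)` is the position of `v` in `levels(ca)`.
    # Assumption: `ca` contains no missing values.
    return [keep[levelcode(v)] for v in ca]
end

strarr = ["bob", "job", "toby", "tim", "jon"]
catarr = CategoricalArray(strarr)

# The predicate runs once for each of the 5 levels, not once per element.
mask = levelmask(s -> endswith(s, "ob"), catarr)
subset = catarr[mask]
```

The predicate is called `length(levels(ca))` times, however long the array is, and the rest is a Bool lookup per element. It uses the public `levels` and `levelcode` accessors rather than the `pool` and `refs` fields, which I take to be internal.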
Is there existing syntax that I have overlooked that does this? It’s not hard to write my own functions for it, but I have a nagging feeling that the CategoricalArrays package should offer a simple, consistent syntax for performant vectorization on categorical arrays.