CategoricalArrays.jl syntax question: vectorizing string operation

Is there a convenient syntax for efficiently vectorizing string operations on categorical arrays? I assume there must be one.

Suppose we want to subset a categorical array to the values whose labels satisfy some condition. Consider the examples:

using CategoricalArrays
strarr = ["bob", "job", "toby", "tim", "jon"] 
catarr = CategoricalArray(strarr)

contains.(strarr, "o") 
endswith.(strarr, "ob") 
strarr .== "bob"
strarr .∈ Ref(["bob", "tim"])

# contains.(catarr, "o") # does not work - and even though the intent is clear, it would be computationally inefficient if it did work.
# endswith.(catarr, "ob") # does not work - same commentary as above
catarr .== "bob" # works, but is probably not efficient.
catarr .∈ Ref(["bob", "tim"]) # works, but is probably not efficient.

In all of these cases, what you want vectorization to do is something like

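# Extending Base.contains for a whole array is for illustration only (see below).
function Base.contains(ca::CategoricalArray, x)
    # Evaluate the predicate once per level, keeping the matching reference codes.
    refs_contain = [i for (i, v) in enumerate(ca.pool.levels) if contains(v, x)]
    # Flag each element whose reference code belongs to a matching level.
    return ca.refs .∈ Ref(refs_contain)
end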

Strictly speaking, the semantics of the above function are wrong - contains should not be defined on a vector like the above. It should use the dot syntax.
However, the dot syntax will apply the function to each categorical value. If the function is slow, it would be better to evaluate it on the smaller array of levels and grab the corresponding references.
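
The level-wise evaluation described above can be sketched with the exported accessors `levels` and `levelcode` instead of the internal `pool` and `refs` fields. The helper name `map_levels` is hypothetical and not part of CategoricalArrays.jl. The sketch assumes a package version that exports `unwrap` and `levelcode`, and an array with no `missing` entries.

```julia
using CategoricalArrays

# Hypothetical helper (not part of CategoricalArrays.jl): call `f` once per
# level, then expand the per-level results to every element of the array.
# Assumes `ca` contains no `missing` values.
function map_levels(f, ca::CategoricalArray)
    lvls = unwrap.(levels(ca))         # raw (non-categorical) level values
    per_level = map(f, lvls)           # one call to `f` per level
    return per_level[levelcode.(ca)]   # look up the result for each element's level
end

strarr = ["bob", "job", "toby", "tim", "jon"]
catarr = CategoricalArray(strarr)

has_o = map_levels(s -> contains(s, "o"), catarr)
ends_ob = map_levels(s -> endswith(s, "ob"), catarr)
```

The function `f` runs once per level rather than once per element, so the saving grows with the ratio of elements to levels. Whether this beats plain broadcasting in practice would need benchmarking.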

Does there exist syntax that I have overlooked that executes this? It’s not hard to write my own functions to do this, but I have this nagging feeling that there should be a simple, consistent syntax in the CategoricalArrays package to get performant vectorization on categorical arrays.

I believe the conceptual model behind CategoricalArrays.jl is one in which values are atomic symbols. You can’t treat categorical values as if they were strings.

You have two options:

  1. Process strings in a vector of strings, relying on functions like contains, etc.
  2. Process categorical values using the CategoricalArrays.jl interface

Because I am a mere user of the package, this understanding may not reflect the conceptual model adopted by the authors. Others can help with additional input.

Aren’t you looking for PooledArrays.jl effectively? Unlike CategoricalArrays, it’s intended as a drop-in replacement for a regular array, just more efficient.
Not sure whether efficient mapping/broadcasting is implemented now, but it could easily be: then, stuff like contains.(arr, "o") would only run actual computations for unique values, not for each element.
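
As a rough sketch of the "compute only for unique values" idea, the same thing can be done by hand on a `PooledArray`. This assumes the `pool` and `refs` fields of the struct can be read directly; it does not rely on any built-in optimized `map` or broadcast, since the thread does not confirm whether one exists.

```julia
using PooledArrays

strarr = ["bob", "job", "toby", "tim", "jon", "bob", "tim"]
pa = PooledArray(strarr)

# `pa.pool` holds the unique values and `pa.refs` holds, for each element,
# its index into the pool. Reading these fields directly is an assumption
# about the struct layout, not a documented interface.
per_unique = contains.(pa.pool, "o")   # one call per unique value
has_o = per_unique[pa.refs]            # expand the results to every element
```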

For the record, this package is not one of the large number of array packages listed in JuliaArrays.

Yes, it’s maintained by the JuliaData organization, not the JuliaArrays organization. The reason for this is that it’s used heavily by the data packages and the maintainers mostly maintain other data packages.

Thanks. I know that CategoricalArrays and PooledArrays are similar. My impression is that PooledArrays is lower overhead, and CategoricalArrays has slightly more features (such as ordering levels). What are the tradeoffs using these two packages, which do similar things?

No. They aren’t interchangeable.

PooledArrays are for when you have a small set of values that are repeated many times in your vector. You want to reduce the memory footprint of that vector but have no other changes in behavior.

CategoricalArrays are for when you have categorical data. For example, your data might take the values 1, 2, 3, 4, where 1 represents “Car” and 2 represents “Bicycle”. You don’t want to add 1 and 2!
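
A small sketch of that distinction, using made-up transport labels: the pooled array behaves like an ordinary vector of strings, while the categorical array wraps each element and carries a declared level order. The variable names and the chosen order are illustrative assumptions.

```julia
using CategoricalArrays, PooledArrays

modes = ["Car", "Bicycle", "Car", "Bus"]

# PooledArray: same behavior as a plain Vector{String}, smaller footprint.
modes_pooled = PooledArray(modes)
modes_pooled == modes          # compares equal to the plain vector
modes_pooled[1] isa String     # elements are ordinary strings

# CategoricalArray: elements are CategoricalValue wrappers tied to the levels.
modes_cat = CategoricalArray(modes; ordered=true)
levels!(modes_cat, ["Bicycle", "Car", "Bus"])   # declare a custom level order
modes_cat[1] isa CategoricalValue
modes_cat[1] < modes_cat[4]    # "Car" < "Bus" under the declared order
```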

Really helpful, thanks!