CategoricalArrays.jl syntax question: vectorizing string operation

croberts · July 16, 2024, 3:26pm

I’m assuming that there exists convenient syntax to performantly vectorize string operations on categorical arrays?

Suppose we want to subset a categorical array to the values whose labels satisfy some condition. Consider the examples:

using CategoricalArrays
strarr = ["bob", "job", "toby", "tim", "jon"] 
catarr = CategoricalArray(strarr)

contains.(strarr, "o") 
endswith.(strarr, "ob") 
strarr .== "bob"
strarr .∈ [["bob", "tim"]]

# contains.(catarr, "o") # does not work - and, even though the intent is clear, it would be computationally inefficient if it did work.
# endswith.(catarr, "ob") # does not work - same commentary as above
catarr .== "bob" # works, but is probably not efficient.
catarr .∈ [["bob", "tim"]] # works, but is probably not efficient.

In all of these cases, what you want vectorization to do is something like

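function Base.contains(ca::CategoricalArray, x)
   # Indices of the levels whose label contains x (one check per level, not per element)
   refs_contain = [i for (i, v) in enumerate(ca.pool.levels) if contains(v, x)]
   # An element matches if its reference code is one of those indices
   return ca.refs .∈ [refs_contain]
end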

Strictly speaking, the semantics of the above function are wrong - contains should not be defined on a vector like the above. It should use the dot syntax.
However, the dot syntax will apply the function to each categorical value. If the function is slow, it would be better to evaluate it on the smaller array of levels and grab the corresponding references.
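
A minimal sketch of that idea using only exported functions, in case it helps make the question concrete. `level_mask` is a hypothetical helper name, not something CategoricalArrays.jl provides, and it assumes the array contains no `missing` values (for which `levelcode` returns `missing`).

```julia
using CategoricalArrays

# Hypothetical helper, not part of CategoricalArrays.jl.
# Evaluate the predicate once per level, then look the result up
# for every element through its integer level code.
function level_mask(pred, ca::CategoricalArray)
    # unwrap.() yields the raw labels whether levels() returns raw or wrapped values
    hits = map(pred, unwrap.(levels(ca)))   # one predicate call per level
    return [hits[c] for c in levelcode.(ca)]
end

catarr = CategoricalArray(["bob", "job", "toby", "tim", "jon"])
mask = level_mask(s -> contains(s, "o"), catarr)
subset = catarr[mask]   # elements whose label contains "o"
```

The predicate cost scales with the number of levels rather than the length of the array, which is the whole point when the predicate is slow and the levels are few.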

Does there exist syntax that I have overlooked that executes this? It’s not hard to write my own functions to do this, but I have this nagging feeling that there should be a simple, consistent syntax in the CategoricalArrays package to get performant vectorization on categorical arrays.

juliohm · July 16, 2024, 4:29pm

I believe the conceptual model behind CategoricalArrays.jl is one in which values are atomic symbols. You can’t treat categorical values as if they were strings.

You have two options:

Process strings in a vector of strings, relying on functions like contains, etc.
Process categorical values using the CategoricalArrays.jl interface

Because I am a mere user of the package, this understanding may not reflect the conceptual model adopted by the authors. Others can help with additional input.

aplavin · July 16, 2024, 4:32pm

Aren’t you looking for PooledArrays.jl effectively? Unlike CategoricalArrays, it’s intended as a drop-in replacement for a regular array, just more efficient.
Not sure whether efficient mapping/broadcasting is implemented now, but it could easily be: then, stuff like contains.(arr, "o") would only run actual computations for unique values, not for each element.
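
As far as I can tell, `map` on a `PooledArray` accepts a `pure` keyword that applies the function to each pooled value once. A small sketch, assuming that keyword behaves as documented; the call counter `calls` and the predicate `has_o` are made up for illustration.

```julia
using PooledArrays

pa = PooledArray(["bob", "job", "bob", "tim", "bob"])

calls = Ref(0)   # counts how often the predicate actually runs
has_o(s) = (calls[] += 1; contains(s, "o"))

# With pure=true the function is applied to the pool of unique values,
# not to every element of the array.
mask = map(has_o, pa; pure=true)
```

Comparing `calls[]` with `length(pa)` afterwards shows how many evaluations were saved. Note that `pure=true` is only safe when the function has no side effects that matter and returns the same result for equal inputs.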

rafael.guerra · July 16, 2024, 5:20pm

For the record, this package is not one of the large number of array packages listed in JuliaArrays.

pdeffebach · July 16, 2024, 7:57pm

Yes, it’s maintained by the JuliaData organization, not the JuliaArrays organization. The reason for this is that it’s used heavily by the data packages and the maintainers mostly maintain other data packages.

croberts · July 17, 2024, 2:59pm

Thanks. I know that CategoricalArrays and PooledArrays are similar. My impression is that PooledArrays is lower overhead, and CategoricalArrays has slightly more features (such as ordering levels). What are the tradeoffs using these two packages, which do similar things?

pdeffebach · July 17, 2024, 5:36pm

No. They aren’t interchangeable.

PooledArrays are for when you have a small set of numbers that are repeated many times in your vector. You want to reduce the memory footprint of that vector but have no other changes in behavior.

CategoricalArrays are for when you have categorical data. For example, if your data takes the values 1, 2, 3, 4, but 1 represents “Car” and 2 represents “Bicycle”. You don’t want to add 1 and 2!
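
A rough illustration of the difference, using made-up vehicle codes. The variable names are only for this example.

```julia
using CategoricalArrays, PooledArrays

codes = [1, 2, 1, 2]          # 1 = "Car", 2 = "Bicycle" (illustrative data)

pooled  = PooledArray(codes)  # elements are still plain integers
vehicle = categorical(codes)  # elements are categorical values

total = pooled[1] + pooled[2] # ordinary integer arithmetic works

# Arithmetic on categorical values is not defined, so this throws.
try
    vehicle[1] + vehicle[2]
catch err
    println(typeof(err))
end
```

So a PooledArray changes only the storage, while a CategoricalArray also changes what operations make sense on the elements.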

croberts · July 18, 2024, 2:53pm

Really helpful, thanks!
