Add more specialized methods to CategoricalArrays.jl?

nalimilan · September 4, 2024, 5:11pm

This is a good suggestion. I think I didn’t implement an optimized broadcasted method because in most applications the function is relatively cheap to call, so the gain isn’t super large. If the function is really cheap and you have many levels compared to the number of elements, allocating a temporary vector holding the result (on_levels in your code) can even be slower than calling the function on each element. Of course there are cases where it’s much faster to call the function only once per level indeed, and that kind of optimization is at the core of CategoricalArrays, so it may be worth it and deserves benchmarking.

Another tricky point is that calling the function on an unused level might throw an error, e.g. because you try to look up the value in a dict which has keys only for used values. This shouldn’t be super common but it would be nice to fall back to the standard method in case of error to avoid this difference with plain Array.

Finally, compared to your my_broadcasted, I think an implementation in CategoricalArrays would have to:

call f on CategoricalValue objects instead of on levels themselves, i.e. on_levels = broadcast(i -> f(ca.pool[i]), 1:length(ca.pool))
choose the type of the returned array based on the type of on_levels, i.e. using similar(on_levels, size(ca))
if that type is CategoricalArray, use an optimized code path to compute refs directly, but without assuming that you can reuse refs from the original array and on_levels as levels, as there may be duplicates in on_levels
do not choose the type of integer reference codes based on the number of levels as it creates a type instability. We could have a compress argument to broadcast that does this, but in general it seems sufficient to use the same integer type as the input. The compress function has logic to do this already.

Note that the same approach should be used for map.

You’re welcome if you want to give it a try. But as you see it’s a bit complex, and it needs benchmarking to ensure we don’t end up worse than the current state in common use cases.

Efficient broadcast comparison

opened 06:24PM - 29 Jun 18 UTC

cstjean

On 0.6, ```julia julia> heys1 = fill("Hey", 10000); julia> heys3 = Catego…ricalVector(heys1); julia> @btime heys1 .== "Hey"; 62.509 μs (21 allocations: 6.19 KiB) julia> @btime heys3 .== "Hey"; 101.900 μs (21 allocations: 6.19 KiB) ``` Theoretically the second comparison should be faster, since it should boil down to comparing integers (or, in this case, realizing that no value in the pool is `==`). Is there a way of implementing this? I'm not familiar with broadcasting innards.

Topic		Replies	Views
CategoricalArrays.jl syntax question: vectorizing string operation Data question , categoricalarrays	7	213	July 18, 2024
What is the role of Ref in this function and why is it faster than the alternative? General Usage question	6	386	June 14, 2023
Categorical to Integer values? New to Julia	16	5312	March 18, 2022
How to index a `CatagoricalArray` then make a new array with the same levels Data	4	573	August 15, 2018
Invalidation free copyto! depending on destination? General Usage	1	271	September 27, 2022

Add more specialized methods to CategoricalArrays.jl?

Related topics