Add more specialized methods to CategoricalArrays.jl?

This is a good suggestion. I think I didn’t implement an optimized broadcasted method because in most applications the function is relatively cheap to call, so the gain isn’t super large. If the function is really cheap and you have many levels compared to the number of elements, allocating a temporary vector holding the result (on_levels in your code) can even be slower than calling the function on each element. Of course there are cases where it’s much faster to call the function only once per level indeed, and that kind of optimization is at the core of CategoricalArrays, so it may be worth it and deserves benchmarking.

Another tricky point is that calling the function on an unused level might throw an error, e.g. because you try to look up the value in a dict which has keys only for used values. This shouldn’t be super common but it would be nice to fall back to the standard method in case of error to avoid this difference with plain Array.

Finally, compared to your my_broadcasted, I think an implementation in CategoricalArrays would have to:

  • call f on CategoricalValue objects instead of on levels themselves, i.e. on_levels = broadcast(i -> f(ca.pool[i]), 1:length(ca.pool))
  • choose the type of the returned array based on the type of on_levels, i.e. using similar(on_levels, size(ca))
  • if that type is CategoricalArray, use an optimized code path to compute refs directly, but without assuming that you can reuse refs from the original array and on_levels as levels, as there may be duplicates in on_levels
  • do not choose the type of integer reference codes based on the number of levels as it creates a type instability. We could have a compress argument to broadcast that does this, but in general it seems sufficient to use the same integer type as the input. The compress function has logic to do this already.

Note that the same approach should be used for map.

You’re welcome if you want to give it a try. But as you see it’s a bit complex, and it needs benchmarking to ensure we don’t end up worse than the current state in common use cases.

See also

1 Like