Add more specialized methods to CategoricalArrays.jl?

Question to package authors of CategoricalArrays.jl

Is there a reason why you haven’t added more specialized methods to your package for, e.g., Base.Broadcasted.broadcasted ?

These are not passive-aggressive questions – I am literally wondering about what disadvantages motivated package authors not to do this. I see that you wrote specialized methods for broadcasted function for the ismissing function, which makes sense. Is there some tradeoff I’m not seeing for adding methods for generic functions? Alternatively, I would be happy to make a contribution to the package (even if it is a bit trivial).

Example:

using CategoricalArrays

   
rand_str(d::Integer) = prod(rand('a':'c', d)) |> string
N = 1_000_000
catarr = CategoricalArray([rand_str(10) for i in 1:N])         

# would add method to Base.Broadcast.broadcasted in package instead of creating new function my_broadcasted, but my example is avoiding type piracy
function my_broadcasted(f::Function, ca::CategoricalArray, v)
    on_levels = f.(ca.pool.levels, v)
    typeof(on_levels) <: Union{BitVector, Vector{Bool}} && return ca.refs .|> r -> on_levels[r]
    return _catarr_from_refs_levels(ca.refs, on_levels) # defined in next codeblock.  Kind of a distraction but I do have questions here as well. 
end 

function my_broadcasted(f::Function, ca::CategoricalArray)
    on_levels = f.(ca.pool.levels)
    typeof(on_levels) <: Union{BitVector, Vector{Bool}} && return ca.refs .|> r -> on_levels[r]
    return _catarr_from_refs_levels(ca.refs, on_levels)
end 

@elapsed isequal.(catarr,"abcabcabca")
@elapsed my_broadcasted(isequal, catarr, "abcabcabca") # an order of magnitude faster 

#contains.(catarr, "aaa") # will throw due to method not defined error on categorical value.  But it's still applying to each categorical value instead of to the levels.  
@elapsed my_broadcasted(contains, catarr, "aaa")

There’s no reason to repeatedly evaluate a function on a level, which is what base broadcasted will do. Instead, it makes a lot more sense to evaluate the function on each unique level and then map these results into the correct position on the arrays.

 """ Why doesn't this function already exist as a CategoricalArray constructor?""" 
function _catarr_from_refs_levels(refs::Array{I, D}, levs::Vector{S}; do_allowmissing::Bool=false, do_compress=true) where {I, S, D}
    if do_compress
        if length(levs) > typemax(I) 
            I2 = _smallest_int_type(levs)
            refs = I2.(refs)
        else 
            I2 = I
        end 
    else 
        I2 = I 
    end    
    cat_pool = CategoricalPool{S, I2}(levs)
    S2 = do_allowmissing ? Union{S, Missing} : S 
    cat_arr = CategoricalArray{S2, D}(refs, cat_pool)
    return cat_arr 
end 

# relatedly, are there functions in base julia that accomplish the below intent?  Not sure if I was reinventing the wheel. 
"""
Convert integers to the smallest type that can accomodate all values 
"""
function _smallest_int_type(_max::Unsigned)
    byte_sizes = [(8, UInt8), (16, UInt16), (32, UInt32), (64, UInt64), (128, UInt128)]
    for (c, t_c) in byte_sizes 
        _max < 2^c && return t_c 
    end  
    return throw(AssertionError("Largest integer $_max cannot be accomodated by UInt128"))
end 

function _smallest_int_type(_max::Signed)
    byte_sizes = [(8, Int8), (16, Int16), (32, Int32), (64, Int64), (128, Int128)]
    for (c, t_c) in byte_sizes 
        _max < 2^c && return t_c 
    end  
    return throw(AssertionError("Largest integer $_max cannot be accomodated by Int128"))
end 

_smallest_int_type(arr::AbstractArray{MS} where MS <: Union{Missing, S}; safety_factor=1.01) where S <: Signed = 
    (max(maximum(skipmissing(arr); init=S(0)), abs(minimum(skipmissing(arr); init=S(0))) ))   |> 
    (x -> S(round(safety_factor * x))) |> 
    _smallest_int_type

_smallest_int_type(arr::AbstractArray{MU} where MU <: Union{Missing, U}) where U <: Unsigned = 
    maximum(skipmissing(arr); init=U(0)) |> 
    _smallest_int_type

_smallest_int_type(arr::AbstractArray{S} where S <: Union{String, AbstractFloat}) = arr |> unique |> length |> Unsigned |> _smallest_int_type # in this case, for determining ref type for levels of many categorical arrays
_smallest_int_type(arr::AbstractArray{Union{}}) = UInt8
_smallest_int_type(arr::Union{MissingVector, AbstractArray{<:Missing}}) = UInt8

times(x) = v -> v * x
#times("A").(catarr)
@elapsed my_broadcasted(times("A"), catarr)

When I do heavy data cleaning on large data sets, I write functions to accomplish the above. But it feels like functions I am writing should be built into the CategoricalArrays.jl package – otherwise, broadcasting is a performance footgun. As a user, I am either force to perform type piracy, or write my own helper functions and avoid broadcasting syntax.

Maybe the reason these functions do not exist is that it woudl make it harder to coerce the categorical array into e.g. a Vector{String}?

string.(catarr)
my_broadcasted(string, catarr) # if changed broadcast method, then no longer convert to Vector{String} type. 
catarr |> Vector{String} # but can still do this, which seems more semantically appropriate. 

This is a good suggestion. I think I didn’t implement an optimized broadcasted method because in most applications the function is relatively cheap to call, so the gain isn’t super large. If the function is really cheap and you have many levels compared to the number of elements, allocating a temporary vector holding the result (on_levels in your code) can even be slower than calling the function on each element. Of course there are cases where it’s much faster to call the function only once per level indeed, and that kind of optimization is at the core of CategoricalArrays, so it may be worth it and deserves benchmarking.

Another tricky point is that calling the function on an unused level might throw an error, e.g. because you try to look up the value in a dict which has keys only for used values. This shouldn’t be super common but it would be nice to fall back to the standard method in case of error to avoid this difference with plain Array.

Finally, compared to your my_broadcasted, I think an implementation in CategoricalArrays would have to:

  • call f on CategoricalValue objects instead of on levels themselves, i.e. on_levels = broadcast(i -> f(ca.pool[i]), 1:length(ca.pool))
  • choose the type of the returned array based on the type of on_levels, i.e. using similar(on_levels, size(ca))
  • if that type is CategoricalArray, use an optimized code path to compute refs directly, but without assuming that you can reuse refs from the original array and on_levels as levels, as there may be duplicates in on_levels
  • do not choose the type of integer reference codes based on the number of levels as it creates a type instability. We could have a compress argument to broadcast that does this, but in general it seems sufficient to use the same integer type as the input. The compress function has logic to do this already.

Note that the same approach should be used for map.

You’re welcome if you want to give it a try. But as you see it’s a bit complex, and it needs benchmarking to ensure we don’t end up worse than the current state in common use cases.

See also

1 Like

Cool, thanks!

I might give it a try sometime this week or weekend. It’s about time that I contributed to Julia’s ecosystem.

I’ve never contributed to another person’s package before. What is the protocol for this? is there a way to add a package, branch it, use the branched version, modify code of branched version, commit and push it to the github repo?

Yes - you fork the package on Github, then dev your fork, commit your changes to that fork and create a pull request to upstream them.