Question to package authors of CategoricalArrays.jl
Is there a reason why you haven’t added more specialized methods to your package for, e.g., Base.Broadcasted.broadcasted ?
These are not passive-aggressive questions – I am literally wondering about what disadvantages motivated package authors not to do this. I see that you wrote specialized methods for broadcasted function for the ismissing function, which makes sense. Is there some tradeoff I’m not seeing for adding methods for generic functions? Alternatively, I would be happy to make a contribution to the package (even if it is a bit trivial).
Example:
using CategoricalArrays
rand_str(d::Integer) = prod(rand('a':'c', d)) |> string
N = 1_000_000
catarr = CategoricalArray([rand_str(10) for i in 1:N])
# would add method to Base.Broadcast.broadcasted in package instead of creating new function my_broadcasted, but my example is avoiding type piracy
function my_broadcasted(f::Function, ca::CategoricalArray, v)
on_levels = f.(ca.pool.levels, v)
typeof(on_levels) <: Union{BitVector, Vector{Bool}} && return ca.refs .|> r -> on_levels[r]
return _catarr_from_refs_levels(ca.refs, on_levels) # defined in next codeblock. Kind of a distraction but I do have questions here as well.
end
function my_broadcasted(f::Function, ca::CategoricalArray)
on_levels = f.(ca.pool.levels)
typeof(on_levels) <: Union{BitVector, Vector{Bool}} && return ca.refs .|> r -> on_levels[r]
return _catarr_from_refs_levels(ca.refs, on_levels)
end
@elapsed isequal.(catarr,"abcabcabca")
@elapsed my_broadcasted(isequal, catarr, "abcabcabca") # an order of magnitude faster
#contains.(catarr, "aaa") # will throw due to method not defined error on categorical value. But it's still applying to each categorical value instead of to the levels.
@elapsed my_broadcasted(contains, catarr, "aaa")
There’s no reason to repeatedly evaluate a function on a level, which is what base broadcasted will do. Instead, it makes a lot more sense to evaluate the function on each unique level and then map these results into the correct position on the arrays.
""" Why doesn't this function already exist as a CategoricalArray constructor?"""
function _catarr_from_refs_levels(refs::Array{I, D}, levs::Vector{S}; do_allowmissing::Bool=false, do_compress=true) where {I, S, D}
if do_compress
if length(levs) > typemax(I)
I2 = _smallest_int_type(levs)
refs = I2.(refs)
else
I2 = I
end
else
I2 = I
end
cat_pool = CategoricalPool{S, I2}(levs)
S2 = do_allowmissing ? Union{S, Missing} : S
cat_arr = CategoricalArray{S2, D}(refs, cat_pool)
return cat_arr
end
# relatedly, are there functions in base julia that accomplish the below intent? Not sure if I was reinventing the wheel.
"""
Convert integers to the smallest type that can accomodate all values
"""
function _smallest_int_type(_max::Unsigned)
byte_sizes = [(8, UInt8), (16, UInt16), (32, UInt32), (64, UInt64), (128, UInt128)]
for (c, t_c) in byte_sizes
_max < 2^c && return t_c
end
return throw(AssertionError("Largest integer $_max cannot be accomodated by UInt128"))
end
function _smallest_int_type(_max::Signed)
byte_sizes = [(8, Int8), (16, Int16), (32, Int32), (64, Int64), (128, Int128)]
for (c, t_c) in byte_sizes
_max < 2^c && return t_c
end
return throw(AssertionError("Largest integer $_max cannot be accomodated by Int128"))
end
_smallest_int_type(arr::AbstractArray{MS} where MS <: Union{Missing, S}; safety_factor=1.01) where S <: Signed =
(max(maximum(skipmissing(arr); init=S(0)), abs(minimum(skipmissing(arr); init=S(0))) )) |>
(x -> S(round(safety_factor * x))) |>
_smallest_int_type
_smallest_int_type(arr::AbstractArray{MU} where MU <: Union{Missing, U}) where U <: Unsigned =
maximum(skipmissing(arr); init=U(0)) |>
_smallest_int_type
_smallest_int_type(arr::AbstractArray{S} where S <: Union{String, AbstractFloat}) = arr |> unique |> length |> Unsigned |> _smallest_int_type # in this case, for determining ref type for levels of many categorical arrays
_smallest_int_type(arr::AbstractArray{Union{}}) = UInt8
_smallest_int_type(arr::Union{MissingVector, AbstractArray{<:Missing}}) = UInt8
times(x) = v -> v * x
#times("A").(catarr)
@elapsed my_broadcasted(times("A"), catarr)
When I do heavy data cleaning on large data sets, I write functions to accomplish the above. But it feels like functions I am writing should be built into the CategoricalArrays.jl package – otherwise, broadcasting is a performance footgun. As a user, I am either force to perform type piracy, or write my own helper functions and avoid broadcasting syntax.
Maybe the reason these functions do not exist is that it woudl make it harder to coerce the categorical array into e.g. a Vector{String}?
string.(catarr)
my_broadcasted(string, catarr) # if changed broadcast method, then no longer convert to Vector{String} type.
catarr |> Vector{String} # but can still do this, which seems more semantically appropriate.