What is the role of Ref in this function and why is it faster than the alternative?

This function converts a categorical vector to an index:

function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
    levs = levels(cat_vec)

    int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
    #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])

    return int_vec
end # convert_factor_to_index

An example:

julia> v = ["A", "B", "C"];

julia> convert_factor_to_index(categorical(v))
3-element Vector{Int64}:
 1
 2
 3

So the function works, but it is not clear to me why it needs the Ref (it does not work without it).

Also, it runs much slower with the following line of code:

int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])

I understand broadcasting is fast, but does Ref also help to speed things up?

Are you maybe looking for the levelcode function in CategoricalArrays?

julia> levelcode.(categorical(v))
3-element Vector{Int64}:
 1
 2
 3

To your other question: no, Ref is not speeding anything up. It is simply required to protect levs from being broadcast over, so that every call to findfirst receives the whole vector of levels. You could equally have wrapped it in a one-element tuple, like (levs,).
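
To see what that protection does, here is a minimal sketch on plain vectors, so CategoricalArrays is not needed. The names vals and levs are made up for illustration. Broadcasting treats Ref(levs) and (levs,) as a single value reused for every element, whereas a bare levs is walked element by element alongside the first argument.

```julia
levs = ["A", "B", "C"]          # hypothetical lookup table
vals = ["C", "A", "B", "A"]     # hypothetical values to look up

# Ref(levs) acts as a scalar: every findfirst call receives the whole vector.
with_ref = findfirst.(isequal.(vals), Ref(levs))

# A one-element tuple protects levs from broadcasting in the same way.
with_tuple = findfirst.(isequal.(vals), (levs,))

# Without protection, broadcasting tries to pair vals[i] with levs[i].
# Here the lengths differ (4 vs 3), so it throws a DimensionMismatch.
unprotected_fails = try
    findfirst.(isequal.(vals), levs)
    false
catch err
    err isa DimensionMismatch
end
```

Both protected forms return the same indices; the unprotected call never performs the intended lookup against the full set of levels.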

This doesn’t seem right:

int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])

It’s as if you tried to replace broadcasting with list comprehension, but something went awry. I think you meant something like this:

int_vec = [findfirst(isequal(cat), levs) for cat in cat_vec]

At least that is what the initial broadcast is equivalent to. The Ref merely protects the levs argument from being broadcast over.
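
As a sketch of that equivalence on plain vectors (vals and levs are illustrative names, not taken from the original post), the protected broadcast, the comprehension, and a map call each perform one findfirst per element against the whole of levs:

```julia
levs = ["A", "B", "C"]          # hypothetical levels
vals = ["B", "A", "C", "A"]     # hypothetical data

# Three spellings of the same per-element lookup.
by_broadcast     = findfirst.(isequal.(vals), Ref(levs))
by_comprehension = [findfirst(isequal(x), levs) for x in vals]
by_map           = map(x -> findfirst(isequal(x), levs), vals)
```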

Sifting through the available functions, I see that there are also pool() and refs() to get the data of interest from a categorical vector:

julia> using CategoricalArrays, BenchmarkTools

julia> function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
           levs = levels(cat_vec)
       
           int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
           #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
       
           return int_vec
       end # convert_factor_to_index
convert_factor_to_index (generic function with 1 method)

julia> v = ["A", "B", "C"]
3-element Vector{String}:
 "A"
 "B"
 "C"

julia> @btime convert_factor_to_index(categorical(v))
  378.325 ns (12 allocations: 944 bytes)
3-element Vector{Int64}:
 1
 2
 3

julia> @btime CategoricalArrays.refs(categorical($v))
  285.036 ns (10 allocations: 848 bytes)
3-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000003

julia> @btime CategoricalArrays.pool(categorical($v))
  286.594 ns (10 allocations: 848 bytes)
CategoricalPool{String, UInt32}(["A", "B", "C"])
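
Note that each timing above includes the cost of constructing the categorical vector. A rough sketch of how one might time only the conversion, assuming CategoricalArrays and BenchmarkTools are installed: build the vector once and interpolate it with $. refs is called with its module prefix in the transcript above, which suggests it is an internal function, so levelcode is probably the safer choice. Timings are omitted because they depend on the machine.

```julia
using CategoricalArrays, BenchmarkTools

v = ["A", "B", "C"]
cv = categorical(v)     # build the categorical vector once, outside the timing

# refs stores the integer codes as UInt32; convert if a Vector{Int} is needed.
codes = Int.(CategoricalArrays.refs(cv))

# Interpolating with $ means only the conversion itself is measured.
@btime levelcode.($cv)
@btime findfirst.(isequal.($cv), Ref(levels($cv)))
```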

The code I proposed works if one replaces ‘catv’ with ‘cat_vec’:

v = ["A", "B", "C"]
cat_vec = categorical(v)
levs = levels(cat_vec)
int_vec = findfirst.([isequal.(cat_vec)[i].(levs) for i in 1:length(levs)])
3-element Vector{Int64}:
 1
 2
 3

That is what I needed. Thanks.

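This is definitely less efficient than isequal(cat_vec[i]): isequal.(cat_vec) builds a whole vector of comparison functions on every iteration, only to pick out a single element of it.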
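
A small sketch of the comparison on plain vectors with a repeated value (vals and levs are illustrative names). It also shows a second issue with the pattern: looping over 1:length(levs) produces one result per level rather than one per element, so the two versions only agree when the input has exactly as many elements as levels.

```julia
levs = ["A", "B", "C"]
vals = ["B", "A", "C", "A"]     # "A" appears twice, so length(vals) != length(levs)

# Pattern from the post: rebuilds isequal.(vals) for every i and loops over levs.
per_level = findfirst.([isequal.(vals)[i].(levs) for i in 1:length(levs)])

# Index the element directly and loop over the values instead.
per_element = [findfirst(isequal(vals[i]), levs) for i in eachindex(vals)]

# per_level has 3 entries (the fourth value is never looked up);
# per_element has 4, one per input value.
```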
