# What is the role of Ref in this function and why is it faster than the alternative?

This function converts a categorical vector to an index:

``````
function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
    levs = levels(cat_vec)

    int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
    #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])

    return int_vec
end # convert_factor_to_index
``````

An example:

``````
julia> v = ["A", "B", "C"];

julia> convert_factor_to_index(categorical(v))
3-element Vector{Int64}:
1
2
3
``````

So the function works, but I am not clear on why it needs the `Ref` (it does not work without it).

Also, it runs much slower with the following line of code:

``````
int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
``````

I understand broadcasting is fast, but does `Ref` also help to speed things up?

Are you maybe looking for the `levelcode` function in `CategoricalArrays`?

``````
julia> levelcode.(categorical(v))
3-element Vector{Int64}:
1
2
3
``````

To your other question: no, `Ref` is not speeding anything up. It is simply required to protect `levs` from being broadcast over, so that each call to `findfirst` receives the whole vector. You could equally have wrapped it in a one-element tuple, `(levs,)`.
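To make the role of `Ref` concrete, here is a minimal sketch using plain strings rather than a categorical vector (the names `vals` and `levs` are made up for illustration). It compares the `Ref` and one-element-tuple forms and shows what happens when the wrapper is dropped.

```julia
# Illustrative data: plain strings stand in for the categorical values.
levs = ["A", "B", "C"]
vals = ["B", "A", "C", "B"]

# Ref(levs) behaves like a scalar under broadcasting, so each predicate
# searches the whole `levs` vector.
idx_ref = findfirst.(isequal.(vals), Ref(levs))

# A one-element tuple protects `levs` in the same way.
idx_tuple = findfirst.(isequal.(vals), (levs,))

# Without protection, broadcasting pairs vals[i] with levs[i]. The lengths
# differ here (4 vs 3), so the call throws a DimensionMismatch.
unprotected_fails = try
    findfirst.(isequal.(vals), levs)
    false
catch err
    err isa DimensionMismatch
end
```

When the two lengths happen to match, the unprotected call gets past the broadcast step, but each predicate is then applied inside a single level rather than across the vector of levels, which is not the intended lookup.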


This doesn’t seem right:

``````
int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
``````

It’s as if you tried to replace broadcasting with list comprehension, but something went awry. I think you meant something like this:

``````
int_vec = [findfirst(isequal(cat), levs) for cat in cat_vec]
``````

At least that is what the initial broadcast is equivalent to. The `Ref` merely protects the `levs` argument from being broadcast over.
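As a rough sketch of that equivalence (plain strings again, with hypothetical names `vals` and `levs`), the broadcast form, the comprehension, and an explicit loop should all produce the same indices:

```julia
levs = ["A", "B", "C"]
vals = ["C", "A", "A", "B"]

# Broadcast form: `levs` is wrapped in Ref so it is passed whole to every call.
by_broadcast = findfirst.(isequal.(vals), Ref(levs))

# Comprehension form: `levs` is just an ordinary variable, no wrapper needed.
by_comprehension = [findfirst(isequal(x), levs) for x in vals]

# Explicit loop, spelling out what both forms above compute.
by_loop = Vector{Int}(undef, length(vals))
for (i, x) in enumerate(vals)
    by_loop[i] = findfirst(isequal(x), levs)
end
```

The wrapper is only needed in the broadcast form, because only there does Julia try to iterate over every array argument.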

Sifting through the available functions, I see that there are `pool()` and `refs()`, which return the data of interest from a categorical vector:

``````
julia> using CategoricalArrays, BenchmarkTools

julia> function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
           levs = levels(cat_vec)

           int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
           #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])

           return int_vec
       end # convert_factor_to_index
convert_factor_to_index (generic function with 1 method)

julia> v = ["A", "B", "C"]
3-element Vector{String}:
"A"
"B"
"C"

julia> @btime convert_factor_to_index(categorical(v))
378.325 ns (12 allocations: 944 bytes)
3-element Vector{Int64}:
1
2
3

julia> @btime CategoricalArrays.refs(categorical($v))
285.036 ns (10 allocations: 848 bytes)
3-element Vector{UInt32}:
0x00000001
0x00000002
0x00000003

julia> @btime CategoricalArrays.pool(categorical($v))
286.594 ns (10 allocations: 848 bytes)
CategoricalPool{String, UInt32}(["A", "B", "C"])
``````

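The code I proposed works if one replaces `catv` with `cat_vec` (the loop should also run over the elements of `cat_vec`; `1:length(levs)` only matches here because every value is a distinct level):

``````
julia> v = ["A", "B", "C"];

julia> cat_vec = categorical(v);

julia> levs = levels(cat_vec);

julia> int_vec = findfirst.([isequal.(cat_vec)[i].(levs) for i in eachindex(cat_vec)])
3-element Vector{Int64}:
1
2
3
``````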

That is what I needed. Thanks.

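Note that `isequal.(cat_vec)[i]` is definitely less efficient than `isequal(cat_vec[i])`: it builds the whole vector of predicates on every iteration just to pick out one of them.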
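A small sketch of the difference, using plain strings and hypothetical names `vals` and `levs`: both comprehensions below should return the same indices, but the first rebuilds a full vector of predicates for every element, while the second constructs only the one predicate it needs.

```julia
levs = ["A", "B", "C"]
vals = ["B", "C", "A", "B"]

# Rebuilds the whole predicate vector on every iteration, then picks one entry.
slow = [findfirst(isequal.(vals)[i].(levs)) for i in eachindex(vals)]

# Builds only the predicate that is needed for element i.
fast = [findfirst(isequal(vals[i]), levs) for i in eachindex(vals)]
```

The second form also avoids the temporary Boolean vector, since `findfirst` can stop at the first match instead of testing every level.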
