This function converts a categorical vector to an index:
function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
    levs = levels(cat_vec)
    int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
    #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
    return int_vec
end # convert_factor_to_index
An example:
julia> v = ["A", "B", "C"]
julia> convert_factor_to_index(categorical(v))
3-element Vector{Int64}:
1
2
3
So the function works, but I am not clear on why it needs the Ref. (It does not work without it.)
Also, it runs much slower with the following line of code:
int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
I understand broadcasting is fast, but does Ref also help to speed things up?
nilshg
June 14, 2023, 9:40am
2
Are you maybe looking for the levelcode function in CategoricalArrays?
julia> levelcode.(categorical(v))
3-element Vector{Int64}:
1
2
3
To your other question: no, Ref is not speeding anything up; it is simply required to protect levs from broadcasting, so that each element of cat_vec is looked up in the whole levs vector. You could equally have wrapped it in a single-element Tuple like (levs,).
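To make the point about Ref concrete, here is a minimal sketch using only Base Julia. It assumes plain strings as a stand-in for the categorical values, and the names x and levs are illustrative only:

```julia
# Plain strings stand in for categorical values (an assumption for illustration).
levs = ["A", "B", "C"]      # lookup table that must stay whole
x = ["C", "A", "B", "A"]    # values to locate in `levs`

# Ref(levs) is treated as a scalar, so only `x` is iterated.
with_ref = findfirst.(isequal.(x), Ref(levs))

# A one-element tuple has the same protective effect.
with_tuple = findfirst.(isequal.(x), (levs,))

# Without protection, broadcast tries to pair x[i] with levs[i];
# the lengths differ here (4 vs 3), so it throws a DimensionMismatch.
unprotected_fails = try
    findfirst.(isequal.(x), levs)
    false
catch err
    err isa DimensionMismatch
end
```

Both protected forms should give the same indices. Which wrapper to use is mostly a matter of taste.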
HanD
June 14, 2023, 1:13pm
3
This doesn’t seem right:
int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
It’s as if you tried to replace broadcasting with list comprehension, but something went awry. I think you meant something like this:
int_vec = [findfirst(isequal(cat), levs) for cat in cat_vec]
At least that is what the initial broadcast is equivalent to. The Ref merely protects the levs argument from being broadcast.
Sifting through the available functions, I see that there are pool() and refs() to get the data of interest from a categorical vector:
julia> using CategoricalArrays, BenchmarkTools
julia> function convert_factor_to_index(cat_vec::CategoricalVector)::Vector{Int}
           levs = levels(cat_vec)
           int_vec::Vector{Int} = findfirst.(isequal.(cat_vec), Ref(levs))
           #! int_vec = findfirst.([isequal.(catv)[i].(levs) for i in 1:length(levs)])
           return int_vec
       end # convert_factor_to_index
convert_factor_to_index (generic function with 1 method)
julia> v = ["A", "B", "C"]
3-element Vector{String}:
"A"
"B"
"C"
julia> @btime convert_factor_to_index(categorical(v))
378.325 ns (12 allocations: 944 bytes)
3-element Vector{Int64}:
1
2
3
julia> @btime CategoricalArrays.refs(categorical($v))
285.036 ns (10 allocations: 848 bytes)
3-element Vector{UInt32}:
0x00000001
0x00000002
0x00000003
julia> @btime CategoricalArrays.pool(categorical($v))
286.594 ns (10 allocations: 848 bytes)
CategoricalPool{String, UInt32}(["A", "B", "C"])
The code I proposed works if one replaces ‘catv’ with ‘cat_vec’:
v = ["A", "B", "C"]
cat_vec = categorical(v)
levs = levels(cat_vec)
int_vec = findfirst.([isequal.(cat_vec)[i].(levs) for i in 1:length(levs)])
3-element Vector{Int64}:
1
2
3
That is what I needed. Thanks.
Soldalma:
isequal.(cat_vec)[i]
This is definitely less efficient than isequal(cat_vec[i]).
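A rough sketch of why, again assuming plain strings in place of categorical values and using illustrative names: isequal.(x)[i] builds a predicate for every element of x and then keeps just one, whereas isequal(x[i]) builds only the one it needs. Inside a loop over i, the first form therefore repeats that full-length construction on every iteration.

```julia
# Plain strings stand in for categorical values (an assumption for illustration).
x = ["C", "A", "B", "A"]
levs = ["A", "B", "C"]

# Builds a length(x) vector of predicates on every iteration, then keeps one.
slow = [findfirst(isequal.(x)[i], levs) for i in eachindex(x)]

# Builds only the single predicate that is needed.
fast = [findfirst(isequal(x[i]), levs) for i in eachindex(x)]
```

The two comprehensions should return the same indices. Only the amount of intermediate work differs.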