Categorical to Integer values?

If I have a DataFrame with a categorical column, how can I convert it to integers? In R, if foo is a data table with a categorical variable bar, you would do as.numeric(foo$bar).

I plan to index into an array based on which category things are in… and want to work with integers, not category names.

I see that I can access the internal “refs” field and this is the number that indexes the levels… but is this the “official” way?

We should definitely add a public function to get integer codes. Until then, you can access the refs field, but note that its values correspond to the order in index(x.pool), not to the order in levels(x). See the warning at https://juliadata.github.io/CategoricalArrays.jl/latest/implementation.html.
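
For reference, later releases of CategoricalArrays.jl expose a public function for this, levelcode, which returns a value's position in levels(x). A minimal sketch, assuming a version of the package that provides levelcode (the variable names x, codes and counts are illustrative):

```julia
using CategoricalArrays

# Levels default to sorted order: ["a", "b", "c"]
x = categorical(["b", "a", "c", "a"])

# Broadcast levelcode to get each element's index into levels(x)
codes = levelcode.(x)

# The codes can then index another array that has one entry per level
counts = zeros(Int, length(levels(x)))
for c in codes
    counts[c] += 1
end
```

This avoids touching the refs field directly, so the result does not depend on the internal pool order.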

Thanks for pointing that out! I think ideally you’d just create a constructor for Int… and it’d return the index into the levels… so something like this?

function Int(a::CategoricalArray)
   map(x -> Int(x),CategoricalArrays.order(a.pool)[a.refs]);
end

Similarly for explicit UInt32, UInt64, etc.

WHOOPS, I’m obviously new to Julia, of course you’re going to return an Array{Int} so I guess you’d want

function Array{Int}(a::...

I am not sure if your function definition is a good idea. It seems you might be overloading Int. How about just converting the type (to whatever type you want):

using RDatasets, CategoricalArrays
iris = dataset("datasets", "iris")

v = iris[:Species]

# Map each internal ref to its position in levels(v) via the pool's order vector
function get_indices(v)
    return CategoricalArrays.order(v.pool)[v.refs]
end

refs = get_indices(v)
refs2 = convert(Vector{Int}, refs)

I’m just learning Julia so I don’t know all the idioms, but that seems like a bunch of work, and it does exactly the same thing, right? Namely, it gives you an array of ints?

What is the reason not to just construct an Array{Int}?

Indeed, page 177 of the Julia manual discusses convert vs. constructors and clearly says conversion is for a different representation of the same thing. Here we are constructing an array of ints from a categorical representation of strings, so a constructor seems to be exactly what we want.

As I said above: “I am not sure”. Feel free to use whatever approach works for you.

I guess the question is, is there a good abstraction here that should be built-in to the CategoricalArrays package? It seems to me like an Array{Int} constructor is something the package should provide and it should give integers that index into the levels.

Yes, it could be confusing to use Int since e.g. a categorical array with levels ["3", "2", "1"] or, even worse, [3, 2, 1] would return values which do not correspond to the levels, but to their order. So calling Int.(x) on the array could give unexpected results. It is probably better to introduce a dedicated function, but I’m not sure what name would be appropriate.
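
To make the ambiguity concrete, here is a small sketch in plain Julia (no packages) that mimics level codes with findfirst. The names lvls, vals and codes are illustrative, and this is not how CategoricalArrays stores its data:

```julia
# Levels in a custom order, and some observed values
lvls = [3, 2, 1]
vals = [1, 3, 2, 3]

# Position of each value within lvls, i.e. its "level code"
codes = [findfirst(isequal(v), lvls) for v in vals]

# The codes differ from the values themselves: the value 1 has code 3
println(codes)

# Indexing the levels with the codes recovers the original values
recovered = lvls[codes]
```

An Int-style conversion that returned codes here would turn the value 1 into 3, which is the surprise described above.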

I’ll buy that, so how about level_codes for a function name?

I am not opposed to something like level_codes, and it is probably the best option. I still think something like convert(Int, x) could/should return the reference position in the pool.

Maybe categories is a better name, given that these are categorical arrays.

Well, I’d expect categories to return what levels returns, basically the category names. But since we have levels already and the R equivalent is levels, it seems level_codes makes good sense, or level_numbers, or level_indices. Of those I think I prefer codes because it’s easy to say and type.

Given that levels is there, levels_indices seems most explicit and unambiguous (since a level can also be a code or a number).

That said, I feel that levels is a bit of a misnomer since it suggests something ordinal and not categorical, as opposed to, say, values or categories.

The terminology is traditional and well known in statistics, where categorical variables are called factors and “levels of a factor” is the standard term, so I guess it depends on background as to which is more intuitive.

Since the package is called CategoricalArrays.jl, my brain was just primed for categories.

I changed the type of the categorical variables in columns “a” and “b” of a dataframe (df) by using the following code:

df[:a] = Vector{Float64}(df[:a])
df[:b] = Vector{Union{Int, Missing}}(df[:b])

The solution is from the following YouTube video:
Dataframe tutorial - categorical variables