If I have a DataFrame with a categorical column, how can I convert it to integers? In R, if foo is a data table with a categorical variable bar, you would do as.numeric(foo$bar).
I plan to index into an array based on which category things are in, so I want to work with integers, not category names.
We should definitely add a public function to get integer codes. In the meantime, you can access the refs field, but note that its values correspond to the order in index(x.pool), not to the order in levels(x). See the warning at Implementation details · CategoricalArrays.
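To make the distinction concrete, here is a rough sketch in plain Julia. It is only a toy model: pool_index, lvls, refs, and order are hypothetical stand-in variables, not the package's actual fields, and the assumption is that the pool stores values in first-seen order while the levels have a separate user-facing order.

```julia
# Toy model in plain Julia; these variables are illustrative stand-ins,
# not CategoricalArrays internals.
data = ["b", "a", "c", "a"]
pool_index = ["b", "a", "c"]   # values in first-seen order (stand-in for index(x.pool))
lvls = ["a", "b", "c"]         # user-facing level order (stand-in for levels(x))

# refs: position of each element in pool_index
refs = [findfirst(==(v), pool_index) for v in data]    # [1, 2, 3, 2]

# order: for each pool entry, its position in lvls
order = [findfirst(==(v), lvls) for v in pool_index]   # [2, 1, 3]

# Level codes: remap the refs through order
codes = order[refs]                                    # [2, 1, 3, 1]
```

In this toy model, lvls[codes] reproduces data, whereas lvls[refs] does not, which is why using refs directly as level codes can go wrong.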
Thanks for pointing that out! I think ideally you’d just create a constructor for Int… and it’d return the index into the levels… so something like this?
function Int(a::CategoricalArray)
    map(x -> Int(x), CategoricalArrays.order(a.pool)[a.refs])
end
Similarly for explicit UInt32, UInt64, and so on.
Whoops, I’m obviously new to Julia. Of course you’re going to return an Array{Int}, so I guess you’d want a constructor for Array{Int} instead.
I am not sure if your function definition is a good idea. It seems you might be overloading Int. How about just converting the type (to whatever type you want):
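using RDatasets
using CategoricalArrays  # needed to refer to CategoricalArrays.order

iris = dataset("datasets", "iris")
v = iris[:Species]

# Map each element's ref to the position of its level in levels(v)
function get_indices(v)
    return CategoricalArrays.order(v.pool)[v.refs]
end

refs = get_indices(v)
refs2 = convert(Vector{Int}, refs)  # convert the codes to a plain Vector{Int}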
I’m just learning Julia, so I don’t know all the idioms, but that seems like a lot of work, and it does exactly the same thing, right? Namely, it gives you an array of ints.
What is the reason not to just construct an Array{Int}?
Indeed, page 177 of the Julia manual discusses convert vs. constructors and clearly says conversion is for a different representation of the same thing. Here we are constructing an array of ints from a categorical representation of strings, so it seems like a constructor is exactly what we want.
I guess the question is, is there a good abstraction here that should be built-in to the CategoricalArrays package? It seems to me like an Array{Int} constructor is something the package should provide and it should give integers that index into the levels.
Yes, it could be confusing to use Int, since e.g. a categorical array with levels ["3", "2", "1"] or, even worse, [3, 2, 1] would return values which correspond not to the levels themselves but to their order. So calling Int.(x) on the array could give unexpected results. It is probably better to introduce a dedicated function, but I’m not sure what name would be appropriate.
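A small sketch of that pitfall, in plain Julia rather than with the package itself (lvls, data, codes, and parsed are hypothetical names chosen for illustration):

```julia
# Plain-Julia illustration; no CategoricalArrays calls are used here.
lvls = ["3", "2", "1"]          # level order chosen by the user
data = ["1", "3", "2", "3"]

# Position of each value in lvls, i.e. what an integer "code" would be
codes = [findfirst(==(v), lvls) for v in data]   # [3, 1, 2, 1]

# What a reader might expect Int to mean: the number written in the label
parsed = parse.(Int, data)                        # [1, 3, 2, 3]

codes == parsed   # false: the two interpretations disagree
```

With a dedicated name for the first interpretation, a reader is less likely to mistake the codes for the numeric content of the labels.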
I am not opposed to something like level_codes, and it is probably the first-best option. I still think something like convert(Int, x) could/should return the reference position in the pool.
Well, I’d expect categories to return what levels returns, basically the category names. But since we already have levels, and the R equivalent is levels, it seems level_codes makes good sense, or level_numbers or level_indices. Of those, I think I prefer codes because it’s easy to say and type.
The terminology is traditional and well known in statistics, where categorical variables are called factors and "levels of a factor" is the standard term, so I guess it depends on your background as to which is more intuitive.