Is there a way that I can attach string labels to integer values in CategoricalArrays?

I noticed that the CategoricalPool type allows labels for categorical values. I am wondering whether there is a way to convert an integer variable to a CategoricalArray with a string label corresponding to each integer value. There are CategoricalPool and CategoricalString, but I am not sure how to use them to achieve what I want.


Is this what you want?

julia> x = categorical([1,2,1,3])
4-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 1
 3

julia> recode(x, 1=>"a",2=>"b",3=>"c")
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "a"
 "c"

Thank you. It works.

However, I was wondering whether there is a direct way of attaching string labels to an existing CategoricalArray. I think an easier way of working with value labels is important.

Currently additional labels are not supported: categorical arrays simply contain values like a standard array. We’ve discussed supporting labels a few times, but I’m not completely sure yet what’s the use case for that. AFAICT the main interest would be to provide long labels in addition to the more compact levels, e.g. with levels "M" and "F" you would have labels "Male" and "Female".

Could you describe your use case?

In a GLM regression, a CategoricalArray or a PooledArray is recognized as a categorical variable. So for example, if I have a variable whose values 1, 2, 3, and 4 represent “White”, “Black”, “Hispanic”, and “Other”, I want to associate these labels with the numerical values. One way to get the desired CategoricalArray would be to create another variable that has “White” for 1, etc., and convert it to a CategoricalArray. But if I could attach the labels directly to the original CategoricalArray, it would be much easier and faster.

OK, so that doesn’t really match the use case I described. I think using recode is the best approach for you then. I think it would be hard to imagine an easier approach than that, and in terms of efficiency it should be pretty good too (though it creates a copy). For absolute maximal performance, we could add an option to reuse the underlying memory, but you cannot reuse the same CategoricalArray object since you cannot change the type of an existing object.

I see. Thank you for your reply.

Can you make recode work with a Dict of value and label pairs as well?

That would be easy, but the implementation would simply splat the dict into a varargs of pairs. So you might as well do it manually: recode([1, 2], Dict(1=>"A", 2=>"B")...).

A possible reason to add a Dict method would be to avoid compiling a specialized method for each particular set of pairs, which could make sense e.g. if the number of levels is very high and they can change dynamically. AFAICT that’s not your case, though.
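The splatting suggestion above can be sketched as follows, reusing the race/ethnicity labels from earlier in the thread. The variable names are only illustrative. Since Dict iteration order is unspecified, the sketch also calls levels! (a CategoricalArrays function not mentioned above) to fix the level order explicitly, on the assumption that the order of levels matters for the regression's reference category.

```julia
using CategoricalArrays

# Label mapping taken from the race/ethnicity example earlier in the thread
labels = Dict(1 => "White", 2 => "Black", 3 => "Hispanic", 4 => "Other")

x = categorical([1, 2, 1, 3, 4])

# Iterating a Dict yields key => value pairs, so splatting it passes
# each pair to recode as a separate argument
y = recode(x, labels...)

# Dict iteration order is unspecified, so set the level order explicitly
levels!(y, ["White", "Black", "Hispanic", "Other"])
```

This still creates a copy, as noted above; the original x keeps its integer values.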

Thank you. You’re right. I am not thinking of a case where there are a large number of levels. But even with 4 or 5 levels, one Dict of value-label pairs can be reused for many variables in some cases. Typically, surveys tend to have many questions like this:

1 = Yes
2 = No
3 = Refused
5 = Unknown
…

So if one Dict works for many questions, that will help simplify using recode. I recognize that using splat operator after a Dict is easy enough so it may be a moot point.
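A minimal sketch of that idea, assuming one coding scheme shared by several questions. The helper apply_labels and the columns q1 and q2 are hypothetical names, not part of CategoricalArrays, and only the four codes listed above are included. The helper sorts the pairs by code before splatting so that every column receives them in the same order.

```julia
using CategoricalArrays

# Codes from the survey example above (entries elided there are not filled in)
answer_labels = Dict(1 => "Yes", 2 => "No", 3 => "Refused", 5 => "Unknown")

# Hypothetical helper: recode one column using a shared Dict of code => label.
# Pairs are sorted by code so that every column receives them in the same order.
function apply_labels(col, labels::AbstractDict)
    ordered_pairs = sort!(collect(labels); by = first)
    return recode(categorical(col), ordered_pairs...)
end

# Hypothetical survey questions that share one coding scheme
q1 = [1, 2, 2, 5]
q2 = [3, 1, 1, 2]

q1_labeled = apply_labels(q1, answer_labels)
q2_labeled = apply_labels(q2, answer_labels)
```

With a helper like this, the Dict is defined once and each additional question costs one line.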

Ideally, it might be a good idea to separate labels from CategoricalArrays, so that labels are stored in a different memory space than the CategoricalArray itself: maybe somewhere in the same DataFrame, or even separately from the DataFrame. Stata takes the first approach and SAS takes the second.

Here’s an example of why that might be a good idea. In several datasets that I am using, diseases and treatments are coded in what are known as ICD-9 and CPT codes. Many datasets (inpatient and outpatient records) tend to have 10 or more disease fields for each patient visit, and many treatment codes as well. So the current implementation of CategoricalArrays would repeat exactly the same labels for each categorical array, resulting in a needlessly large DataFrame.

Maybe this is a use case you might want to give some thought to.

This can be handled by sharing a CategoricalPool among all columns, though this feature is currently not very well documented nor exposed in the public API.

But typically the number of levels is going to be much smaller than the number of observations, so I doubt it will make any difference on storage size or performance. Have you measured this on your particular use case? On the other hand, sharing pools can complicate things a lot since modifying the pool of levels would require adapting all arrays, and keeping track of which ones need to be modified.
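One rough way to measure this is to compare the memory taken by the levels with that of the whole array, using Base.summarysize. This is only a sketch on synthetic data: the column below (1,000 distinct codes over 1,000,000 rows, with made-up "code-N" strings) is an assumption, not a real ICD-9 dataset, and summarysize gives an approximate figure.

```julia
using CategoricalArrays

# Synthetic stand-in for a coded column: 1,000 distinct codes, 1,000,000 rows
codes = repeat(1:1_000, 1_000)
col = categorical(string.("code-", codes))

# Approximate bytes used by the whole array (integer references plus pool)
total_bytes = Base.summarysize(col)

# Approximate bytes used by the level values alone
level_bytes = Base.summarysize(levels(col))

println("levels account for roughly ", round(100 * level_bytes / total_bytes; digits = 2), "% of the array")
```

Running the same comparison on an actual column would show whether repeated levels are a meaningful share of the storage in that dataset.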