Is there a way that I can attach string labels to integer values in CategoricalArrays?

mwsohn · February 10, 2018, 6:20am

I noticed that the CategoricalPool type allows labels for categorical values. I am wondering whether there is a way to convert an integer variable to a CategoricalArray with a string label corresponding to each integer value. There are CategoricalPool and CategoricalString, but I am not sure how to use them to achieve what I want.

bkamins · February 10, 2018, 6:52am

Is this what you want?

julia> x = categorical([1,2,1,3])
4-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 1
 3

julia> recode(x, 1=>"a",2=>"b",3=>"c")
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "a"
 "c"

mwsohn · February 10, 2018, 5:22pm

Thank you. It works.

However, I was wondering whether there is a direct way of attaching string labels to an existing CategoricalArray. I think an easier way of working with value labels is important.

nalimilan · February 10, 2018, 5:43pm

Currently additional labels are not supported: categorical arrays simply contain values like a standard array. We’ve discussed supporting labels a few times, but I’m not completely sure yet what the use case for that would be. AFAICT the main interest would be to provide long labels in addition to the more compact levels, e.g. with levels "M" and "F" you would have labels "Male" and "Female".

Could you describe your use case?

mwsohn · February 10, 2018, 5:54pm

In a GLM regression, a CategoricalArray or a PooledArray is recognized as a categorical variable. So for example, if I have a variable whose values 1, 2, 3, and 4 represent “White”, “Black”, “Hispanic”, and “Other”, I want to associate these labels with the numerical values. One way to get the desired CategoricalArray would be to create another variable that has “White” for 1, etc., and convert it to a CategoricalArray. But if I could attach the labels directly to the original CategoricalArray, it would be much easier and faster.

nalimilan · February 10, 2018, 11:29pm

OK, so that doesn’t really match the use case I described. I think using recode is the best approach for you then. I think it would be hard to imagine an easier approach than that, and in terms of efficiency it should be pretty good too (though it creates a copy). For absolute maximal performance, we could add an option to reuse the underlying memory, but you cannot reuse the same CategoricalArray object since you cannot change the type of an existing object.
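
For the race example above, a minimal sketch of the recode approach could look like this. The names `race` and `race_labeled` and the sample codes are made up for illustration.

```julia
using CategoricalArrays

# Hypothetical sample data: integer codes 1-4 stored as a CategoricalArray
race = categorical([1, 2, 3, 4, 2, 1])

# recode returns a new CategoricalArray holding strings; `race` is left unchanged
race_labeled = recode(race, 1 => "White", 2 => "Black", 3 => "Hispanic", 4 => "Other")
```

Since the result is a copy, the original integer-coded array stays available alongside the labeled one.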

mwsohn · February 11, 2018, 4:02am

I see. Thank you for your reply.

Can you make recode work with a Dict of value and label pairs as well?

nalimilan · February 11, 2018, 10:31am

That would be easy, but the implementation would simply splat the dict into a varargs of pairs. So you can as well do it manually: recode([1, 2], Dict(1=>"A", 2=>"B")...).

A possible reason to add a Dict method would be to avoid compiling a specialized method for each particular set of pairs, which could make sense e.g. if the number of levels is very high and they can change dynamically. AFAICT that’s not your case, though.

mwsohn · February 11, 2018, 9:17pm

Thank you. You’re right. I am not thinking of a case with a large number of levels. But even with 4 or 5 levels, one Dict of value-label pairs can be reused for many variables in some cases. Typically, surveys tend to have many such questions:

1 = Yes
2 = No
3 = Refused
5 = Unknown
…

So if one Dict works for many questions, that will help simplify using recode. I recognize that using the splat operator after a Dict is easy enough, so it may be a moot point.
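
A rough sketch of that reuse, assuming the splatting approach suggested above. The names `answer_labels`, `label_answers`, `q1` and `q2` are hypothetical, and the data is made up. Dict iteration order is unspecified, so the pairs are sorted by code before splatting. As far as I understand, the level order of the result follows the order of the pairs, which is why the sort may matter.

```julia
using CategoricalArrays

# One shared set of value => label pairs for many survey questions
answer_labels = Dict(1 => "Yes", 2 => "No", 3 => "Refused", 5 => "Unknown")

# Hypothetical helper: sort the pairs by code so they are passed in a fixed order,
# then splat them into recode
label_answers(x) = recode(x, sort(collect(answer_labels), by = first)...)

# Two made-up survey variables that use the same coding scheme
q1 = categorical([1, 2, 2, 5])
q2 = categorical([3, 1, 1, 2])

q1_labeled = label_answers(q1)
q2_labeled = label_answers(q2)
```

Each call still produces its own copy of the labels, so this simplifies the code but does not address the storage question.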

Ideally, it might be a good idea to separate labels from CategoricalArrays, so that labels are stored in a different memory space than the CategoricalArray itself: either somewhere in the same DataFrame or entirely outside the DataFrame. Stata takes the first approach and SAS takes the second.

Here’s an example of why that might be a good idea. In several datasets that I am using, diseases and treatments are coded in what are known as ICD-9 and CPT codes. Many datasets (inpatient and outpatient records) tend to have 10 or more disease fields for each patient visit, and many treatment codes as well. So the current implementation of CategoricalArrays would repeat exactly the same labels for each categorical array, resulting in a needlessly large DataFrame.

Maybe this is a use case you might want to give some thought to.

nalimilan · February 11, 2018, 10:24pm

This can be handled by sharing a CategoricalPool among all columns, though this feature is currently neither well documented nor exposed in the public API.

But typically the number of levels is going to be much smaller than the number of observations, so I doubt it will make any difference on storage size or performance. Have you measured this on your particular use case? On the other hand, sharing pools can complicate things a lot since modifying the pool of levels would require adapting all arrays, and keeping track of which ones need to be modified.
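
One rough way to check this on your own data is sketched below. The data is synthetic (1000 made-up code strings repeated over a million observations), `Base.summarysize` only gives an approximate figure, and the estimate for the integer codes assumes the default `UInt32` reference type.

```julia
using CategoricalArrays

# Synthetic stand-in for a coded column: 1000 distinct labels, 1,000,000 observations
codes = repeat(1:1000, 1000)
x = categorical(string.("code", codes))

n_levels = length(levels(x))   # number of distinct labels stored in the pool
n_obs = length(x)              # number of observations

# Approximate total memory reachable from the array, in bytes
total_bytes = Base.summarysize(x)

# Estimated bytes used by the per-observation integer codes (assumes UInt32 refs)
codes_bytes = n_obs * sizeof(UInt32)

println("levels: ", n_levels, ", observations: ", n_obs)
println("approx. total bytes: ", total_bytes)
println("approx. bytes outside the integer codes: ", total_bytes - codes_bytes)
```

The last figure is a rough estimate of what the levels cost for one column, and therefore of what sharing a pool could save at best.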
