DataFrames.jl has a nice feature for adding metadata to a table:
Metadata · DataFrames.jl (juliadata.org)
This is useful because you often want to use shorter variable names when coding, but still be able to conveniently look up what each variable represents (or print more verbose variable names or descriptions when creating a table).
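For reference, here is a minimal sketch of that DataFrames.jl feature using column-level metadata. The table, the column names, and the `"label"` key are made up for illustration; only `colmetadata!` and `colmetadata` come from DataFrames.jl (version 1.4 or later).

```julia
using DataFrames

# Hypothetical table with short column names.
df = DataFrame(gdp = [1.2, 3.4], pop = [10, 20])

# Attach a longer description to each column. "label" is just the key chosen here.
colmetadata!(df, :gdp, "label", "Gross domestic product (trillion USD)"; style = :note)
colmetadata!(df, :pop, "label", "Population (millions)"; style = :note)

# Look the descriptions up later, e.g. when building a table header.
gdp_label = colmetadata(df, :gdp, "label")
labels = [colmetadata(df, col, "label") for col in names(df)]
```

With `style = :note` the metadata is propagated by most operations that keep the column, which is usually what you want for descriptive labels.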
Is there something similar for CategoricalArrays.jl? That is, if I want to use relatively short names in the levels of the categorical array, but also want longer names or descriptions for each of the categorical values, is there a package that allows me to conveniently add this metadata to my categorical arrays and conveniently exploit it?
Thanks!
There was some work on this in GitHub - JuliaArrays/MetadataArrays.jl, but I do not know its current status.
This is the kind of feature which could make sense to have in CategoricalArrays, and I’ve thought about it in the past. But that would also make the package more complex so I’m somewhat hesitant. Storing a set of descriptions is easy, but you’d have to handle the (default) case where no descriptions are provided, decide what to do when concatenating two pools with identical levels but different descriptions, etc.
This discussion is similar to the one we had about supporting the same features as `LabeledArray` (which I haven’t done in the end): `LabeledArray` and `CategoricalArray` · Issue #4 · junyuan-chen/ReadStatTables.jl · GitHub
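To make the two design questions above concrete (no descriptions provided, and conflicting descriptions when combining pools), here is a rough sketch that keeps descriptions in a plain `Dict` next to the array. This is not something CategoricalArrays provides; `describe_level`, `merge_descriptions`, and the example levels are hypothetical names, and raising an error on conflict is only one possible policy.

```julia
using CategoricalArrays

# Hypothetical short levels and a side table of longer descriptions.
x = categorical(["hs", "ba", "hs", "phd"])
descriptions = Dict("hs" => "High school diploma", "ba" => "Bachelor's degree")

# Default case: fall back to the level itself when no description was provided.
describe_level(desc, lvl) = get(desc, lvl, lvl)

long_names = [describe_level(descriptions, l) for l in levels(x)]

# Combining two description tables: identical entries are kept,
# conflicting ones raise an error.
function merge_descriptions(a, b)
    mergewith(a, b) do da, db
        da == db || error("conflicting descriptions: $da vs $db")
        da
    end
end

merged = merge_descriptions(descriptions, Dict("hs" => "High school diploma", "phd" => "Doctorate"))
```

Even in this small form the policy choices show up: whether a missing description falls back to the level, and whether a conflict errors, keeps the first description, or drops both.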
The MetadataArrays approach does look cleaner and more composable: like categorical arrays, other array types can also benefit from carrying metadata. The only (but significant) friction point here is the general “multiple nested wrappers” problem.
Makes sense, thanks! BTW I’m a big fan of the CategoricalArrays.jl package.
A related motivation for this long-labels feature: some categorical variables are encoded as unintelligible strings of alphanumeric characters. Like the refs array, they are unique identifiers, but, unlike the refs array, they can’t be used to index arrays, take up memory unnecessarily, and are slower to compare (e.g. the `==` operation is very fast for ints).
Because they are not human-intelligible, you would want to replace them with their labels in the categorical array. However, for consistency, you would want your scripts to define operations on the provided alphanumeric codes (e.g. `catarr .== "az5w45"`).
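A small sketch of what I have in mind, using only what CategoricalArrays offers today plus a separate `Dict` for the labels. The codes, the labels, and the variable names are invented for illustration.

```julia
using CategoricalArrays

# Hypothetical opaque codes stored as the levels of the array.
catarr = categorical(["az5w45", "q9x2k1", "az5w45"])

# Human-readable labels kept in a separate lookup table.
labels = Dict("az5w45" => "Northern region", "q9x2k1" => "Southern region")

# Scripts keep comparing against the original alphanumeric code...
mask = catarr .== "az5w45"

# ...while display uses the labels, looked up from the unwrapped level.
shown = [labels[unwrap(v)] for v in catarr]

# The integer codes are still available for fast comparisons and indexing.
codes = levelcode.(catarr)
```

The drawback is that the labels travel separately from the array, which is exactly the gap a metadata feature would close.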