Dataframes: displaying categorical key when displaying a dataframe while saving the value in the data array

00krishna · April 24, 2024, 5:25pm

I have a question about displaying categorical data in a dataframe. I want to see if there is a way to display the categorical key value, while preserving the categorical value in the data. The point is to help users working with the data more easily interpret the values instead of trying to guess about the numerical values.

Note that it is important to keep the original categorical values in the dataframe, meaning values like 1, 2, 3 for the months. Since those values come from the metadata for the source data, we want to use the original numerical encodings. So this question is really about convenience when displaying the information.

So I have data that looks like

julia> DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])
3×2 DataFrame
 Row │ month  sensor1 
     │ Int64  Float64 
─────┼────────────────
   1 │     1      2.1
   2 │     2      2.4
   3 │     3      5.1

This display is fine, but having to translate month=2 into February makes it harder for a user to quickly interpret the data. Months are pretty obvious, but other categories may be less intuitive.

So I wanted to see if I could display the category key, while keeping the categorical value in the dataframe. For example

 Row │ month     sensor1 
     │ String    Float64 
─────┼───────────────────
   1 │ january       2.1
   2 │ february      2.4
   3 │ march         5.1

Once again, I don’t want to overwrite the original numerical data in the month column: [1, 2, 3]. I just want to be able to display the category key rather than the numerical values.

pdeffebach · April 24, 2024, 5:31pm

There isn’t a good package for this. The package ReadStatTables.jl has a LabelledArray type that implements Stata-like labeling (which is basically what you are asking for). But it’s not available on its own. Someone needs to split that type out into its own package so people can use it without the full ReadStatTables.jl dependency.
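To make the idea of Stata-like labeling concrete, here is a minimal sketch in plain Julia. It is not the ReadStatTables.jl implementation; `LabelledValue`, `unwrap`, and `month_names` are hypothetical names made up for illustration. The assumption is that a value which carries its own label and overrides `show` will print the label wherever display goes through `show`, while the original code stays stored alongside it.

```julia
# Hypothetical sketch: a value that stores the original code plus a display label.
# This is NOT the ReadStatTables.jl type, only an illustration of the idea.
struct LabelledValue{T}
    value::T       # original numerical encoding, left untouched
    label::String  # text shown when the value is printed
end

# Print the label instead of the underlying code.
Base.show(io::IO, x::LabelledValue) = print(io, x.label)

# Recover the original code (hypothetical helper name).
unwrap(x::LabelledValue) = x.value

# Hypothetical code-to-label mapping, e.g. taken from the source metadata.
month_names = Dict(1 => "january", 2 => "february", 3 => "march")

months = [LabelledValue(m, month_names[m]) for m in [1, 2, 3]]
shown = string.(months)   # labels, as they would be printed
codes = unwrap.(months)   # original codes, unchanged
```

A vector like `months` could in principle be used as a column, since the original codes remain available through `unwrap`. Comparisons, sorting, and arithmetic on the wrapped values would still need their own methods, which is part of what a dedicated package would have to provide.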

00krishna · April 24, 2024, 5:40pm

Thanks @pdeffebach . Hmm interesting. Okay, so I can’t do this for now? I am not very familiar with the ReadStatTables.jl package, so forgive me. So you are saying that I would need a LabelledArray type, and that I can’t use that type of array in a DataFrame right now?

pdeffebach · April 24, 2024, 6:19pm

I’m saying a type exists, with the behavior you want, but it isn’t in its own package and I don’t think it would be worth it to take on the big dependency for the simple array type.

It’s not like you “can’t” use it in a data frame, it’s just that it wouldn’t be worth it.

00krishna · April 24, 2024, 6:32pm

Okay, I understand. That makes sense. Thanks @pdeffebach . I will post to the DataFrames.jl repo, and see if there is any plan to incorporate a feature like this into their package. I will mention LabelledArray too, in case they want to look at the implementation.

pdeffebach · April 24, 2024, 6:33pm

No. That’s not quite right. DataFrames.jl is agnostic as to array types that can be used in DataFrames. You can use CategoricalArrays, PooledArrays, etc. with no loss of features.

If you are to do anything, file an issue in ReadStatTables.jl about moving LabelledArray into a separate package.

00krishna · April 24, 2024, 6:39pm

Ahh, okay. Yeah, I can open the issue in the ReadStatTables.jl repo. Thanks for putting up with my confusion.

nilshg · April 24, 2024, 9:23pm

Couldn’t this just be implemented as a formatter function in PrettyTables.jl though?

pdeffebach · April 24, 2024, 9:45pm

Maybe? Wouldn’t be useful for interactive work though.

00krishna · April 25, 2024, 1:47am

@nilshg can you explain your idea a bit? The point is that the text should be visible when the user is looking at the data. So if I do first(df, 5) or something, I would like to display the category text value in each field, if that field is indeed categorical. I don’t want to change the underlying numerical values. So could that work for your idea?

nilshg · April 25, 2024, 7:55am

I mean something like this (adapted from your integer/month example, but hopefully the extension to categorical arrays etc is obvious):

julia> using DataFrames, Dates

julia> df = DataFrame(month = rand(1:12, 5), sensor1 = rand(5))
5×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │     3  0.236443
   2 │     2  0.520037
   3 │     6  0.177478
   4 │     1  0.199209
   5 │     3  0.43456

julia> DataFrames._pretty_tables_general_formatter(v::Int, i::Int, j::Int) = Dates.format.(Date(0):Month(1):Date(0, 12), "u")[v]

julia> df
5×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │   Mar  0.236443
   2 │   Feb  0.520037
   3 │   Jun  0.177478
   4 │   Jan  0.199209
   5 │   Mar  0.43456

julia> first(df, 3)
3×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │   Mar  0.236443
   2 │   Feb  0.520037
   3 │   Jun  0.177478

This relies on internals but I’m using this a lot for e.g. pretty printing of large numbers (with thousand separators) etc.
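Two things to keep in mind with the one-liner above: as written, the method matches any `Int` cell regardless of column, and it rebuilds the twelve abbreviations on every call. The lookup itself can be sketched separately using only the Dates standard library. `MONTH_ABBR` and `month_label` are hypothetical names, and the fallback for out-of-range values is an assumption about what would be wanted:

```julia
using Dates

# Build the twelve abbreviations once; "u" is the Dates format code
# for the abbreviated month name (English by default).
const MONTH_ABBR = [Dates.format(Date(2000, m, 1), "u") for m in 1:12]

# Hypothetical helper: map a month number to its label, falling back to
# the number itself when it is not a valid month.
month_label(v::Integer) = 1 <= v <= 12 ? MONTH_ABBR[v] : string(v)

codes = [3, 2, 6, 1, 3]        # original encodings stay untouched
labels = month_label.(codes)   # display text derived from them
```

A helper like this could be called from a formatter, which would keep integers outside 1:12 from raising a `BoundsError` when displayed.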

I think I’ve had some discussion with @bkamins and @Ronis_BR at some point to make this functionality public but I can’t remember where we ended up.

bkamins · April 25, 2024, 6:14pm

An alternative that also supports it is InMemoryDatasets.jl (sl-solution/InMemoryDatasets.jl on GitHub), a multithreaded package for working with tabular data in Julia.

00krishna · April 25, 2024, 8:53pm

@bkamins thanks for responding. I was looking at the docs on InMemoryDatasets, but I was not clear on a couple of things. First, since the package ostensibly keeps the data in-memory, is there a limit on the size of the data that the user can load? I am just asking because the data that we are looking at is USA Census data, so some users might have rather large datasets. We are still trying to figure out whether saving the data in a local database and querying it is a good option, etc. But is memory a genuine concern with this package?

Second, do all of the usual filtering tools/minilanguage and Tidy packages work with InMemoryDatasets.jl? The package seems new so of course it won’t be as complete as DataFrames.jl, but I was just wondering if having the same minilanguage and Tidy packages working is planned or anticipated.

Thanks again for the suggestion Bogumil.

bkamins · April 25, 2024, 9:07pm

I am not maintaining this package and I do not use it, I just know of its existence so I thought I would give the information about it here. Probably the maintainers can help you with your questions if you raise an issue on its GitHub repository. I hope this helps.
