Dataframes: displaying categorical key when displaying a dataframe while saving the value in the data array

I have a question about displaying categorical data in a dataframe. I want to see if there is a way to display the categorical key value, while preserving the categorical value in the data. The point is to help users working with the data more easily interpret the values instead of trying to guess about the numerical values.

Note that it is important to keep the original categorical values in the dataframe, meaning values like 1, 2, 3 for the months. Since those values come from the metadata for the source data, we want to use the original numerical encodings. So this question is really about convenience when displaying the information.

So I have data that looks like

julia> DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])
3×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼────────────────
   1 │     1      2.1
   2 │     2      2.4
   3 │     3      5.1

This display is fine, but requiring the user to translate month=2 into February makes the data slower to read. Months are pretty obvious, but other categories may be less intuitive.

So I wanted to see if I could display the category key, while keeping the categorical value in the dataframe. For example

 Row │ month     sensor1
     │ String    Float64
─────┼───────────────────
   1 │ january       2.1
   2 │ february      2.4
   3 │ march         5.1

Once again, I don’t want to overwrite the original numerical data in the month column: [1, 2, 3]. I just want to be able to display the category key rather than the numerical values.


There isn’t a good package for this. The package ReadStatTables.jl has a LabelledArray type that implements Stata-like labeling (which is basically what you are asking for). But it’s not available on its own. Someone needs to split that type out into its own package so people can use it without the full ReadStatTables.jl dependency.
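
For readers unfamiliar with the idea, here is a minimal sketch of what value labeling amounts to, written with Base Julia only. It is not the LabelledArray implementation from ReadStatTables.jl; the names `LabeledValue`, `code`, and `month_labels` are hypothetical and exist only for illustration.

```julia
# Hypothetical wrapper (not the ReadStatTables.jl type): it stores the
# original code together with a shared lookup table of labels.
struct LabeledValue{T}
    code::T
    labels::Dict{T,String}
end

# Show the label when one exists; fall back to the raw code otherwise.
Base.show(io::IO, x::LabeledValue) =
    print(io, get(x.labels, x.code, string(x.code)))

# Recover the untouched numeric encoding.
code(x::LabeledValue) = x.code

month_labels = Dict(1 => "january", 2 => "february", 3 => "march")
months = [LabeledValue(c, month_labels) for c in [1, 2, 3]]

shown = [sprint(show, m) for m in months]  # text used for display
codes = code.(months)                      # original values, unchanged
```

The stored data stays numeric and only the printed form changes. Arithmetic and comparisons on such a wrapper would need extra methods, which is part of why a dedicated package would be the better home for this.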

Thanks @pdeffebach . Hmm interesting. Okay, so I can’t do this for now? I am not very familiar with the ReadStatTables.jl package, so forgive me. So you are saying that I would need a LabelledArray type, and that I can’t use that type of array in a DataFrame right now?

I’m saying a type exists with the behavior you want, but it isn’t in its own package, and I don’t think it would be worth taking on the big dependency for a simple array type.

It’s not like you “can’t” use it in a data frame, it’s just that it wouldn’t be worth it.

Okay, I understand. That makes sense. Thanks @pdeffebach . I will post to the DataFrames.jl repo, and see if there is any plan to incorporate a feature like this into their package. I will mention LabelledArray too, in case they want to look at the implementation.

No. That’s not quite right. DataFrames.jl is agnostic as to array types that can be used in DataFrames. You can use CategoricalArrays, PooledArrays, etc. with no loss of features.

If you are to do anything, file an issue in ReadStatTables.jl about moving LabelledArray into a separate package.
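
To illustrate the point about array types, here is a small sketch that stores a CategoricalArrays.jl vector as a DataFrame column. It assumes the original codes are consecutive integers starting at 1, because `levelcode` returns the position of a value in the level order rather than an arbitrary stored code. The variable name `month_names` is made up for the example.

```julia
using DataFrames, CategoricalArrays

# Assumption: code k in the source data corresponds to month_names[k].
month_names = ["january", "february", "march"]

df = DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])

# Replace the integer column with a categorical column of labels.
df.month = categorical(month_names[df.month])

# Levels default to sorted order, so reorder them to match the codes.
levels!(df.month, month_names)

# The integer codes are recovered from the level order.
recovered = levelcode.(df.month)
```

With this setup the data frame prints the month names while `levelcode` still yields 1, 2, 3. For encodings that are not consecutive from 1, a labeling type like the one discussed above would be a closer fit.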

Ahh, okay. Thanks for putting up with my confusion. I will open an issue in the ReadStatTables.jl repo.

Couldn’t this just be implemented as a format function in Pretty Tables though?

Maybe? Wouldn’t be useful for interactive work though.

@nilshg can you explain your idea a bit. The point is that the text should be visible when the user is looking at the data. So if I do first(df, 5) or something, I would like to display the category text value in each field–if that field is indeed categorical. I don’t want to change the underlying numerical values. So could that work for your idea?

I mean something like this (adapted from your integer/month example, but hopefully the extension to categorical arrays etc is obvious):

julia> using DataFrames, Dates

julia> df = DataFrame(month = rand(1:12, 5), sensor1 = rand(5))
5×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │     3  0.236443
   2 │     2  0.520037
   3 │     6  0.177478
   4 │     1  0.199209
   5 │     3  0.43456

julia> DataFrames._pretty_tables_general_formatter(v::Int, i::Int, j::Int) = Dates.format.(Date(0):Month(1):Date(0, 12), "u")[v]

julia> df
5×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │   Mar  0.236443
   2 │   Feb  0.520037
   3 │   Jun  0.177478
   4 │   Jan  0.199209
   5 │   Mar  0.43456

julia> first(df, 3)
3×2 DataFrame
 Row │ month  sensor1
     │ Int64  Float64
─────┼─────────────────
   1 │   Mar  0.236443
   2 │   Feb  0.520037
   3 │   Jun  0.177478

This relies on internals but I’m using this a lot for e.g. pretty printing of large numbers (with thousand separators) etc.
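
As a rough sketch of the thousand-separator case, a formatter with the same `(v, i, j)` shape as the one above could look like this. The names `with_thousands` and `thousands_formatter` are hypothetical, and the sketch leaves out how the formatter gets registered, since that depends on the DataFrames.jl and PrettyTables.jl versions in use.

```julia
# Hypothetical helper: render an integer with comma thousand separators.
function with_thousands(n::Integer)
    s = string(abs(n))
    parts = String[]
    # Peel off groups of three digits from the right.
    while length(s) > 3
        pushfirst!(parts, s[end-2:end])
        s = s[1:end-3]
    end
    pushfirst!(parts, s)
    grouped = join(parts, ",")
    return n < 0 ? "-" * grouped : grouped
end

# Formatter in the (value, row, column) shape: integers are reformatted,
# everything else passes through. Bool is excluded because it is a
# subtype of Integer in Julia.
thousands_formatter(v, i, j) =
    (v isa Integer && !(v isa Bool)) ? with_thousands(v) : v
```

Passing non-integer values through untouched keeps the formatter from interfering with other columns, which matters here because a formatter defined for every `Int` applies to all integer columns in every data frame.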

I think I’ve had some discussion with @bkamins and @Ronis_BR at some point to make this functionality public but I can’t remember where we ended up.


An alternative that also supports it is InMemoryDatasets.jl (GitHub: sl-solution/InMemoryDatasets.jl), a multithreaded package for working with tabular data in Julia.

@bkamins thanks for responding. I was looking at the docs on InMemoryDatasets, but I was not clear on a couple of things. First, since the package ostensibly keeps the data in-memory, is there a limit on the size of the data that the user can load? I am just asking because the data that we are looking at is USA Census data, so some users might have rather large datasets. We are still trying to figure out whether saving the data in a local database and querying it is a good option, etc. But is memory a genuine concern with this package?

Second, do all of the usual filtering tools/minilanguage and Tidy packages work with InMemoryDatasets.jl? The package seems new so of course it won’t be as complete as DataFrames.jl, but I was just wondering if having the same minilanguage and Tidy packages working is planned or anticipated.

Thanks again for the suggestion Bogumil.

I am not maintaining this package and I do not use it, I just know of its existence so I thought I would give the information about it here. Probably the maintainers can help you with your questions if you raise an issue on its GitHub repository. I hope this helps.
