I have a question about displaying categorical data in a dataframe. I want to see if there is a way to display the categorical key value, while preserving the categorical value in the data. The point is to help users working with the data more easily interpret the values instead of trying to guess about the numerical values.
Note that it is important to keep the original categorical values in the dataframe, meaning values like 1, 2, 3 for the months. Since those values come from the metadata for the source data, we want to use the original numerical encodings. So this question is really about convenience when displaying the information.
This display is fine, but having the user interpret month=2 as February makes it harder for a user to quickly interpret the data. Months are pretty obvious, but other categories may be less intuitive.
So I wanted to see if I could display the category key, while keeping the categorical value in the dataframe. For example:
 Row │ month     sensor1
     │ String    Float64
─────┼───────────────────
   1 │ january       2.1
   2 │ february      2.4
   3 │ march         5.1
Once again, I don't want to overwrite the original numerical data in the month column: [1, 2, 3]. I just want to be able to display the category key rather than the numerical values.
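One low-tech way to get this kind of display, sketched here under assumptions rather than taken from any package, is to keep the numeric codes in the data frame and build a throwaway labeled copy only when printing. The `month_labels` dictionary and the `with_labels` helper are hypothetical names; the mapping would normally come from the source metadata.

```julia
using DataFrames

# The original data keeps its numeric month codes.
df = DataFrame(month = [1, 2, 3], sensor1 = [2.1, 2.4, 5.1])

# Assumed code => label mapping (would come from the metadata in practice).
month_labels = Dict(1 => "january", 2 => "february", 3 => "march")

# Hypothetical helper: returns a copy of `df` in which column `col`
# holds the labels; `df` itself is not modified.
with_labels(df, col, labels) = transform(df, col => ByRow(c -> labels[c]) => col)

labeled = with_labels(df, :month, month_labels)

# Show the labeled copy; `df.month` still holds the codes.
first(labeled, 5)
```

The drawback is that the labeled copy is a separate object, so users have to remember to call the helper whenever they want to look at the data.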
There isn't a good package for this. The package ReadStatTables.jl has a LabelledArray type that implements Stata-like labeling (which is basically what you are asking for). But it's not available on its own. Someone needs to split that type out into its own package so people can use it without the full ReadStatTables.jl dependency.
Thanks @pdeffebach. Hmm, interesting. Okay, so I can't do this for now? I am not very familiar with the ReadStatTables.jl package, so forgive me. So you are saying that I would need a LabelledArray type, and that I can't use that type of array in a DataFrame right now?
I'm saying a type exists with the behavior you want, but it isn't in its own package, and I don't think it would be worth taking on the big dependency for a simple array type.
It's not like you "can't" use it in a data frame, it's just that it wouldn't be worth it.
Okay, I understand. That makes sense. Thanks @pdeffebach. I will post to the DataFrames.jl repo and see if there is any plan to incorporate a feature like this into their package. I will mention LabelledArrays too, in case they want to look at the implementation.
No. That's not quite right. DataFrames.jl is agnostic as to array types that can be used in DataFrames. You can use CategoricalArrays, PooledArrays, etc. with no loss of features.
If you are to do anything, file an issue in ReadStatTables.jl about moving LabelledArray into a separate package.
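Because a DataFrame column can be any vector, the behavior can also be approximated with a small element type that carries both the code and the label. This is only a sketch: `LabeledCode` is a hypothetical type written for illustration, not the ReadStatTables.jl implementation, and `month_labels` is an assumed mapping.

```julia
using DataFrames

# Hypothetical wrapper: stores the original numeric code plus a display label.
struct LabeledCode
    code::Int
    label::String
end

# Printing a value shows its label; the code stays available via the `code` field.
Base.show(io::IO, x::LabeledCode) = print(io, x.label)

# Assumed code => label mapping (would come from the metadata in practice).
month_labels = Dict(1 => "january", 2 => "february", 3 => "march")

codes = [1, 2, 3]
df = DataFrame(
    month = [LabeledCode(c, month_labels[c]) for c in codes],
    sensor1 = [2.1, 2.4, 5.1],
)

# The numeric encodings can be recovered at any time.
original_codes = [x.code for x in df.month]
```

A wrapper this simple does not compare equal to plain integers, so filters would need to go through `x.code`. A full labelled array type would presumably handle that kind of detail.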
@nilshg can you explain your idea a bit? The point is that the text should be visible when the user is looking at the data. So if I do first(df, 5) or something, I would like to display the category text value in each field, if that field is indeed categorical. I don't want to change the underlying numerical values. So could that work for your idea?
This relies on internals, but I'm using this a lot for e.g. pretty printing of large numbers (with thousand separators) etc.
I think I've had some discussion with @bkamins and @Ronis_BR at some point to make this functionality public, but I can't remember where we ended up.
@bkamins thanks for responding. I was looking at the docs on InMemoryDatasets, but I was not clear on a couple of things. First, since the package ostensibly keeps the data in memory, is there a limit on the size of the data that the user can load? I am asking because the data that we are looking at is USA Census data, so some users might have rather large datasets. We are still trying to figure out whether saving the data in a local database and querying it is a good option, etc. But is memory a genuine concern with this package?
Second, do all of the usual filtering tools/minilanguage and Tidy packages work with InMemoryDatasets.jl? The package seems new, so of course it won't be as complete as DataFrames.jl, but I was wondering if having the same minilanguage and Tidy packages working is planned or anticipated.
I am not maintaining this package and I do not use it, I just know of its existence so I thought I would give the information about it here. Probably the maintainers can help you with your questions if you raise an issue on its GitHub repository. I hope this helps.