DataFrames: is it possible to group data horizontally (multilevel)?

mike_k · April 17, 2019, 2:01pm

Dear community,
suppose we have a simple dataframe

df = DataFrame(Cat1 = ["A", "A", "B", "B"],
               Cat2 = ["C", "D", "C", "D"],
               val1 = 1:4,
               val2 = 5:8)

which looks like

julia> df
4×4 DataFrame
│ Row │ Cat1   │ Cat2   │ val1  │ val2  │
│     │ String │ String │ Int64 │ Int64 │
├─────┼────────┼────────┼───────┼───────┤
│ 1   │ A      │ C      │ 1     │ 5     │
│ 2   │ A      │ D      │ 2     │ 6     │
│ 3   │ B      │ C      │ 3     │ 7     │
│ 4   │ B      │ D      │ 4     │ 8     │

Is it possible to group the df by Cat1 horizontally such that we obtain something like:

               |       A       |       B
│ Row │ Cat2   │ val1  │ val2  │ val1  │ val2  │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┼───────┼───────┤
│ 1   │ C      │ 1     │ 5     │ 1     │ 5     │
│ 2   │ D      │ 2     │ 6     │ 2     │ 6     │
│ 3   │ C      │ 3     │ 7     │ 3     │ 7     │
│ 4   │ D      │ 4     │ 8     │ 4     │ 8     │

Thank you in advance!!

nalimilan · April 17, 2019, 2:50pm

No, that’s not supported. Have a look at IndexedTables’s NDSparse type instead.

pdeffebach · April 17, 2019, 2:59pm

Can you clarify what you are trying to do with the data in this way? I don’t really understand your second data frame

               |       A       |       B
│ Row │ Cat2   │ val1  │ val2  │ val1  │ val2  │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┼───────┼───────┤
│ 1   │ C      │ 1     │ 5     │ 1     │ 5     │
│ 2   │ D      │ 2     │ 6     │ 2     │ 6     │
│ 3   │ C      │ 3     │ 7     │ 3     │ 7     │
│ 4   │ D      │ 4     │ 8     │ 4     │ 8     │

The last two rows in the A :val2 column don’t make sense because :val2 never equals 7 or 8 on an "A" observation.

mike_k · April 17, 2019, 3:18pm

Sorry, this example is maybe oversimplified. In my real application I have a large dataframe with a lot of computational results which depend on several parameters. I obtain these results with different computational models which I like to compare. So I group by parameters and models, e.g.,

aggdf = by(df, [:model, :param1, :param2, :param3],
            nopt = :opt => sum,
            gap  = :gap => mean,
            rt   = :rt => mean,
            )

This works fine, however, the resulting dataframe lists the results of each model vertically. Since I want to figure out which model performs best for a specific parameter combination, it would be much nicer for me to compare the horizontally (and I would also create a table like that for a publication with multirow/multicol in Latex).
Clearly, I could split the dataframe accordingly and append them next to each other. But in this case I would also need to rename the columns, e.g., nopt1, gap1,…

mike_k · April 17, 2019, 3:19pm

Thank you, I will give it a try. However, I think this would be a nice feature also for DataFrames.

pdeffebach · April 17, 2019, 3:24pm

Ah I understand.

I think the best method is just to have :A_param1, :B_param1 etc. in your dataframes. Other alternatives could be to make a custom struct which holds all your parameters and have each observation in that column be one of those.

I see your point about exporting to latex. There isn’t a good replacement for something as complicated as R’s ggtable which would allow grouped columns like that.

Thinking about adding this feature to DataFrames would probably cause a lot of headaches in an already complicated indexing story, so groupings like this would probably have to be done at the latex printing stage, in another package.

mike_k · April 17, 2019, 3:33pm

I see,… thank you.

I was hoping that there is an easier option (although I found nothing in the docs), since I know that at least with pandas such kind of grouping works fine with unstack().

pdeffebach · April 17, 2019, 3:36pm

I can see the appeal. DataFrames is currently looking for a better API for stack / melt as evidenced by this issue.

jacobadenbaum · April 17, 2019, 4:13pm

Shameless self-promotion here, but if you’re looking for a nice way to generate grouped summary stats from Dataframes and print them out to latex, I have a package called TexTables that does this already works fairly well for this sort of stuff. There’s a function called summarize_by that might do what you want right out of the box, or it’s fairly easy to use it to construct the tables of summary statistics programatically to look exactly how you want them.

mike_k · April 17, 2019, 4:53pm

Thank you, this looks promising at the first glance. I will give it a try.

jacobadenbaum · April 17, 2019, 5:11pm

If you wind up trying it let me know if there’s anything that it should do that it doesn’t, or if you run into bugs. I’ve been fixing things to fit my use cases over time, but it’s always useful to get feedback from other people.

Topic		Replies	Views
Multilevel DataFrame New to Julia dataframes , data_structures , structure	8	849	September 6, 2022
Grouping by values in either of two columns Data question	13	783	April 14, 2024
DataFrame Groupby New to Julia dataframes	2	2148	April 26, 2018
DataFrame group by first column, and sort by last column Data dataframes	4	1887	October 16, 2018
New DataFrame whose columns are values of a column and grouped by another column General Usage dataframes	2	328	May 14, 2021

DataFrames: is it possible to group data horizontally (multilevel)?

Related topics