DataFrames: is it possible to group data horizontally (multilevel)?

Dear community,
suppose we have a simple dataframe

df = DataFrame(Cat1 = ["A", "A", "B", "B"],
               Cat2 = ["C", "D", "C", "D"],
               val1 = 1:4,
               val2 = 5:8)

which looks like

julia> df
4×4 DataFrame
│ Row │ Cat1   │ Cat2   │ val1  │ val2  │
│     │ String │ String │ Int64 │ Int64 │
├─────┼────────┼────────┼───────┼───────┤
│ 1   │ A      │ C      │ 1     │ 5     │
│ 2   │ A      │ D      │ 2     │ 6     │
│ 3   │ B      │ C      │ 3     │ 7     │
│ 4   │ B      │ D      │ 4     │ 8     │

Is it possible to group the df by Cat1 horizontally such that we obtain something like:

               |       A       |       B
│ Row │ Cat2   │ val1  │ val2  │ val1  │ val2  │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┼───────┼───────┤
│ 1   │ C      │ 1     │ 5     │ 1     │ 5     │
│ 2   │ D      │ 2     │ 6     │ 2     │ 6     │
│ 3   │ C      │ 3     │ 7     │ 3     │ 7     │
│ 4   │ D      │ 4     │ 8     │ 4     │ 8     │

Thank you in advance!!

No, that’s not supported. Have a look at IndexedTables’s NDSparse type instead.

Can you clarify what you are trying to do with the data in this way? I don’t really understand your second data frame

               |       A       |       B
│ Row │ Cat2   │ val1  │ val2  │ val1  │ val2  │
│     │ String │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┼───────┼───────┤
│ 1   │ C      │ 1     │ 5     │ 1     │ 5     │
│ 2   │ D      │ 2     │ 6     │ 2     │ 6     │
│ 3   │ C      │ 3     │ 7     │ 3     │ 7     │
│ 4   │ D      │ 4     │ 8     │ 4     │ 8     │

The last two rows in the A :val2 column don’t make sense because :val2 never equals 7 or 8 on an "A" observation.

Sorry, this example is maybe oversimplified. In my real application I have a large dataframe with a lot of computational results which depend on several parameters. I obtain these results with different computational models which I like to compare. So I group by parameters and models, e.g.,

aggdf = by(df, [:model, :param1, :param2, :param3],
            nopt = :opt => sum,
            gap  = :gap => mean,
            rt   = :rt => mean,
            )

This works fine, however, the resulting dataframe lists the results of each model vertically. Since I want to figure out which model performs best for a specific parameter combination, it would be much nicer for me to compare the horizontally (and I would also create a table like that for a publication with multirow/multicol in Latex).
Clearly, I could split the dataframe accordingly and append them next to each other. But in this case I would also need to rename the columns, e.g., nopt1, gap1,…

Thank you, I will give it a try. However, I think this would be a nice feature also for DataFrames.

Ah I understand.

I think the best method is just to have :A_param1, :B_param1 etc. in your dataframes. Other alternatives could be to make a custom struct which holds all your parameters and have each observation in that column be one of those.

I see your point about exporting to latex. There isn’t a good replacement for something as complicated as R’s ggtable which would allow grouped columns like that.

Thinking about adding this feature to DataFrames would probably cause a lot of headaches in an already complicated indexing story, so groupings like this would probably have to be done at the latex printing stage, in another package.

I see,… thank you.

I was hoping that there is an easier option (although I found nothing in the docs), since I know that at least with pandas such kind of grouping works fine with unstack().

I can see the appeal. DataFrames is currently looking for a better API for stack / melt as evidenced by this issue.

2 Likes

Shameless self-promotion here, but if you’re looking for a nice way to generate grouped summary stats from Dataframes and print them out to latex, I have a package called TexTables that does this already works fairly well for this sort of stuff. There’s a function called summarize_by that might do what you want right out of the box, or it’s fairly easy to use it to construct the tables of summary statistics programatically to look exactly how you want them.

2 Likes

Thank you, this looks promising at the first glance. I will give it a try.

If you wind up trying it let me know if there’s anything that it should do that it doesn’t, or if you run into bugs. I’ve been fixing things to fit my use cases over time, but it’s always useful to get feedback from other people.

1 Like