Dataframe unstack() column order

Given this dataframe holding counts of a product (actually HPC compute partition) per week:

julia> describe(df_agg)
3×7 DataFrame
 Row │ variable  mean     min     median  max          nmissing  eltype   
     │ Symbol    Union…   Any     Union…  Any          Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────
   1 │ product            cbuild          thin_course         0  String15
   2 │ week      26.826   1       28.0    48                  0  Int32
   3 │ count     3715.95  1       53.0    102000              0  Int64

julia> df_agg
684×3 DataFrame
 Row │ product      week   count 
     │ String15     Int32  Int64 
─────┼───────────────────────────
   1 │ cbuild           1     26
   2 │ cbuild           2      2
   3 │ cbuild           4      7
   4 │ cbuild           5     31
   5 │ cbuild           6     68
   6 │ cbuild           7     14
   7 │ cbuild          10      5
   8 │ cbuild          11      9
   9 │ cbuild          12      7
  10 │ cbuild          13     12
  11 │ cbuild          14     18
  12 │ cbuild          15      7
  13 │ cbuild          16      1
  14 │ cbuild          17      9
  15 │ cbuild          18     45
  16 │ cbuild          19      2
  17 │ cbuild          20      1
  18 │ cbuild          21      5
  19 │ cbuild          22      6
  20 │ cbuild          23      2
  21 │ cbuild          25     54
  22 │ cbuild          26     11
  ⋮  │      ⋮         ⋮      ⋮
 664 │ thin_course     19      1
 665 │ thin_course     20    198
 666 │ thin_course     25      3
 667 │ thin_course     27      3
 668 │ thin_course     30      1
 669 │ thin_course     31      3
 670 │ thin_course     32     13
 671 │ thin_course     35    127
 672 │ thin_course     36    947
 673 │ thin_course     37    766
 674 │ thin_course     38    389
 675 │ thin_course     39    413
 676 │ thin_course     40    488
 677 │ thin_course     41    488
 678 │ thin_course     42    371
 679 │ thin_course     43    464
 680 │ thin_course     44    480
 681 │ thin_course     45    448
 682 │ thin_course     46    588
 683 │ thin_course     47   1402
 684 │ thin_course     48   1755
                 641 rows omitted

I’m surprised to see unstack() producing columns that are almost sorted by not quite:

julia> df = unstack(df_agg, :product, :week, :count)
22×49 DataFrame
 Row │ product         1        2        4        5        6        7        10       11       12       13       14       15       16       17       18       19       20       21       22       23       25     ⋯
     │ String15        Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64?   Int64? ⋯
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ cbuild               26        2        7       31       68       14        5        9        7       12       18        7        1        9       45        2        1        5        6        2       5 ⋯
   2 │ course          missing  missing  missing  missing  missing  missing       25       12  missing       17        3        1  missing  missing  missing  missing  missing  missing  missing  missing  missin
   3 │ fat                1368     3710    11012     2884     2591     1878     2380     4090     2710     2206     3380     2142     1564     1144     1358     2428     3207     1584     1647     1808     109
   4 │ fat_genoa       missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin
   5 │ fat_rome        missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin ⋯
   6 │ genoa           missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin
   7 │ gpu                3813    10054     4721     4137     7734     5209     5642     9605     8313    10997     4648     6707     7428    10615     4735     8705    10304     8384     7590     8189    1020
   8 │ gpu_course      missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing      521        4  missing        4       1
   9 │ gpu_mig         missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin ⋯
  10 │ gpu_sw               51       25       10  missing       20       30  missing        8       25       23       19        8       10  missing        7      132       19       36       32       51       5
  11 │ gpu_test              3       14  missing  missing       70       42      103       22       32       75       93      100       61      285       46       65      111       59       55      101
  12 │ gpu_vis               4        8       29        5       24       18       12        3        4        5        2       12       19       19       12       13       11       27       43       17       3
  13 │ himem_4tb            14      163       15       22       87       14       15       14       13       29       13       21       29       25        9       48       19       25       56       83       4 ⋯
  14 │ himem_8tb            14       39        1       14       16       14       17       12       12       26       12       12       37       12        2       30       28       16       35       29       2
  15 │ rome            missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin
  16 │ staging              46      274      570      198      191      498      203      199      290       65       88      147      581      224      276      892      960      576      928      519      22
  17 │ staging_course  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing       15       51        7        1        7         ⋯
  18 │ sw                   45       23        9        6       19       27        3       23       23       20       16       14       15        3        9       63       18       42       41       49       2
  19 │ sw_genoa        missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missin
  20 │ test            missing        4       37       17       25        4       24       44       40       23       86      134      131       31        2       15       52       23  missing        3       2
  21 │ thin              24239    25004    22318    18389    30568    35804    34150    24772    22519    17566    28003    35909    23188    16849    13813    24210    28992    19076    24878    46839    1824 ⋯
  22 │ thin_course     missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing        1      198  missing  missing  missing 
                                                                                                                                                                                                 28 columns omitted

julia> names(df)
49-element Vector{String}:
 "product"
 "1"
 "2"
 "4"
 "5"
 "6"
 "7"
 "10"
 "11"
 "12"
 "13"
 "14"
 "15"
 "16"
 "17"
 "18"
 "19"
 "20"
 "21"
 "22"
 "23"
 "25"
 "26"
 "27"
 ⋮
 "30"
 "31"
 "32"
 "33"
 "34"
 "35"
 "36"
 "37"
 "38"
 "39"
 "40"
 "41"
 "42"
 "43"
 "44"
 "45"
 "46"
 "47"
 "48"
 "8"
 "9"
 "3"
 "24"

In fact, I only noticed the column order (week) was off after my plots of the data showed weird time series, as the printed dataframe above looks fine for the columns visible on my screen.

Is there no guarantee on the column order produced by unstack()? I don’t see a relevant keyword mentioned in the docs, nor any issue in the DataFrames.jl repo, so maybe I was just making an incorrect assumption about the output order.

I guess it repeats whatever order it finds in the week column? E.g.:

julia> df = DataFrame(product = ["cbuild", "cbuild", "cbuild", "course", "course", "course"],
                      week = [1, 3, 4, 1, 2, 3],
                      count = [10, 10, 10, 10, 10, 10]
                      )
6×3 DataFrame
 Row │ product  week   count
     │ String   Int64  Int64
─────┼───────────────────────
   1 │ cbuild       1     10
   2 │ cbuild       3     10
   3 │ cbuild       4     10
   4 │ course       1     10
   5 │ course       2     10
   6 │ course       3     10

julia> unstack(df, :product, :week, :count)
2×5 DataFrame
 Row │ product  1       3       4        2
     │ String   Int64?  Int64?  Int64?   Int64?
─────┼───────────────────────────────────────────
   1 │ cbuild       10      10       10  missing
   2 │ course       10      10  missing       10

Okay, so it builds up the column list as it goes, avoiding duplicate columns. So I got somewhat lucky in that almost all weeks are listed for the first product, but not fully. Thanks!

Looks like it, the easiest thing is probably just a

select!(df_wide, :product, sort(names(df_wide, Not(:product))))

to get what you want.

1 Like

That’s a more compact sorting step than I came up with, thanks :slight_smile: Needs a by=x -> parse(Int, x) as the week numbers have turned into strings for the column names.

The PR adding what you ask for is add sortrows and sortcols to unstack by bkamins · Pull Request #3395 · JuliaData/DataFrames.jl · GitHub

2 Likes