Data manipulation using dataframes

Assume that a dataframe is given as:

df_a = DataFrame(Categ=["Alpa","Beta","Gamma","Omega","Alpa","Beta","Gamma","Omega"],
T1 =rand(8),T2=rand(8),T3 =rand(8),T4 =rand(8),T5 =rand(8),T6 =rand(8),T7 =rand(8),T8 =rand(8),T9 =rand(8),T10 =rand(8),T11 =rand(8),T12 =rand(8))
8×13 DataFrame. Omitted printing of 4 columns
│ Row │ Categ  │ T1        │ T2       │ T3        │ T4       │ T5        │ T6          │ T7        │ T8        │
│     │ String │ Float64   │ Float64  │ Float64   │ Float64  │ Float64   │ Float64     │ Float64   │ Float64   │
├─────┼────────┼───────────┼──────────┼───────────┼──────────┼───────────┼─────────────┼───────────┼───────────┤
│ 1   │ Alpa   │ 0.673424  │ 0.786929 │ 0.0893057 │ 0.568696 │ 0.145572  │ 0.653375    │ 0.21898   │ 0.361053  │
│ 2   │ Beta   │ 0.0538812 │ 0.360114 │ 0.699175  │ 0.323722 │ 0.161717  │ 0.000336349 │ 0.542039  │ 0.848293  │
│ 3   │ Gamma  │ 0.333272  │ 0.631906 │ 0.275669  │ 0.957596 │ 0.428401  │ 0.641091    │ 0.256033  │ 0.574052  │
│ 4   │ Omega  │ 0.423221  │ 0.346942 │ 0.817672  │ 0.761629 │ 0.861805  │ 0.0104372   │ 0.0955495 │ 0.300919  │
│ 5   │ Alpa   │ 0.360478  │ 0.123437 │ 0.0864159 │ 0.486619 │ 0.240639  │ 0.791122    │ 0.157081  │ 0.0159331 │
│ 6   │ Beta   │ 0.853104  │ 0.913458 │ 0.135472  │ 0.387773 │ 0.190978  │ 0.542373    │ 0.0581868 │ 0.339313  │
│ 7   │ Gamma  │ 0.699581  │ 0.356128 │ 0.477584  │ 0.565736 │ 0.338906  │ 0.625697    │ 0.845379  │ 0.487111  │
│ 8   │ Omega  │ 0.144818  │ 0.451992 │ 0.212729  │ 0.990052 │ 0.0205443 │ 0.00236285  │ 0.341749  │ 0.120135  │

I am looking to add rows based on column Categ. For this, I do the following:

names_original = names(df_a,Not("Categ"))
df_b = combine(groupby(df_a, "Categ"), names(df_a, names_original) .=> sum)
4×13 DataFrame. Omitted printing of 3 columns
│ Row │ Categ  │ T1_sum   │ T2_sum   │ T3_sum   │ T4_sum   │ T5_sum   │ T6_sum   │ T7_sum   │ T8_sum   │ T9_sum   │
│     │ String │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ Alpa   │ 1.0339   │ 0.910366 │ 0.175722 │ 1.05532  │ 0.38621  │ 1.4445   │ 0.37606  │ 0.376986 │ 0.674029 │
│ 2   │ Beta   │ 0.906985 │ 1.27357  │ 0.834647 │ 0.711494 │ 0.352695 │ 0.542709 │ 0.600225 │ 1.18761  │ 1.34625  │
│ 3   │ Gamma  │ 1.03285  │ 0.988034 │ 0.753253 │ 1.52333  │ 0.767307 │ 1.26679  │ 1.10141  │ 1.06116  │ 1.13612  │
│ 4   │ Omega  │ 0.568039 │ 0.798934 │ 1.0304   │ 1.75168  │ 0.882349 │ 0.0128   │ 0.437299 │ 0.421054 │ 0.937023 │

Then I need to transpose this dataframe

dfTranspose =  DataFrame([[names(df_b)]; collect.(eachrow(df_b))], [:column; Symbol.(axes(df_b, 1))])
 Row │ column  │ 1        │ 2        │ 3        │ 4        │
│     │ Any     │ Any      │ Any      │ Any      │ Any      │
├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ Categ   │ Alpa     │ Beta     │ Gamma    │ Omega    │
│ 2   │ T1_sum  │ 1.0339   │ 0.906985 │ 1.03285  │ 0.568039 │
│ 3   │ T2_sum  │ 0.910366 │ 1.27357  │ 0.988034 │ 0.798934 │
│ 4   │ T3_sum  │ 0.175722 │ 0.834647 │ 0.753253 │ 1.0304   │
│ 5   │ T4_sum  │ 1.05532  │ 0.711494 │ 1.52333  │ 1.75168  │
│ 6   │ T5_sum  │ 0.38621  │ 0.352695 │ 0.767307 │ 0.882349 │
│ 7   │ T6_sum  │ 1.4445   │ 0.542709 │ 1.26679  │ 0.0128   │
│ 8   │ T7_sum  │ 0.37606  │ 0.600225 │ 1.10141  │ 0.437299 │
│ 9   │ T8_sum  │ 0.376986 │ 1.18761  │ 1.06116  │ 0.421054 │
│ 10  │ T9_sum  │ 0.674029 │ 1.34625  │ 1.13612  │ 0.937023 │
│ 11  │ T10_sum │ 1.18364  │ 0.917098 │ 1.18644  │ 1.40851  │
│ 12  │ T11_sum │ 1.05572  │ 0.476368 │ 1.40192  │ 0.77439  │
│ 13  │ T12_sum │ 0.519706 │ 1.01044  │ 1.11613  │ 1.50852  │

Annoyingly, the first row contains Categ which should be in headers and suffix “sum” is added to T. I am looking to:

  1. Produce a dataframe similar to the above but the entries in first row should be moved to column names, i.e. column names should be alpha, beta, gamma, omega.
  2. Add another column (called Total) to the dataframe that is calculated as: dfTranspose.Alpha + dfTranspose.Omega - dfTranspose.Beta

Any ideas?

What about this:

using DataFrames

df_a = DataFrame(Categ=["Alpha", "Beta", "Gamma", "Omega", "Alpha", "Beta", "Gamma", "Omega"],
       T1=rand(8), T2=rand(8), T3=rand(8), T4=rand(8), T5=rand(8), T6=rand(8), T7=rand(8), T8=rand(8), T9=rand(8), T10=rand(8), T11=rand(8), T12=rand(8))

names_original = names(df_a,Not("Categ"))

df_b = combine(groupby(df_a, "Categ"), names_original .=> sum, renamecols=false)

dfTranspose = DataFrame(Array(df_b[:, 2:end])', df_b.Categ)

dfTranspose.Total = dfTranspose.Alpha + dfTranspose.Omega - dfTranspose.Beta

julia> dfTranspose
12×5 DataFrame
 Row │ Alpha      Beta      Gamma     Omega     Total      
     │ Float64    Float64   Float64   Float64   Float64    
─────┼─────────────────────────────────────────────────────
   1 │ 1.34795    0.550981  1.13917   1.53287    2.32984
   2 │ 1.39466    1.92737   0.950801  0.518323  -0.0143839
   3 │ 1.17082    1.36842   1.08143   1.31744    1.11985
   4 │ 1.49688    1.22459   1.42681   0.949872   1.22217
   5 │ 0.034162   1.70191   1.59025   1.14753   -0.520226
   6 │ 1.70969    1.16556   1.01212   0.640301   1.18443
   7 │ 0.0644133  1.19946   1.81257   1.62543    0.490385
   8 │ 0.865458   1.72887   1.07256   1.11104    0.24763
   9 │ 0.977002   0.747054  1.16972   1.04381    1.27376
  10 │ 0.461294   1.35444   0.963941  0.42828   -0.464863
  11 │ 0.341654   1.0712    0.980314  1.32837    0.598828
  12 │ 1.78162    1.67029   1.1538    1.83945    1.95079

You want

permutedims(df, 1, "Total")
1 Like

Thank you

Thanks. permutedims looks like a new addition to DataFrames.

Would you know if there is a way of selecting all columns between two columns (excluding those two columns)? e.g.,
names(df_b,Between(:T1,:T6)) returns all names T1, T2,…T6, but I only want T2,…T5 (i.e. not including T1 and T6). The way I get around this is to use columnindex to find the column number and then use Between
names(df_b,Between(columnindex(df_b,:T1)+1,columnindex(df_b,:T6)-1))

No, I don’t think there is a way to do that kind of exclusive Between. Your solution sounds reasonable.

You could also use names(df_a, Between(:T1, :T12))[2:end-1].

Thank you