Data manipulation using dataframes

mjanun · March 25, 2021, 9:45pm

Assume that a dataframe is given as:

df_a = DataFrame(Categ=["Alpa","Beta","Gamma","Omega","Alpa","Beta","Gamma","Omega"],
T1 =rand(8),T2=rand(8),T3 =rand(8),T4 =rand(8),T5 =rand(8),T6 =rand(8),T7 =rand(8),T8 =rand(8),T9 =rand(8),T10 =rand(8),T11 =rand(8),T12 =rand(8))

8×13 DataFrame. Omitted printing of 4 columns
│ Row │ Categ  │ T1        │ T2       │ T3        │ T4       │ T5        │ T6          │ T7        │ T8        │
│     │ String │ Float64   │ Float64  │ Float64   │ Float64  │ Float64   │ Float64     │ Float64   │ Float64   │
├─────┼────────┼───────────┼──────────┼───────────┼──────────┼───────────┼─────────────┼───────────┼───────────┤
│ 1   │ Alpa   │ 0.673424  │ 0.786929 │ 0.0893057 │ 0.568696 │ 0.145572  │ 0.653375    │ 0.21898   │ 0.361053  │
│ 2   │ Beta   │ 0.0538812 │ 0.360114 │ 0.699175  │ 0.323722 │ 0.161717  │ 0.000336349 │ 0.542039  │ 0.848293  │
│ 3   │ Gamma  │ 0.333272  │ 0.631906 │ 0.275669  │ 0.957596 │ 0.428401  │ 0.641091    │ 0.256033  │ 0.574052  │
│ 4   │ Omega  │ 0.423221  │ 0.346942 │ 0.817672  │ 0.761629 │ 0.861805  │ 0.0104372   │ 0.0955495 │ 0.300919  │
│ 5   │ Alpa   │ 0.360478  │ 0.123437 │ 0.0864159 │ 0.486619 │ 0.240639  │ 0.791122    │ 0.157081  │ 0.0159331 │
│ 6   │ Beta   │ 0.853104  │ 0.913458 │ 0.135472  │ 0.387773 │ 0.190978  │ 0.542373    │ 0.0581868 │ 0.339313  │
│ 7   │ Gamma  │ 0.699581  │ 0.356128 │ 0.477584  │ 0.565736 │ 0.338906  │ 0.625697    │ 0.845379  │ 0.487111  │
│ 8   │ Omega  │ 0.144818  │ 0.451992 │ 0.212729  │ 0.990052 │ 0.0205443 │ 0.00236285  │ 0.341749  │ 0.120135  │

I am looking to add rows based on column Categ. For this, I do the following:

names_original = names(df_a,Not("Categ"))
df_b = combine(groupby(df_a, "Categ"), names(df_a, names_original) .=> sum)

4×13 DataFrame. Omitted printing of 3 columns
│ Row │ Categ  │ T1_sum   │ T2_sum   │ T3_sum   │ T4_sum   │ T5_sum   │ T6_sum   │ T7_sum   │ T8_sum   │ T9_sum   │
│     │ String │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ Alpa   │ 1.0339   │ 0.910366 │ 0.175722 │ 1.05532  │ 0.38621  │ 1.4445   │ 0.37606  │ 0.376986 │ 0.674029 │
│ 2   │ Beta   │ 0.906985 │ 1.27357  │ 0.834647 │ 0.711494 │ 0.352695 │ 0.542709 │ 0.600225 │ 1.18761  │ 1.34625  │
│ 3   │ Gamma  │ 1.03285  │ 0.988034 │ 0.753253 │ 1.52333  │ 0.767307 │ 1.26679  │ 1.10141  │ 1.06116  │ 1.13612  │
│ 4   │ Omega  │ 0.568039 │ 0.798934 │ 1.0304   │ 1.75168  │ 0.882349 │ 0.0128   │ 0.437299 │ 0.421054 │ 0.937023 │

Then I need to transpose this dataframe

dfTranspose =  DataFrame([[names(df_b)]; collect.(eachrow(df_b))], [:column; Symbol.(axes(df_b, 1))])

 Row │ column  │ 1        │ 2        │ 3        │ 4        │
│     │ Any     │ Any      │ Any      │ Any      │ Any      │
├─────┼─────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ Categ   │ Alpa     │ Beta     │ Gamma    │ Omega    │
│ 2   │ T1_sum  │ 1.0339   │ 0.906985 │ 1.03285  │ 0.568039 │
│ 3   │ T2_sum  │ 0.910366 │ 1.27357  │ 0.988034 │ 0.798934 │
│ 4   │ T3_sum  │ 0.175722 │ 0.834647 │ 0.753253 │ 1.0304   │
│ 5   │ T4_sum  │ 1.05532  │ 0.711494 │ 1.52333  │ 1.75168  │
│ 6   │ T5_sum  │ 0.38621  │ 0.352695 │ 0.767307 │ 0.882349 │
│ 7   │ T6_sum  │ 1.4445   │ 0.542709 │ 1.26679  │ 0.0128   │
│ 8   │ T7_sum  │ 0.37606  │ 0.600225 │ 1.10141  │ 0.437299 │
│ 9   │ T8_sum  │ 0.376986 │ 1.18761  │ 1.06116  │ 0.421054 │
│ 10  │ T9_sum  │ 0.674029 │ 1.34625  │ 1.13612  │ 0.937023 │
│ 11  │ T10_sum │ 1.18364  │ 0.917098 │ 1.18644  │ 1.40851  │
│ 12  │ T11_sum │ 1.05572  │ 0.476368 │ 1.40192  │ 0.77439  │
│ 13  │ T12_sum │ 0.519706 │ 1.01044  │ 1.11613  │ 1.50852  │

Annoyingly, the first row contains Categ which should be in headers and suffix “sum” is added to T. I am looking to:

Produce a dataframe similar to the above but the entries in first row should be moved to column names, i.e. column names should be alpha, beta, gamma, omega.
Add another column (called Total) to the dataframe that is calculated as: dfTranspose.Alpha + dfTranspose.Omega - dfTranspose.Beta

Any ideas?

sijo · March 25, 2021, 10:32pm

What about this:

using DataFrames

df_a = DataFrame(Categ=["Alpha", "Beta", "Gamma", "Omega", "Alpha", "Beta", "Gamma", "Omega"],
       T1=rand(8), T2=rand(8), T3=rand(8), T4=rand(8), T5=rand(8), T6=rand(8), T7=rand(8), T8=rand(8), T9=rand(8), T10=rand(8), T11=rand(8), T12=rand(8))

names_original = names(df_a,Not("Categ"))

df_b = combine(groupby(df_a, "Categ"), names_original .=> sum, renamecols=false)

dfTranspose = DataFrame(Array(df_b[:, 2:end])', df_b.Categ)

dfTranspose.Total = dfTranspose.Alpha + dfTranspose.Omega - dfTranspose.Beta

julia> dfTranspose
12×5 DataFrame
 Row │ Alpha      Beta      Gamma     Omega     Total      
     │ Float64    Float64   Float64   Float64   Float64    
─────┼─────────────────────────────────────────────────────
   1 │ 1.34795    0.550981  1.13917   1.53287    2.32984
   2 │ 1.39466    1.92737   0.950801  0.518323  -0.0143839
   3 │ 1.17082    1.36842   1.08143   1.31744    1.11985
   4 │ 1.49688    1.22459   1.42681   0.949872   1.22217
   5 │ 0.034162   1.70191   1.59025   1.14753   -0.520226
   6 │ 1.70969    1.16556   1.01212   0.640301   1.18443
   7 │ 0.0644133  1.19946   1.81257   1.62543    0.490385
   8 │ 0.865458   1.72887   1.07256   1.11104    0.24763
   9 │ 0.977002   0.747054  1.16972   1.04381    1.27376
  10 │ 0.461294   1.35444   0.963941  0.42828   -0.464863
  11 │ 0.341654   1.0712    0.980314  1.32837    0.598828
  12 │ 1.78162    1.67029   1.1538    1.83945    1.95079

pdeffebach · March 26, 2021, 12:01am

You want

permutedims(df, 1, "Total")

mjanun · March 26, 2021, 12:14am

Thank you

mjanun · March 26, 2021, 12:28am

Thanks. permutedims looks like a new addition to DataFrames.

Would you know if there is a way of selecting all columns between two columns (excluding those two columns)? e.g.,
names(df_b,Between(:T1,:T6)) returns all names T1, T2,…T6, but I only want T2,…T5 (i.e. not including T1 and T6). The way I get around this is to use columnindex to find the column number and then use Between
names(df_b,Between(columnindex(df_b,:T1)+1,columnindex(df_b,:T6)-1))

pdeffebach · March 26, 2021, 3:15am

No, I don’t think there is a way to do that kind of exclusive Between. Your solution sounds reasonable.

sijo · March 26, 2021, 8:20am

You could also use names(df_a, Between(:T1, :T12))[2:end-1].

mjanun · March 27, 2021, 1:07pm

Thank you

Topic		Replies	Views
How to add rows in a DataFrame Data dataframes	5	490	December 3, 2020
Recommended way to select multiple DataFrame columns parametrically, i.e. not by the names New to Julia dataframes	9	2400	January 8, 2022
Broadcast transformed data from single row to multiple columns General Usage dataframes , dataframesmeta	13	571	December 7, 2022
Efficient way to add column to dataframe computed from prior columns New to Julia question	6	2309	August 12, 2021
Combine(Merge) Columns New to Julia dataframes	2	2559	February 28, 2021

Data manipulation using dataframes

Related topics