Adding calculated columns to dataframe

Hello Folks,

I have a DF that constructed as follows:

DF = DataFrame(ID = 1:10, Col1 = rand(15:0.01:40,10), Col2 = rand(30:0.01:55,10), 
Col3 = rand(20:0.01:65,10)...)

I am attempting to do the following:

  1. Add columns that calculate Min, Max, IQR, Mean, Median, Variance, Std. across the columns. I know I must use (using Statistics) and apply the methods, but would like to
    add the calculated columns to the existing DF.

Something like

insertcols!(DF, 4, :Min => min.(DF[:.1:3]))

  1. Is there anyway to plot mean values for each row with the Min and Max as upper and lower boundaries on the same plot?

Thank you,

julia> df
10Γ—3 DataFrame
 Row β”‚ x1         x2        x3       
     β”‚ Float64    Float64   Float64  
─────┼───────────────────────────────
   1 β”‚ 0.63686    0.181631  0.860774
   2 β”‚ 0.441555   0.811134  0.254945
   3 β”‚ 0.396693   0.676834  0.20931
   4 β”‚ 0.96424    0.267536  0.962784
   5 β”‚ 0.288845   0.822987  0.803151
   6 β”‚ 0.193365   0.886698  0.972179
   7 β”‚ 0.992748   0.636317  0.527519
   8 β”‚ 0.427211   0.473838  0.847242
   9 β”‚ 0.0215964  0.742059  0.32778
  10 β”‚ 0.396853   0.201746  0.506313

julia> transform!(df, AsTable(:) => ByRow(maximum))
10Γ—4 DataFrame
 Row β”‚ x1         x2        x3        x1_x2_x3_maximum 
     β”‚ Float64    Float64   Float64   Float64          
─────┼─────────────────────────────────────────────────
   1 β”‚ 0.63686    0.181631  0.860774          0.860774
   2 β”‚ 0.441555   0.811134  0.254945          0.811134
   3 β”‚ 0.396693   0.676834  0.20931           0.676834
   4 β”‚ 0.96424    0.267536  0.962784          0.96424
   5 β”‚ 0.288845   0.822987  0.803151          0.822987
   6 β”‚ 0.193365   0.886698  0.972179          0.972179
   7 β”‚ 0.992748   0.636317  0.527519          0.992748
   8 β”‚ 0.427211   0.473838  0.847242          0.847242
   9 β”‚ 0.0215964  0.742059  0.32778           0.742059
  10 β”‚ 0.396853   0.201746  0.506313          0.506313

But I want to stress that, this is anti-columnar so you might want to use DataFrameMetas.jl to speed things up. Basically, if you need everything on each row, you’re doing row-based analysis.

Thank you.

Am wondering – if I use DataFrameMeta.jl do I need to qualify anything in the line you provided or the system automatically references the package that will execute optimally?

no, you need to re-write it using things from DataFramesMeta

Okay will explore a little, working with small set – so performance should not be major issue with the instruction you provided.

Quick follow-up,

Do you know how I could round the calculated column in your example?

I am testing:

transform!(UnitsPer2, AsTable(:) => round.(ByRow(std), digits=3) => :StdDev)

and attempted:

round.(transform!(UnitsPer2, AsTable(:) => ByRow(std) => :StdDev, digits =3))

Any suggestions? Is all I need to complete this exploration.

the stuff inside ByRow is a function that will be passed the entire row as the argument. I’m guessing you need

ByRow(row -> round(std(row); digits=3))

This is currently AsTable(:) => ByRow(maximum) is currently not something you can do in DataFramesMeta.jl.

I’m advising to do a for loop to fill columns and put those columns back to the original dataframe. the for-loop part meta can help.

1 Like

To your earlier point,

The ByRow functions seems to work on all row
elements, instead of the column subset of interest.
I suppose I could fill AsTable(2:4), to specify where
to perform the row-wise operation, yes?

1 Like

Applying your method, for rounding as

transform!(DF, AsTable(2:4) => ByRow(row -> round.(var(row); digits =3)) => :Var)

Works well~

Yes – I now see. The formulations I put together are:

transform!(DF, AsTable(2:4) => ByRow(minimum) => :Min)
transform!(DF, AsTable(2:4) => ByRow(maximum) => :Max)
transform!(DF, AsTable(2:4) => 
	       ByRow(row -> round.(var(row); digits =3)) => :Var)
transform!(DF, AsTable(2:4) => 
	       ByRow(row -> round.(std(row); digits =3)) => :StdDev)

Perhaps you were thinking of something more streamlined?

You can simplify it like this:

using DataFrames
using Statistics

DF = DataFrame(ID = 1:10, Col1 = rand(15:0.01:40,10),
                          Col2 = rand(30:0.01:55,10),
                          Col3 = rand(20:0.01:65,10))

round3(x) = round(x, digits=3)

transform(DF, AsTable(2:4) .=>
              ByRow.([minimum, maximum, round3∘var, round3∘std]) .=>
              [:Min, :Max, :Var, :StdDev])

# Output
10Γ—8 DataFrame
 Row β”‚ ID     Col1     Col2     Col3     Min      Max      Var      StdDev  
     β”‚ Int64  Float64  Float64  Float64  Float64  Float64  Float64  Float64 
─────┼──────────────────────────────────────────────────────────────────────
   1 β”‚     1    16.83    40.6     41.94    16.83    41.94  199.553   14.126
   2 β”‚     2    39.26    51.72    42.11    39.26    51.72   42.621    6.528
   3 β”‚     3    16.8     46.58    36.69    16.8     46.58  230.045   15.167
   4 β”‚     4    31.01    48.33    46.44    31.01    48.33   90.273    9.501
   5 β”‚     5    35.92    41.29    39.67    35.92    41.29    7.587    2.755
   6 β”‚     6    23.63    54.91    29.67    23.63    54.91  275.33    16.593
   7 β”‚     7    38.69    38.05    47.62    38.05    47.62   28.623    5.35
   8 β”‚     8    32.24    38.45    30.99    30.99    38.45   15.963    3.995
   9 β”‚     9    36.48    35.7     24.4     24.4     36.48   45.704    6.76
  10 β”‚    10    15.17    49.39    35.62    15.17    49.39  296.471   17.218
1 Like

Thank you for sharing this, in my example I wanted to make
a deep copy over-write so I included (!). Also, wanted to show
a second :StDev column but *2. I attempted:

round3(x) = round(x, digits=3)
transform!(DF, AsTable(2:4) .=>
              ByRow.([minimum, maximum, mean, round3∘var, 
				      round3∘std ]) .=>
              [:Min, :Max, :Var, :Mean, :StdDev, :StdDev2])
transform!(DF, AsTable(2:4) => 
           ByRow(row -> (round.(std(row); digits =3)) .*2) 
		   => :StdDev2)

Any tips?

I think the broadcast function would be better here, as our colleague has expressed. For loops from what I read are not good design features for larger code, though we are working with small tables here.

the opposite, when your data doesn’t fit in RAM, for-loop sometimes is the only reasonable choice that is also flexible enough (i.e. not .this().that().compute()

Thank you.

Could you point me to a resource?

Try to understand how broadcasting works in this code. For that it’s super helpful to execute the code piece by piece.

AsTable(2:4) .=>
ByRow.([minimum, maximum, round3∘var, round3∘std]) .=>
[:Min, :Max, :Var, :StdDev]

The first line has a scalar and the second line makes a vector of 4 β€œByRow” functions:

julia> ByRow.([minimum, maximum, round3∘var, round3∘std]) 
4-element Vector{ByRow}:
 (::ByRow{typeof(minimum)}) (generic function with 2 methods)
 (::ByRow{typeof(maximum)}) (generic function with 2 methods)
 (::ByRow{ComposedFunction{typeof(round3), typeof(var)}}) (generic function with 2 methods)
 (::ByRow{ComposedFunction{typeof(round3), typeof(std)}}) (generic function with 2 methods)

and the third line is a vector of 4 names. So by broadcasting => we are simply creating a vector of 4 β€œpairs of pairs”. See what we get if we execute the whole expression given as parameter to transform!:

julia> AsTable(2:4) .=>
       ByRow.([minimum, maximum, round3∘var, round3∘std]) .=>
       [:Min, :Max, :Var, :StdDev]

4-element Vector{Pair{AsTable}}:
 AsTable(2:4) => (ByRow{typeof(minimum)}(minimum) => :Min)
 AsTable(2:4) => (ByRow{typeof(maximum)}(maximum) => :Max)
 AsTable(2:4) => (ByRow{ComposedFunction{typeof(round3), typeof(var)}}(round3 ∘ Statistics.var) => :Var)
 AsTable(2:4) => (ByRow{ComposedFunction{typeof(round3), typeof(std)}}(round3 ∘ Statistics.std) => :StdDev)

So to make it work just make sure the β€œByRow” vector and the names vector have the same size, and the order of β€œByRow” functions corresponds to the names (in your code you have the mean function before var but the names in the reverse order):

transform!(DF, AsTable(2:4) .=>
               ByRow.([minimum, maximum, mean, round3∘var, round3∘std, round3∘(x->2x)∘std]) .=>
               [:Min, :Max, :Mean, :Var, :StdDev, :StdDev2])
2 Likes

Thank you very much for your thorough explanation.

Prior to reviewing it, I was able to catch the symbol
reversal and made the correction.

However, what you said provided needed context.

Best,

1 Like