Transform operation using two or more columns in a DataFrame

Hi I am wondering how to use transform!() or transform!() + ByRow() functions to do computations utilizing two or more columns

Essentially I want to perform the the function below however using transform!()

data = Dict(
    :age => Array{Int64}([0,1,2,3,4,5,6,7,8,9,10,11]),
    :female_pop => Array{Float64}([48283,47195,48140,50017,49189,49543,49951,49758,49728,49909,49051,49233]),
    :female_deaths => Array{Int64}([136,8,3,5,2,6,6,2,2,3,2,1])
)

df = DataFrame(data)

# The conditional probability of death
qₓ(Pₓ,Dₓ) = Dₓ ./ (Pₓ .+ 0.5 * Dₓ) 

df[:,:cond_prob] = qₓ(df.female_pop, df.female_deaths) 

I have tried this using the following code but I get an error

transform!(
    df,
    [:female_pop, :female_deaths] => ByRow(
        x -> x.female_deaths / (x.female_pop + 0.5 * x.female_deaths)
    ) => :cond_prob2
)

For this I get an MethodError. ANy ideas how I go about this?

Use either of the two below:

julia> transform(
           df,
           AsTable([:female_pop, :female_deaths]) => ByRow(
               x -> x.female_deaths / (x.female_pop + 0.5 * x.female_deaths)
           ) => :cond_prob2
       )
12×5 DataFrame
 Row │ age    female_deaths  female_pop  cond_prob    cond_prob2
     │ Int64  Int64          Float64     Float64      Float64
─────┼────────────────────────────────────────────────────────────
   1 │     0            136     48283.0  0.00281276   0.00281276
   2 │     1              8     47195.0  0.000169495  0.000169495
   3 │     2              3     48140.0  6.23163e-5   6.23163e-5
   4 │     3              5     50017.0  9.9961e-5    9.9961e-5
   5 │     4              2     49189.0  4.06587e-5   4.06587e-5
   6 │     5              6     49543.0  0.0001211    0.0001211
   7 │     6              6     49951.0  0.000120111  0.000120111
   8 │     7              2     49758.0  4.01937e-5   4.01937e-5
   9 │     8              2     49728.0  4.0218e-5    4.0218e-5
  10 │     9              3     49909.0  6.01076e-5   6.01076e-5
  11 │    10              2     49051.0  4.07731e-5   4.07731e-5
  12 │    11              1     49233.0  2.03114e-5   2.03114e-5

julia> transform(
           df,
           [:female_pop, :female_deaths] => ByRow(
               (x, y) -> y / (x + 0.5 * y)
           ) => :cond_prob2
       )
12×5 DataFrame
 Row │ age    female_deaths  female_pop  cond_prob    cond_prob2
     │ Int64  Int64          Float64     Float64      Float64
─────┼────────────────────────────────────────────────────────────
   1 │     0            136     48283.0  0.00281276   0.00281276
   2 │     1              8     47195.0  0.000169495  0.000169495
   3 │     2              3     48140.0  6.23163e-5   6.23163e-5
   4 │     3              5     50017.0  9.9961e-5    9.9961e-5
   5 │     4              2     49189.0  4.06587e-5   4.06587e-5
   6 │     5              6     49543.0  0.0001211    0.0001211
   7 │     6              6     49951.0  0.000120111  0.000120111
   8 │     7              2     49758.0  4.01937e-5   4.01937e-5
   9 │     8              2     49728.0  4.0218e-5    4.0218e-5
  10 │     9              3     49909.0  6.01076e-5   6.01076e-5
  11 │    10              2     49051.0  4.07731e-5   4.07731e-5
  12 │    11              1     49233.0  2.03114e-5   2.03114e-5

With DataFramesMeta.jl it would be:

julia> @rtransform(
           df,
           :cond_prob2 =   :female_deaths / (:female_pop + 0.5 * :female_deaths)
       )
12×5 DataFrame
 Row │ age    female_deaths  female_pop  cond_prob    cond_prob2
     │ Int64  Int64          Float64     Float64      Float64
─────┼────────────────────────────────────────────────────────────
   1 │     0            136     48283.0  0.00281276   0.00281276
   2 │     1              8     47195.0  0.000169495  0.000169495
   3 │     2              3     48140.0  6.23163e-5   6.23163e-5
   4 │     3              5     50017.0  9.9961e-5    9.9961e-5
   5 │     4              2     49189.0  4.06587e-5   4.06587e-5
   6 │     5              6     49543.0  0.0001211    0.0001211
   7 │     6              6     49951.0  0.000120111  0.000120111
   8 │     7              2     49758.0  4.01937e-5   4.01937e-5
   9 │     8              2     49728.0  4.0218e-5    4.0218e-5
  10 │     9              3     49909.0  6.01076e-5   6.01076e-5
  11 │    10              2     49051.0  4.07731e-5   4.07731e-5
  12 │    11              1     49233.0  2.03114e-5   2.03114e-5
1 Like

Is there anything wrong with this (imho clearer) base DataFrames version:

transform!(df, [:female_pop, :female_deaths] => ((p, d) -> qₓ(p, d)) => :cond_prop)
1 Like

It is OK. Also OK is:

transform(df, [:female_pop, :female_deaths] => qₓ => :cond_prop2)

or

transform(df, [:female_pop, :female_deaths] => ByRow(qₓ) => :cond_prop2)

but I thought OP wanted to understand how to use anonymous functions.

2 Likes

Oh wow, I didn’t know that the first one works, i.e. that multi-arg functions essentially get called as f(vec_of_colnames...) (if I understand correctly)

You understand correctly :).

And if you do AsTable(cols) => fun then fun gets only one positional argument that is a NamedTuple of selected columns. See also Any and all reduction of rows in DataFrames.jl | Blog by Bogumił Kamiński for examples of some more advanced uses of AsTable in source columns specification.

Ah, I knew I should have read your new post when I saw it on Friday!