How do I create a new dataframe column by dividing individual elements in two dataframe columns?

I have the following dataframe:

[ Info: Displaying top ten rows of the pij_df dataframe...
┌ Info: 10×5 DataFrame
│  Row │ MONTH  TOPIC_I    TOPIC_J   JOINT_PROB_SUM  DOC_COUNT_SUM
│      │ Int64  String15   String15  Float64         Int64
│ ─────┼───────────────────────────────────────────────────────────
│    1 │    12  TOPIC_153  TOPIC_87     0.0380672              979
│    2 │    12  TOPIC_81   TOPIC_87     0.0182519              979
│    3 │    12  TOPIC_249  TOPIC_87     0.0161693              979
│    4 │    12  TOPIC_124  TOPIC_87     0.00660719             979
│    5 │    12  TOPIC_140  TOPIC_87     0.000891694            979
│    6 │    12  TOPIC_101  TOPIC_87     0.00134154             979
│    7 │    12  TOPIC_89   TOPIC_87     0.0784224              979
│    8 │    12  TOPIC_233  TOPIC_87     0.0195678              979
│    9 │    12  TOPIC_144  TOPIC_87     0.0150135              979
└   10 │    12  TOPIC_201  TOPIC_87     0.00740799             979
[ Info: The dimensions for the pij_df: (16812500, 5)

I am trying to generate a new column “PROB_I_J” from the columns “JOINT_PROB_SUM” and “DOC_COUNT_SUM”. Each row in “PROB_I_J” should be the value for “JOINT_PROB_SUM” divided by the “DOC_COUNT_SUM”.

I am using the following DataFramesMeta macro:

@transform!(pij_df, :PROB_I_J = :JOINT_PROB_SUM / :DOC_COUNT_SUM)

I am receiving an OutOfMemoryError() when I run the transform macro. I know the dataframe is quite large, but I suspect there is something wrong with the macro call itself. Is the code above doing what I want it to do? I’ve generally used DataFrames and DataFramesMeta to manipulate data, but I’m wondering if this is less efficient…

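Do either

@rtransform!(pij_df, :PROB_I_J = :JOINT_PROB_SUM / :DOC_COUNT_SUM)

which applies the expression row by row, or

@transform!(pij_df, :PROB_I_J = :JOINT_PROB_SUM ./ :DOC_COUNT_SUM)

which works on whole columns and broadcasts the division with ./ so that it is applied element by element.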

This works! @bkamins can you elaborate on what my original code was doing? I’ve reviewed the docs a few times and it isn’t clear to me what happens in the original transform. Is it performing the calculation by pairing the first element of the first column with every single element of the second column?

Without any macros:

pij_df.PROB_I_J = pij_df.JOINT_PROB_SUM ./ pij_df.DOC_COUNT_SUM
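
As a quick sanity check, the three variants suggested in this thread can be compared on a tiny data frame. This is only a sketch: the data frame df and its values are made up for illustration, and the output column names A, B and C are hypothetical stand-ins for PROB_I_J.

```julia
using DataFrames, DataFramesMeta

# Toy stand-in for pij_df; the values are invented for illustration only.
df = DataFrame(JOINT_PROB_SUM = [0.5, 0.25, 0.125],
               DOC_COUNT_SUM  = [10, 10, 5])

# Row-wise macro: the expression is evaluated once per row, so plain / is scalar division.
@rtransform!(df, :A = :JOINT_PROB_SUM / :DOC_COUNT_SUM)

# Column-wise macro: the symbols refer to whole columns, so the division must be broadcast.
@transform!(df, :B = :JOINT_PROB_SUM ./ :DOC_COUNT_SUM)

# No macros: broadcast division on the column vectors directly.
df.C = df.JOINT_PROB_SUM ./ df.DOC_COUNT_SUM

# All three columns should hold the same element-wise quotients.
println(df.A == df.B == df.C)
```

All three approaches add one Float64 column with as many rows as the data frame, so none of them should need more memory than one extra column.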

Your original code is creating a matrix, e.g.:

julia> [1, 2, 3] / [4, 5, 6]
3×3 Matrix{Float64}:
 0.0519481  0.0649351  0.0779221
 0.103896   0.12987    0.155844
 0.155844   0.194805   0.233766

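which cannot be stored in a column of a data frame. With 16,812,500 rows, the division would have to produce a 16,812,500×16,812,500 matrix, which is most likely where the OutOfMemoryError() comes from.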
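
To make the difference concrete, here is a small sketch in plain Julia (no packages) contrasting / with ./ on two short vectors, followed by a back-of-the-envelope estimate of how large a dense Float64 matrix would be for the row count reported in the question. The variable names are illustrative only, and the estimate assumes the result would be stored as a dense matrix.

```julia
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

m = a / b    # right division of two vectors: returns a 3×3 Matrix
v = a ./ b   # broadcast division: returns a 3-element Vector

println(size(m))   # dimensions of the matrix result
println(size(v))   # dimensions of the element-wise result

# Rough size of an n×n dense Float64 matrix for the row count in the question.
n = 16_812_500
bytes = n^2 * sizeof(Float64)
println(bytes / 1e15, " PB (approximately)")
```

The element-wise result has the same length as the inputs, while the matrix result grows with the square of the number of rows, which is why the unbroadcast version is not feasible at this scale.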
