I have the following dataframe:
[ Info: Displaying top ten rows of the pij_df dataframe...
┌ Info: 10×5 DataFrame
│  Row │ MONTH  TOPIC_I    TOPIC_J   JOINT_PROB_SUM  DOC_COUNT_SUM
│      │ Int64  String15   String15  Float64         Int64
│ ─────┼──────────────────────────────────────────────────────────
│    1 │    12  TOPIC_153  TOPIC_87       0.0380672            979
│    2 │    12  TOPIC_81   TOPIC_87       0.0182519            979
│    3 │    12  TOPIC_249  TOPIC_87       0.0161693            979
│    4 │    12  TOPIC_124  TOPIC_87      0.00660719            979
│    5 │    12  TOPIC_140  TOPIC_87     0.000891694            979
│    6 │    12  TOPIC_101  TOPIC_87      0.00134154            979
│    7 │    12  TOPIC_89   TOPIC_87       0.0784224            979
│    8 │    12  TOPIC_233  TOPIC_87       0.0195678            979
│    9 │    12  TOPIC_144  TOPIC_87       0.0150135            979
└   10 │    12  TOPIC_201  TOPIC_87      0.00740799            979
[ Info: The dimensions for the pij_df: (16812500, 5)
I am trying to generate a new column “PROB_I_J” from the columns “JOINT_PROB_SUM” and “DOC_COUNT_SUM”. Each row of “PROB_I_J” should be that row's “JOINT_PROB_SUM” divided by its “DOC_COUNT_SUM”.
I am using the following DataFramesMeta macro:
@transform!(pij_df, :PROB_I_J = :JOINT_PROB_SUM / :DOC_COUNT_SUM)
I get an OutOfMemoryError() when running the transform macro. I know the dataframe is quite large (about 16.8 million rows), but I suspect something is wrong with the transform call itself. Is the code above doing what I want it to do? I’ve generally used DataFrames and DataFramesMeta to manipulate data, but I’m wondering whether this approach is less efficient…
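
For reference, here is a minimal sketch of the intended row-wise computation on a small toy dataframe. The three-row `toy_df` and its values are made up for illustration and are not taken from `pij_df`; the sketch assumes `DataFrames` and `DataFramesMeta` are installed. One thing worth checking: in Julia, `/` applied to two whole vectors is matrix right-division rather than element-wise division, so for two length-n columns it tries to build an n×n result, which could plausibly explain the memory error at this size. The broadcast form `./` or the row-wise macro `@rtransform!` gives one quotient per row:

```julia
using DataFrames, DataFramesMeta

# Toy stand-in for pij_df (illustrative values only)
toy_df = DataFrame(
    JOINT_PROB_SUM = [0.5, 1.0, 3.0],
    DOC_COUNT_SUM  = [1, 2, 4],
)

# Broadcast division: operates on whole columns, element by element
@transform!(toy_df, :PROB_I_J = :JOINT_PROB_SUM ./ :DOC_COUNT_SUM)

# Row-wise form: inside @rtransform! the symbols refer to scalars,
# so plain `/` is ordinary scalar division here
@rtransform!(toy_df, :PROB_I_J_ROWWISE = :JOINT_PROB_SUM / :DOC_COUNT_SUM)
```

Both forms add a single `Float64` column with the same number of rows as the dataframe, so the extra memory should be roughly one column's worth.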