How to group by multiple data types?

Hello all,

I’m a bit new to the Julia programming language and haven’t been able to find an answer that solves my problem. I have the following dataframe:

10×5 DataFrame
 Row │ DATE        TOPIC_I    TOPIC_J   JOINT_PROB       DOC_COUNT
     │ Date        String15   String15  Any              Any
─────┼────────────────────────────────────────────────────────────
   1 │ 2000-09-01  TOPIC_153  TOPIC_87  0.03806723138    979
   2 │ 2000-09-01  TOPIC_81   TOPIC_87  0.01825187194    979
   3 │ 2000-09-01  TOPIC_249  TOPIC_87  0.01616933848    979
   4 │ 2000-09-01  TOPIC_124  TOPIC_87  0.006607188145   979
   5 │ 2000-09-01  TOPIC_140  TOPIC_87  0.0008916937195  979
   6 │ 2000-09-01  TOPIC_101  TOPIC_87  0.001341542903   979
   7 │ 2000-09-01  TOPIC_89   TOPIC_87  0.07842244991    979
   8 │ 2000-09-01  TOPIC_233  TOPIC_87  0.01956784903    979
   9 │ 2000-09-01  TOPIC_144  TOPIC_87  0.01501348474    979
  10 │ 2000-09-01  TOPIC_201  TOPIC_87  0.007407990334   979

I am trying to group by the DATE and TOPIC_I columns, sum the JOINT_PROB column, and take the average of the DOC_COUNT column within each group. I have implemented the code below:

# Convert the joint probabilities column and document count column to the correct types.
stuff = [typeof(x) for x in probabilities_data[!, :JOINT_PROB]]
println(unique(stuff))

probabilities_data[!, :JOINT_PROB] = [typeof(x) == String ? tryparse(Float64,x) : x for x in probabilities_data[!, :JOINT_PROB]]

stuff = [typeof(x) for x in probabilities_data[!, :JOINT_PROB]]
println(unique(stuff))

p_i_group = groupby(probabilities_data, [:DATE, :TOPIC_I])
pi_df = combine(p_i_group, :TOPIC_J => sum => :PROB_I)

I continue to get the following error related to the last line of code above:

TaskFailedException:
MethodError: no method matching +(::String15, ::String15)

As far as I can tell my syntax is correct. Can someone help me find what I am missing?

Those are string columns… perhaps you mean to be summing the probabilities?

My apologies. You are correct. I want to sum the “JOINT_PROB” column. I updated the question to reflect this.

Your JOINT_PROB and DOC_COUNT columns both have eltype Any, which signals a risk that they do not contain only numbers. If they do contain only numbers, then the following will work:

using Statistics
p_i_group = groupby(probabilities_data, [:DATE, :TOPIC_I])
pi_df = combine(p_i_group, :JOINT_PROB => sum => :PROB_I, :DOC_COUNT => mean => :MEAN_COUNT)

The easiest way to check whether a column contains only numbers is to run, e.g., float.(probabilities_data.JOINT_PROB). If this errors, it means you have some bad data in that column.
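
If the check does error, something like the following might help locate and convert the offending entries before grouping. This is only a sketch on made-up toy data: the values, the string-typed DATE column, and the helper name to_float are assumptions for illustration, not taken from your actual data.

```julia
using DataFrames, Statistics

# Toy data (hypothetical values). DATE is kept as a plain string for brevity.
# JOINT_PROB is a Vector{Any} in which one entry was read in as a string.
df = DataFrame(
    DATE = ["2000-09-01", "2000-09-01", "2000-09-01"],
    TOPIC_I = ["TOPIC_153", "TOPIC_153", "TOPIC_81"],
    JOINT_PROB = Any[0.25, "0.5", 0.125],
    DOC_COUNT = Any[979, 979, 979],
)

# Indices of rows whose JOINT_PROB is not already a number.
bad_rows = findall(x -> !(x isa Number), df.JOINT_PROB)

# Hypothetical helper: parse strings to Float64, turning unparseable
# strings into `missing` instead of throwing an error.
to_float(x) = x isa AbstractString ? something(tryparse(Float64, x), missing) : Float64(x)

# Replace the Any columns with concretely typed ones.
df.JOINT_PROB = to_float.(df.JOINT_PROB)
df.DOC_COUNT = float.(df.DOC_COUNT)

# Same grouping and aggregation as above, now on numeric columns.
grouped = groupby(df, [:DATE, :TOPIC_I])
result = combine(grouped, :JOINT_PROB => sum => :PROB_I, :DOC_COUNT => mean => :MEAN_COUNT)
```

Testing with `x isa AbstractString` rather than `typeof(x) == String` also catches inline string types such as String15. Note that if any entry ends up as `missing`, `sum` will return `missing` for that group unless you wrap the column with `skipmissing`.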

It seems there was an issue with the input data. After fixing that and changing the summed column to “JOINT_PROB” based on your and @tbeason’s feedback, the code is working. Thank you both for your help!