How to group by multiple data types?

Hello all,

I’m a bit new to the Julia programming language and haven’t been able to find an answer that solves my problem. I have the following dataframe:

10×5 DataFrame
 Row │ DATE        TOPIC_I    TOPIC_J   JOINT_PROB       DOC_COUNT
     │ Date        String15   String15  Any              Any
─────┼────────────────────────────────────────────────────────────
   1 │ 2000-09-01  TOPIC_153  TOPIC_87  0.03806723138    979
   2 │ 2000-09-01  TOPIC_81   TOPIC_87  0.01825187194    979
   3 │ 2000-09-01  TOPIC_249  TOPIC_87  0.01616933848    979
   4 │ 2000-09-01  TOPIC_124  TOPIC_87  0.006607188145   979
   5 │ 2000-09-01  TOPIC_140  TOPIC_87  0.0008916937195  979
   6 │ 2000-09-01  TOPIC_101  TOPIC_87  0.001341542903   979
   7 │ 2000-09-01  TOPIC_89   TOPIC_87  0.07842244991    979
   8 │ 2000-09-01  TOPIC_233  TOPIC_87  0.01956784903    979
   9 │ 2000-09-01  TOPIC_144  TOPIC_87  0.01501348474    979
  10 │ 2000-09-01  TOPIC_201  TOPIC_87  0.007407990334   979

I am trying to group by the DATE and TOPIC_I columns, sum the JOINT_PROB column, and take the average of the DOC_COUNT column. I have implemented the code below:

# Convert the joint probabilities column and document count column to the correct types.
stuff = [typeof(x) for x in probabilities_data[!, :JOINT_PROB]]
println(unique(stuff))

probabilities_data[!, :JOINT_PROB] = [typeof(x) == String ? tryparse(Float64,x) : x for x in probabilities_data[!, :JOINT_PROB]]

stuff = [typeof(x) for x in probabilities_data[!, :JOINT_PROB]]
println(unique(stuff))

p_i_group = groupby(probabilities_data, [:DATE, :TOPIC_I])
pi_df = combine(p_i_group, :TOPIC_J => sum => :PROB_I)

I continue to get the following error related to the last line of code above:

TaskFailedException:
MethodError: no method matching +(::String15, ::String15)

As far as I can tell my syntax is correct. Can someone help me find what I am missing?

Those are string columns… perhaps you mean to be summing the probabilities?


My apologies. You are correct. I want to sum the “JOINT_PROB” column. I updated the question to reflect this.


Your JOINT_PROB and DOC_COUNT columns both have eltype Any, which signals a risk that they do not contain only numbers. If they do contain only numbers, then the following will work:

using Statistics
p_i_group = groupby(probabilities_data, [:DATE, :TOPIC_I])
pi_df = combine(p_i_group, :JOINT_PROB => sum => :PROB_I, :DOC_COUNT => mean => :MEAN_COUNT)
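
To illustrate what this call produces, here is a minimal sketch on made-up data. The `toy` table and its values are assumptions for illustration only, not the original `probabilities_data`, and the columns already have concrete numeric types:

```julia
using DataFrames, Statistics, Dates

# Toy data, invented for illustration; not the original dataset.
toy = DataFrame(
    DATE = fill(Date(2000, 9, 1), 4),
    TOPIC_I = ["TOPIC_1", "TOPIC_1", "TOPIC_2", "TOPIC_2"],
    JOINT_PROB = [0.25, 0.5, 0.125, 0.125],
    DOC_COUNT = [10, 20, 30, 50],
)

# One group per (DATE, TOPIC_I) pair.
toy_groups = groupby(toy, [:DATE, :TOPIC_I])

# Sum the probabilities and average the document counts within each group.
toy_result = combine(
    toy_groups,
    :JOINT_PROB => sum => :PROB_I,
    :DOC_COUNT => mean => :MEAN_COUNT,
)
```

Each `source => function => target` pair produces one output column, so several aggregations can be passed to a single `combine` call.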

The easiest way to check whether a column contains only numbers is to run e.g. float.(probabilities_data.JOINT_PROB). If this throws an error, you have some bad data in that column.
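
If that check fails, something along these lines may help locate and convert the offending entries. This is only a sketch: the example column and the `to_float` helper are hypothetical, and turning unparseable values into `missing` is one possible policy, not the only one.

```julia
using DataFrames

# Hypothetical column of eltype Any that mixes numbers and strings.
df = DataFrame(JOINT_PROB = Any[0.5, "0.25", 0.125, "n/a"])

# Indices of the rows whose entry is not a number.
bad_rows = findall(x -> !(x isa Number), df.JOINT_PROB)

# Hypothetical helper: keep real numbers, parse strings,
# and map anything that cannot be parsed to `missing`.
function to_float(x)
    x isa Real && return Float64(x)
    x isa AbstractString && return something(tryparse(Float64, x), missing)
    return missing
end

# Replace the column with the converted values.
df.JOINT_PROB = to_float.(df.JOINT_PROB)
```

After the conversion the column has a concrete element type, and `skipmissing` can be used inside the aggregation if some entries could not be parsed.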


It seems there was an issue with the input data. After fixing that and changing the summed column to “JOINT_PROB”, as you and @tbeason suggested, the code appears to be working. Thank you both for your help!