Dataframes difficult joins

I am having difficulty in understanding on how to link up two dataframes to create another dataframe based on multiple criteria. Apologies, if the code below seem too long-most of it is to create the two dataframes in question.

I have two dataframes, namely df1 and df2. In df2, I have some numerical data that has column names as dates. I want to create another dataframe (df3) that will have all the data as in df2 but numerical data updated by applyling a formula. This should be based on:

  • columns RANK and LEVEl in the two dataframes matching up
  • GRADE in df2 corresponds to columns named β€œOne”, β€œTwo” and β€œThree” in df1
  • Formula is simply: the relevant factor in df1 (based on RANK, LEVEL, GRADE, Time matching up with df2)value in df2df_Time calculated (based on dates in df2)
  • if df_Time > 5, then it should apply the factor for Time 5 (as df1 only contains data up to Time 5). For example, for the data item in the last column and first row, this will give the right result
    0.678235743 470.935.498630 (where 5.49 is the time corresponding to the date in that column, 470.93 is the numerical data in df2 and 0.67 is the factor from df1)
    Similarly, if time in df2 is 3.5 then this requires apply the factor corresponding to time 4 in df1.
  • if VEHICLE in df2 is BIKE then the formula wouldn’t apply and the original data in df2 would go in df3

The code below generates the two dataframes, but I am not sure how to produce df3 based on the criteria above. Any help will be appreciated.

using DataFrames, Dates
function create_data()
    Time = repeat(collect(1:5),outer =4)
    CAT1= Vector{String}(undef,20)
    for i in 1:20
        i <= 10 ? CAT1[i] = "A" : CAT1[i] = "B"  
    end

    CAT2 =  Vector{String}(undef,20)
    for i in 1:20
        i <= 10 ? CAT2[i] = "EASY" : CAT2[i] = "HARD"
    end

    temp_data = rand(20,3)
    df1 = DataFrame(Time = Time,RANK=CAT1,LEVEL=CAT2,One=temp_data[1],Two=temp_data[2],Three=temp_data[3])

    Date_1 = Vector{Date}(undef,6)
    for i in 1:6
        Date_1[i] = Date(2020+i,06,30)
    end

    TYPE = Vector{String}(undef,20)
    for i in 1:20
        i <= 17 ? TYPE[i] = "CAR" : TYPE[i] = "BIKE"
    end
    GRADE = Vector{String}(undef,20)
    for i in 1:20
        i <= 10 ? GRADE[i] = "One" : GRADE[i] = "Two"
    end

    df2_Data= rand(20,6)*1000
    df2_temp = DataFrame(Symbol("30/06/2020") =>df2_Data[:,1],Symbol("30/06/2021") =>df2_Data[:,2],
    Symbol("30/06/2022") =>df2_Data[:,3],Symbol("30/06/2023") =>df2_Data[:,4],
    Symbol("30/06/2024") =>df2_Data[:,5],Symbol("30/06/2025") =>df2_Data[:,6])
    df2_temp2= DataFrame(ID=collect(1:20),GRADE=GRADE,RANK=CAT1,LEVEL=CAT2,VEHICLE=TYPE)
    df2 = hcat(df2_temp2,df2_temp)
return df1, df2
end


df1,df2= create_data()
function calc_time(df2)
# Calculate time as year fractions based on dates in df2
    dates_string = names(select(df2, Not(Between(:ID, :VEHICLE))))
    dates_formatted = Date.(dates_string, Dates.DateFormat("dd/mm/yyyy"))
    start_date = Date(2020,01,01)
    n_days = dates_formatted .- start_date
    df2_Time = Dates.value.(n_days) / 365

return df2_Time
end

    
df2_Time = calc_time(df2)
'''

Unfortunately, I’m unable to follow the part after β€œFormula is simply”. However, I think that I can help with the bits before that. From

I understand that you at least want to join GRADE and One, Two and Three. That is possible via

julia> df1_stacked = stack(df1, [:One, :Two, :Three], variable_name = :GRADE);

julia> first(df1_stacked, 5)
5Γ—5 DataFrame
 Row β”‚ Time   RANK    LEVEL   NUMBER  value
     β”‚ Int64  String  String  String  Float64
─────┼─────────────────────────────────────────
   1 β”‚     1  A       EASY    One     0.588641
   2 β”‚     2  A       EASY    One     0.588641
   3 β”‚     3  A       EASY    One     0.588641
   4 β”‚     4  A       EASY    One     0.588641
   5 β”‚     5  A       EASY    One     0.588641

Then, you can simply join those

julia> first(innerjoin(df1_stacked, df2; on = :GRADE, makeunique=true), 5)
5Γ—15 DataFrame
 Row β”‚ Time   RANK    LEVEL   GRADE   value     ID     RANK_1  LEVEL_1  VEHICLE  30/06/2020  30/06/2021  30/06/2022  30/06/2023  30/06/2024  30/06/2025
     β”‚ Int64  String  String  String  Float64   Int64  String  String   String   Float64     Float64     Float64     Float64     Float64     Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚     1  A       EASY    One     0.588641      1  A       EASY     CAR        212.189      493.931     363.206     317.697     581.577     573.171
   2 β”‚     1  A       EASY    One     0.588641      2  A       EASY     CAR        572.872      378.893     940.904     881.075     464.886     430.43
   3 β”‚     1  A       EASY    One     0.588641      3  A       EASY     CAR         55.1096     541.963     817.011     923.08      380.368     131.929
   4 β”‚     1  A       EASY    One     0.588641      4  A       EASY     CAR        512.541      297.032     344.694     153.382     164.165     655.449
   5 β”‚     1  A       EASY    One     0.588641      5  A       EASY     CAR        512.258      658.292     610.753     888.232     861.228     581.352

Could you explain what transformations you would like to see after this?

By the way, you can simplify a lot of the code by using list comprehensions. For example,

CAT1= Vector{String}(undef,20)
for i in 1:20
    i <= 10 ? CAT1[i] = "A" : CAT1[i] = "B"  
end

can be replaced by

CAT1 = [i <= 10 ? "A" : "B" for i in 1:20]
1 Like

Thank you @rikh
I just realised that the formula above is not clear.

The formula will apply as:
VALUE in df1^(time based on date in column of df2)*VALUE in df2

  • If time based on date in column of df2 > 5 (as calculated using the function calc_time(df2)), then the VALUE at time 5 will apply in this formula.

  • If VEHICLE is BIKE, then values in df3 remain same as df2.

Assume that df1 is given as

df2 is given by

df3 is built by applying the formulae as shown below


Note: For VEHICLE type BIKE, the values remain unchanged

Shouldn’t the innerjoin in your code be based on the following?

criteria = ["GRADE","RANK","LEVEL"]
innerjoin(df1_stacked, df2; on = criteria , makeunique=true)

Once you do the join (as rikh suggested, and I think you are right, it would be all 3 criteria), then you would calculate the values.

So initially after the join, you would have a values column from df1 corresponding to the grade and each date column from df2. You can then write a function, something like

function convert_joined_columns(date, vehicle, grade_value, multiplier)
    if vehicle == "BIKE"
        return multiplier
    end
    grade_value ^ Dates.value(date) * multiplier # I am not sure how you came up with the date values in this function
end

for col in names(df3)[5:end] # grab the colum names
    df3[!, col] = convert_joined_columns.(col, df3.VEHICLE, df3[:, "grade_value"], df3[:, col])
end

That is a pretty simple way to do it, and I am sure some people would come up with a better way. I think that does what you are looking for though.