EDIT: Disregard this, I don’t know why I wasn’t getting the results I expected before…
I want to replicate finding duplicate composite primary keys in an sql database to some arbitrary dataframe.
Let’s set up some dummy dataframes (borrowed from some other topic here):
function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    extra_col = [0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end
function grades_2020_alt()
    name = ["Sally", "Bob", "Alice", "Hank", "Extra1", "Extra2"]
    grade_2020 = [1, 5, 8.5, 4, 1000, 2000]
    extra_col = [0, 0, 0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end
test_df = grades_2020()
test_df_alt = grades_2020_alt()
So let’s imagine our composite primary key would be ["name", "grade_2020"]. I want to filter out duplicates and get the full rows (including the extra column), but I do not want to explicitly write the column names in the filtering algorithm, I want a function that receives a vector of primary key names, such as pkey_cols = ["name", "grades_2020"] for this particular instance.
My attempt only allows me to retrieve the columns that make part of the primary key because I cannot select the relevant columns from a DataFrameRow object like row["name", "grade_2020"] or row[["name", "grade_2020"]].
EDIT: row[["name", "grade_2020"]] actually DOES work.
function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    pkeys_indata = datatoinsert[!, pkey_cols]    
    return filter(x -> !in(x, eachrow(pkeys_indb)), eachrow(pkeys_indata)) |> DataFrame
end
Which results in:
julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×2 DataFrame
 Row │ name    grade_2020 
     │ String  Float64    
─────┼────────────────────
   1 │ Extra1      1000.0
   2 │ Extra2      2000.0
I wish I could instead do something like:
(EDIT: this DOES actually work)
function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    return filter(x -> !in(x[pkeys_cols], eachrow(pkeys_indb)), eachrow(datatoinsert)) |> DataFrame
end
EDIT: and the output result (as intended) below
julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×3 DataFrame
 Row │ name    grade_2020  extra_col 
     │ String  Float64     Int64     
─────┼───────────────────────────────
   1 │ Extra1      1000.0          0
   2 │ Extra2      2000.0          0
I feel like this should have been something easy, but I’m surprisingly stuck.