Selecting arbitrary columns from DataFrameRow

EDIT: Disregard this, I don’t know why I wasn’t getting the results I expected before…

I want to replicate finding duplicate composite primary keys in an sql database to some arbitrary dataframe.

Let’s set up some dummy dataframes (borrowed from some other topic here):


function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    extra_col = [0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end

function grades_2020_alt()
    name = ["Sally", "Bob", "Alice", "Hank", "Extra1", "Extra2"]
    grade_2020 = [1, 5, 8.5, 4, 1000, 2000]
    extra_col = [0, 0, 0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end

test_df = grades_2020()
test_df_alt = grades_2020_alt()

So let’s imagine our composite primary key would be ["name", "grade_2020"]. I want to filter out duplicates and get the full rows (including the extra column), but I do not want to explicitly write the column names in the filtering algorithm, I want a function that receives a vector of primary key names, such as pkey_cols = ["name", "grades_2020"] for this particular instance.

My attempt only allows me to retrieve the columns that make part of the primary key because I cannot select the relevant columns from a DataFrameRow object like row["name", "grade_2020"] or row[["name", "grade_2020"]].
EDIT: row[["name", "grade_2020"]] actually DOES work.

function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    pkeys_indata = datatoinsert[!, pkey_cols]    
    return filter(x -> !in(x, eachrow(pkeys_indb)), eachrow(pkeys_indata)) |> DataFrame
end

Which results in:

julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×2 DataFrame
 Row │ name    grade_2020 
     │ String  Float64    
─────┼────────────────────
   1 │ Extra1      1000.0
   2 │ Extra2      2000.0

I wish I could instead do something like:
(EDIT: this DOES actually work)

function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    return filter(x -> !in(x[pkeys_cols], eachrow(pkeys_indb)), eachrow(datatoinsert)) |> DataFrame
end

EDIT: and the output result (as intended) below

julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×3 DataFrame
 Row │ name    grade_2020  extra_col 
     │ String  Float64     Int64     
─────┼───────────────────────────────
   1 │ Extra1      1000.0          0
   2 │ Extra2      2000.0          0

I feel like this should have been something easy, but I’m surprisingly stuck.

Wait so you have solved your problem? You are correct that you can use pkeys to select parts of a DataFrameRow programatically.

Apparently, I must have mistyped something when I was trying to index DataFrameRow on a vector of strings. This was embarassing… :rofl: