EDIT: Disregard this, I don’t know why I wasn’t getting the results I expected before…
I want to replicate finding duplicate composite primary keys in an sql database to some arbitrary dataframe.
Let’s set up some dummy dataframes (borrowed from some other topic here):
function grades_2020()
name = ["Sally", "Bob", "Alice", "Hank"]
grade_2020 = [1, 5, 8.5, 4]
extra_col = [0, 0, 0, 0]
DataFrame(; name, grade_2020, extra_col)
end
function grades_2020_alt()
name = ["Sally", "Bob", "Alice", "Hank", "Extra1", "Extra2"]
grade_2020 = [1, 5, 8.5, 4, 1000, 2000]
extra_col = [0, 0, 0, 0, 0, 0]
DataFrame(; name, grade_2020, extra_col)
end
test_df = grades_2020()
test_df_alt = grades_2020_alt()
So let’s imagine our composite primary key would be ["name", "grade_2020"]
. I want to filter out duplicates and get the full rows (including the extra column), but I do not want to explicitly write the column names in the filtering algorithm, I want a function that receives a vector of primary key names, such as pkey_cols = ["name", "grades_2020"]
for this particular instance.
My attempt only allows me to retrieve the columns that make part of the primary key because I cannot select the relevant columns from a DataFrameRow
object like row["name", "grade_2020"]
or row[["name", "grade_2020"]].
EDIT: row[["name", "grade_2020"]]
actually DOES work.
function test_filter(dataindb, datatoinsert, pkey_cols)
pkeys_indb = dataindb[!, pkey_cols]
pkeys_indata = datatoinsert[!, pkey_cols]
return filter(x -> !in(x, eachrow(pkeys_indb)), eachrow(pkeys_indata)) |> DataFrame
end
Which results in:
julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×2 DataFrame
Row │ name grade_2020
│ String Float64
─────┼────────────────────
1 │ Extra1 1000.0
2 │ Extra2 2000.0
I wish I could instead do something like:
(EDIT: this DOES actually work)
function test_filter(dataindb, datatoinsert, pkey_cols)
pkeys_indb = dataindb[!, pkey_cols]
return filter(x -> !in(x[pkeys_cols], eachrow(pkeys_indb)), eachrow(datatoinsert)) |> DataFrame
end
EDIT: and the output result (as intended) below
julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×3 DataFrame
Row │ name grade_2020 extra_col
│ String Float64 Int64
─────┼───────────────────────────────
1 │ Extra1 1000.0 0
2 │ Extra2 2000.0 0
I feel like this should have been something easy, but I’m surprisingly stuck.