Selecting arbitrary columns from DataFrameRow

tom-plaa · November 8, 2022, 6:18pm

EDIT: Disregard this, I don’t know why I wasn’t getting the results I expected before…

I want to replicate, on an arbitrary DataFrame, the way an SQL database finds duplicate composite primary keys.

Let’s set up some dummy dataframes (borrowed from some other topic here):


using DataFrames

function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    extra_col = [0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end

function grades_2020_alt()
    name = ["Sally", "Bob", "Alice", "Hank", "Extra1", "Extra2"]
    grade_2020 = [1, 5, 8.5, 4, 1000, 2000]
    extra_col = [0, 0, 0, 0, 0, 0]
    DataFrame(; name, grade_2020, extra_col)
end

test_df = grades_2020()
test_df_alt = grades_2020_alt()

So let’s imagine our composite primary key is ["name", "grade_2020"]. I want to filter out duplicates and get the full rows (including the extra column), but I do not want to write the column names explicitly in the filtering algorithm. I want a function that receives a vector of primary key names, such as pkey_cols = ["name", "grade_2020"] for this particular instance.

My attempt only allows me to retrieve the columns that are part of the primary key, because I cannot select the relevant columns from a DataFrameRow object with something like row["name", "grade_2020"] or row[["name", "grade_2020"]].
EDIT: row[["name", "grade_2020"]] actually DOES work.
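A minimal sketch of that indexing, using a small made-up table (the names df, row and key are only for illustration): indexing a DataFrameRow with a vector of column names should give back another DataFrameRow restricted to those columns.

```julia
using DataFrames

# Small illustrative table, not the data from the post.
df = DataFrame(name = ["Sally", "Bob"], grade_2020 = [1.0, 5.0], extra_col = [0, 0])

row = first(eachrow(df))            # a DataFrameRow for the first row
pkey_cols = ["name", "grade_2020"]  # key columns passed in as data

key = row[pkey_cols]                # sub-row containing only the key columns
names(key)                          # column names of the sub-row
```

The point is that pkey_cols is an ordinary vector, so it can be passed into a function instead of being hard-coded.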

function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    pkeys_indata = datatoinsert[!, pkey_cols]    
    return filter(x -> !in(x, eachrow(pkeys_indb)), eachrow(pkeys_indata)) |> DataFrame
end

Which results in:

julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×2 DataFrame
 Row │ name    grade_2020 
     │ String  Float64    
─────┼────────────────────
   1 │ Extra1      1000.0
   2 │ Extra2      2000.0

I wish I could instead do something like:
(EDIT: this DOES actually work)

function test_filter(dataindb, datatoinsert, pkey_cols)
    pkeys_indb = dataindb[!, pkey_cols]
    return filter(x -> !in(x[pkey_cols], eachrow(pkeys_indb)), eachrow(datatoinsert)) |> DataFrame
end

EDIT: and the output result (as intended) below

julia> test_filter(test_df, test_df_alt, ["name", "grade_2020"])
2×3 DataFrame
 Row │ name    grade_2020  extra_col 
     │ String  Float64     Int64     
─────┼───────────────────────────────
   1 │ Extra1      1000.0          0
   2 │ Extra2      2000.0          0
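For comparison, the same kind of result can probably also be obtained with DataFrames' antijoin, which is a different technique from the row-wise filter above. This is only a sketch under that assumption; new_rows is a hypothetical helper name, and the two small tables are made up for illustration.

```julia
using DataFrames

# Hypothetical helper, not from the original post: keep the rows of
# `datatoinsert` whose key columns have no match in `dataindb`.
new_rows(dataindb, datatoinsert, pkey_cols) =
    antijoin(datatoinsert, dataindb; on = pkey_cols)

# Illustrative data only.
indb = DataFrame(name = ["Sally", "Bob"], grade_2020 = [1.0, 5.0], extra_col = [0, 0])
incoming = DataFrame(name = ["Sally", "Extra1"], grade_2020 = [1.0, 1000.0], extra_col = [0, 0])

result = new_rows(indb, incoming, ["name", "grade_2020"])
```

Here result keeps all columns of incoming, including extra_col, for the rows whose key is not already present in indb.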

I feel like this should have been something easy, but I’m surprisingly stuck.

pdeffebach · November 8, 2022, 7:01pm

Wait, so you have solved your problem? You are correct that you can use pkeys to select parts of a DataFrameRow programmatically.

tom-plaa · November 8, 2022, 7:02pm

Apparently, I must have mistyped something when I was trying to index a DataFrameRow with a vector of strings. This was embarrassing…
