Primary key in DataFrame

I use a DataFrame to hold data where a set of columns is considered the primary key, so only one record should be present for each primary-key combination. But I suppose such a condition cannot be enforced in a DataFrame. For example, in the following code, person and town represent the primary key and time is the queried value:

using DataFrames, DataFramesMeta
using Dates


df2 = DataFrame(person = ["John", "Nick", "Mary", "Mary"],
                time = [DateTime("20150101", "yyyymmdd"),
                        DateTime("20150101", "yyyymmdd"),
                        DateTime("20150201", "yyyymmdd"),
                        DateTime("20150601", "yyyymmdd")],
                town = ["Brisbane", "Perth", "Wollongong", "Wollongong"]);


e = df2[(df2.person .== "Mary") .& (df2.town .== "Wollongong"), :time]
show(e)

Suppose that a person visiting the same town twice is impossible. I ask when Mary visited Wollongong and expect a single date, but I get an array whenever duplicate records were originally put into df2.
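(As an aside, wrapping the lookup in Base's only would at least make the duplicate visible at query time, since only throws unless the result has exactly one element, but that does not prevent the duplicate from being stored in the first place.)

only(df2[(df2.person .== "Mary") .& (df2.town .== "Wollongong"), :time])
# throws an ArgumentError here, because df2 holds two Wollongong rows for Mary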

Is it better to use a NamedArray or a Dict instead of a DataFrame to enforce records with unique primary keys? Pandas has a MultiIndex, which I suppose can achieve something like this.
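For comparison, a minimal sketch of what the Dict approach could look like with the same fields (just an illustration, not an endorsement):

using Dates

visits = Dict{Tuple{String,String}, DateTime}()
visits[("Mary", "Wollongong")] = DateTime("20150201", "yyyymmdd")
visits[("Mary", "Wollongong")] = DateTime("20150601", "yyyymmdd")  # same key: silently overwrites
visits[("Mary", "Wollongong")]                                     # one DateTime per (person, town) key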


Just use combine(groupby(df2, [:person, :town]), :time => first => :time), or whatever alternative function you prefer for deciding which time entry you want to keep.
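Applied to the example data, a quick sketch (keeping either the first or the latest time per key; df2 is assumed to be defined as in the original post):

dedup_first = combine(groupby(df2, [:person, :town]), :time => first => :time)
# or keep the latest visit per (person, town) instead:
dedup_last  = combine(groupby(df2, [:person, :town]), :time => maximum => :time)

Both produce one row per unique (person, town) pair.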

You’re right that enforcing this doesn’t work in DataFrames, and you should probably use something else if this is of crucial importance. Deciding what exactly that might be will probably require more context - in your example you say that visiting the same town twice is impossible, but clearly in the data you posted it is possible, so one would need to consider why this happens to figure out what to do about it.


Would it be useful to define a function that only adds a new row to the DataFrame if its primary key (person and town) is new, and otherwise overwrites the existing row?
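A minimal sketch of what such a function could look like with the example's (person, town) key (the name upsert! and the exact signature are just illustrative):

function upsert!(df, row)
    # look for an existing row with the same (person, town) primary key
    idx = findfirst(i -> df.person[i] == row.person && df.town[i] == row.town, 1:nrow(df))
    if idx === nothing
        push!(df, row)               # new key: append the row
    else
        df[idx, :time] = row.time    # existing key: overwrite the non-key column
    end
    return df
end

upsert!(df2, (person = "Mary", town = "Wollongong", time = DateTime("20150701", "yyyymmdd")))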
