Primary key in DataFrame

I use a DataFrame to hold data where a set of columns is considered the primary key, so only one record should be present for each primary-key combination. But I suppose such a condition cannot be enforced in a DataFrame. For example, in the following code, person and town represent the primary key and time is the queried value:

using DataFrames, DataFramesMeta
using Dates


df2 = DataFrame(person = ["John", "Nick", "Mary", "Mary"],
                time = [DateTime("20150101", "yyyymmdd"),
                        DateTime("20150101", "yyyymmdd"),
                        DateTime("20150201", "yyyymmdd"),
                        DateTime("20150601", "yyyymmdd")],
                town = ["Brisbane", "Perth", "Wollongong", "Wollongong"]);


e = df2[(df2.person .== "Mary") .& (df2.town .== "Wollongong"), :time]
show(e)

Suppose that a person visiting the same town twice is impossible. I ask when Mary visited Wollongong and expect a single date, but I get an array whenever duplicate records were originally put into df2.
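(As an aside, wrapping the lookup in Base's only would at least make the duplicate visible at query time, since only throws unless the result has exactly one element, but that does not prevent the duplicate from being stored in the first place.)

only(df2[(df2.person .== "Mary") .& (df2.town .== "Wollongong"), :time])
# throws an ArgumentError here, because df2 holds two Wollongong rows for Mary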

Is it better to use a NamedArray or a Dict instead of a DataFrame to enforce records with unique primary keys? Pandas has a MultiIndex, which I suppose can achieve something like this.
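For comparison, a minimal sketch of what the Dict approach could look like with the same fields (just an illustration, not an endorsement):

using Dates

visits = Dict{Tuple{String,String}, DateTime}()
visits[("Mary", "Wollongong")] = DateTime("20150201", "yyyymmdd")
visits[("Mary", "Wollongong")] = DateTime("20150601", "yyyymmdd")  # same key: silently overwrites
visits[("Mary", "Wollongong")]                                     # one DateTime per (person, town) key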


Just use combine(groupby(df2, [:person, :town]), :time => first => :time), or whatever alternative function you prefer for deciding which time entry you want to keep.
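Applied to the example data, a quick sketch (keeping either the first or the latest time per key; df2 is assumed to be defined as in the original post):

dedup_first = combine(groupby(df2, [:person, :town]), :time => first => :time)
# or keep the latest visit per (person, town) instead:
dedup_last  = combine(groupby(df2, [:person, :town]), :time => maximum => :time)

Both produce one row per unique (person, town) pair.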

You’re right that enforcing this doesn’t work in DataFrames, and you should probably use something else if this is of crucial importance. Deciding what exactly that might be will probably require more context - in your example you say that visiting the same town twice is impossible, but clearly in the data you posted it is possible, so one would need to consider why this happens to figure out what to do about it.


Would it be useful to define a function that only adds a new row to the DataFrame if its primary key (person and town) is new, and otherwise overwrites the existing row?
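A minimal sketch of what such a function could look like with the example's (person, town) key (the name upsert! and the exact signature are just illustrative):

function upsert!(df, row)
    # look for an existing row with the same (person, town) primary key
    idx = findfirst(i -> df.person[i] == row.person && df.town[i] == row.town, 1:nrow(df))
    if idx === nothing
        push!(df, row)               # new key: append the row
    else
        df[idx, :time] = row.time    # existing key: overwrite the non-key column
    end
    return df
end

upsert!(df2, (person = "Mary", town = "Wollongong", time = DateTime("20150701", "yyyymmdd")))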
