I’d like to propose two added functions to the merge
function. I’m not sure if this needs to be in core language or somewhere like in DataFrames, but as long as merge
is in core, seems reasonable to put here (though input welcome!).
In particular, I’d like to propose the following two options be offered as optional keywords:
-
validate
: duplicates the functionality of validate
keyword in the pandas
merge function. Accepts "1:1"
, "1:m"
, "m:1"
, and "m:m"
, and raises an exception if the merge is not 1 to 1, 1 to many, many to 1, or many to many (respectively).
-
indicator
: duplicates functionality of indicator
keyword in pandas
[merge] function. If True
, adds column to returned object which records whether resulting row has data from both
datasets, from the left_only
, or the right_only
(or if we’d prefer numerics for generality, 1, 2, and 3)
(Both are actually replications of behavior from Stata)
Personally, I find these exceedingly value when working with real world data, as there’s no place problems become more evident than in merges, and it gets exhausting writing code the replicates these functionalities every time I merge (especially the indicator
command).
I don’t have anything super useful to say on the main topic. But thought I’d mention that functions operating on types in a package should be defined in that package. It doesn’t matter that merge
is defined in Base - the DataFrames package can extend that function with new methods.
I only bring this up because I’m guessing this is news to you from your 2nd sentence, and this confused me for a little while coming from python, but it turns out to be one of the great things about julia!
2 Likes
Right! Great point, thanks. I guess I should say “…since it is defined in Base and could be useful outside DataFrames as well”. I mostly use it in dataframes (where I go to work with dirty real-world tabular data), but curious if more generally useful.
Probably more efficient to return this as two bitvectors, which would only be 2 bits (instead of 8? 16? 32? 64? to store a number (that also makes it fast/easy to do things like count how many rows were left_only, right_only, or mixed)
I think this is DataFrames
-specific, so maybe you could open an issue (or a PR) for that package.
I don’t think that string arguments for selecting behavior is idiomatic for Julia. I would use symbols, or even a specific type, eg Validate(true, true)
for many to many.
2 Likes
Note that join
is much more flexible than merge
for DataFrames.
1 Like
@nalimilan Do you think these options might make sense being integrated into join
in DataFrames instead of merge
? (cc: @bkamins)
If I understand your indicator
functionality correctly, this is already available in DataFrames.join
, along with a few other varieties. From the DataFrames docs:
If you’re explicitly setting your join type, how often do you expect to need to validate
the result? Honest question, I just haven’t run into that situation.
Unfortunately, all the time.
I’m an empirical social scientist, and work with data from lots of sources (census, various government agencies, other researchers, etc.). Those data sources are often supposed to relate in certain ways (Everyone in dataset A should also be in B, but there should be people in B not in A), but they almost inevitably don’t. Just doing a Left
or Right
join would get only the people who fit the promised relationship, but I (a) want to know how many people didn’t merge right (so I usually do an Outer
join, then check the merge-state, and (b) when things go wrong, I want to look at the errors (so it’s nice to quickly query the people who are in A but not in B, in the example above).
I recognize this is not a problem that many CS people (with nice, single-source datasets) or people in the private sector (working with data from a single source, or from a SQL database that has inbuilt checks for these kinds of things) deal with much, but I think this is super common among social scientists.
2 Likes
Great, will open over there!