Two Feature Requests for `merge`

nickeubank · May 25, 2018, 1:41am

I’d like to propose two additions to the merge function. I’m not sure whether this belongs in the core language or somewhere like DataFrames, but as long as merge is in core, it seems reasonable to raise it here (though input welcome!).

In particular, I’d like to propose the following two options be offered as optional keywords:

validate: duplicates the functionality of the validate keyword in the pandas merge function. Accepts "1:1", "1:m", "m:1", and "m:m", and raises an exception if the merge is not 1 to 1, 1 to many, many to 1, or many to many (respectively).
indicator: duplicates the functionality of the indicator keyword in the pandas merge function. If True, adds a column to the returned object which records whether each resulting row has data from both datasets, from the left only (left_only), or from the right only (right_only). If we’d prefer numerics for generality, these could be 1, 2, and 3.
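To make the validate semantics concrete, here is a minimal sketch in plain Python (not Julia, and not the pandas or DataFrames implementation). The function name `check_merge_keys` is made up for illustration; it only inspects the key columns and assumes a side declared as "1" must have unique keys.

```python
from collections import Counter


def check_merge_keys(left_keys, right_keys, validate="m:m"):
    """Raise ValueError if the keys violate the declared relationship.

    Hypothetical helper: "1" on a side means keys on that side must be unique.
    """
    if validate not in {"1:1", "1:m", "m:1", "m:m"}:
        raise ValueError(f"unknown validate option: {validate!r}")
    left_side, right_side = validate.split(":")
    for side, keys, name in (
        (left_side, left_keys, "left"),
        (right_side, right_keys, "right"),
    ):
        if side == "1":
            # Collect every key that appears more than once on this side.
            dupes = sorted(k for k, n in Counter(keys).items() if n > 1)
            if dupes:
                raise ValueError(
                    f"{name} keys are not unique for {validate!r}: {dupes}"
                )


people = ["ann", "bob", "cat"]   # one row per person
visits = ["ann", "ann", "bob"]   # possibly several rows per person

check_merge_keys(people, visits, "1:m")  # passes silently
try:
    check_merge_keys(people, visits, "1:1")
except ValueError as err:
    print("merge rejected:", err)
```

The check runs before any rows are combined, so a violated assumption about the data surfaces as an error rather than as silently duplicated rows.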

(Both are actually replications of behavior from Stata)

Personally, I find these exceedingly valuable when working with real-world data, as there’s no place problems become more evident than in merges, and it gets exhausting writing code that replicates these functionalities every time I merge (especially the indicator option).

kevbonham · May 25, 2018, 1:54am

I don’t have anything super useful to say on the main topic. But thought I’d mention that functions operating on types in a package should be defined in that package. It doesn’t matter that merge is defined in Base - the DataFrames package can extend that function with new methods.

I only bring this up because I’m guessing this is news to you from your 2nd sentence, and this confused me for a little while coming from python, but it turns out to be one of the great things about julia!

nickeubank · May 25, 2018, 1:56am

Right! Great point, thanks. I guess I should say “…since it is defined in Base and could be useful outside DataFrames as well”. I mostly use it in dataframes (where I go to work with dirty real-world tabular data), but curious if more generally useful.

ScottPJones · May 25, 2018, 2:06am

Probably more efficient to return this as two bitvectors, which would take only 2 bits per row (instead of 8? 16? 32? 64? to store a number). That also makes it fast and easy to do things like count how many rows were left_only, right_only, or mixed.
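A rough sketch of the two-bitvector idea, using Python integers as bitmasks purely for illustration (a Julia version would presumably use BitVector). The helper names `pack_bits` and `popcount` and the sample keys are assumptions, not part of any proposal in this thread.

```python
def pack_bits(flags):
    """Pack an iterable of booleans into an int, bit i = flag for row i."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits


def popcount(bits):
    """Number of set bits in a non-negative int."""
    return bin(bits).count("1")


# Keys of the rows in a hypothetical outer-joined result.
keys = ["a", "b", "c", "d"]
left = {"a", "b", "c"}
right = {"b", "c", "d"}

in_left = pack_bits(k in left for k in keys)    # row has data from the left
in_right = pack_bits(k in right for k in keys)  # row has data from the right

# Counting each category is a bitwise operation followed by a popcount.
left_only = popcount(in_left & ~in_right)
right_only = popcount(in_right & ~in_left)
both = popcount(in_left & in_right)
print(left_only, right_only, both)
```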

Tamas_Papp · May 25, 2018, 5:28am

I think this is DataFrames-specific, so maybe you could open an issue (or a PR) for that package.

I don’t think that string arguments for selecting behavior are idiomatic for Julia. I would use symbols, or even a specific type, e.g. Validate(true, true) for many to many.

nalimilan · May 25, 2018, 8:28am

Note that join is much more flexible than merge for DataFrames.

nickeubank · May 25, 2018, 2:13pm

@nalimilan Do you think these options might make sense being integrated into join in DataFrames instead of merge? (cc: @bkamins)

jkbest2 · May 25, 2018, 7:11pm

If I understand your indicator functionality correctly, this is already available in DataFrames.join, along with a few other varieties. From the DataFrames docs:

There are seven kinds of joins supported by the DataFrames package:

Inner: The output contains rows for values of the key that exist in both the first (left) and second (right) arguments to join.

Left: The output contains rows for values of the key that exist in the first (left) argument to join, whether or not that value exists in the second (right) argument.

Right: The output contains rows for values of the key that exist in the second (right) argument to join, whether or not that value exists in the first (left) argument.

Outer: The output contains rows for values of the key that exist in the first (left) or second (right) argument to join.

Semi: Like an inner join, but output is restricted to columns from the first (left) argument to join.

Anti: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument to join. As with semi joins, output is restricted to columns from the first (left) argument.

Cross: The output is the cartesian product of rows from the first (left) and second (right) arguments to join.
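As a rough illustration of the seven kinds listed above, the sketch below shows which key values survive each join, using Python sets. It is a simplification and not the DataFrames API: it assumes unique keys on both sides and ignores which columns are kept (semi and anti joins keep only left columns).

```python
from itertools import product

left = {1, 2, 3}    # key values present in the left table
right = {2, 3, 4}   # key values present in the right table

join_keys = {
    "inner": left & right,   # keys present in both
    "left": set(left),       # every left key, matched or not
    "right": set(right),     # every right key, matched or not
    "outer": left | right,   # keys present in either
    "semi": left & right,    # same rows as inner, left columns only
    "anti": left - right,    # left keys with no match on the right
}

# A cross join ignores keys: every left row pairs with every right row.
cross_pairs = list(product(sorted(left), sorted(right)))

for kind, keys in join_keys.items():
    print(kind, sorted(keys))
print("cross", len(cross_pairs), "row pairs")
```

None of these outputs says, per row, which side the data came from, which is the gap the indicator proposal is about.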

If you’re explicitly setting your join type, how often do you expect to need to validate the result? Honest question, I just haven’t run into that situation.

bkamins · May 25, 2018, 8:11pm

DataFrame does not support merge; only merge! is supported, which is a kind of hcat that replaces duplicate columns.
As noted above, join is the default function providing joining functionality. If you would find adding validate or indicator keyword arguments useful, it would be best to open an issue in the DataFrames.jl repository explaining the use cases and proposed functionality (actually I find that both could be useful in some cases, but most of the time what join already has is probably enough).

nickeubank · May 25, 2018, 10:01pm

Unfortunately, all the time.

I’m an empirical social scientist, and work with data from lots of sources (census, various government agencies, other researchers, etc.). Those data sources are often supposed to relate in certain ways (everyone in dataset A should also be in B, but there should be people in B not in A), but they almost inevitably don’t. Just doing a Left or Right join would get only the people who fit the promised relationship, but (a) I want to know how many people didn’t merge right (so I usually do an Outer join, then check the merge-state), and (b) when things go wrong, I want to look at the errors (so it’s nice to quickly query the people who are in A but not in B, in the example above).
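The workflow described here (outer join, check the merge state, then pull out the mismatches) might look roughly like the following in plain Python. The function name, the sample data, and the record layout are invented for illustration; the `_merge` column name and its three values are borrowed from pandas, and unique keys are assumed.

```python
from collections import Counter


def outer_join_with_indicator(left, right):
    """Outer-join two dicts (key -> value), tagging each row with its source.

    Hypothetical helper assuming unique keys on both sides.
    """
    rows = []
    for key in sorted(left.keys() | right.keys()):
        if key in left and key in right:
            source = "both"
        elif key in left:
            source = "left_only"
        else:
            source = "right_only"
        rows.append(
            {"id": key, "a": left.get(key), "b": right.get(key), "_merge": source}
        )
    return rows


# Dataset A is supposed to be a subset of dataset B, but "p4" breaks that promise.
dataset_a = {"p1": 34, "p2": 51, "p4": 29}
dataset_b = {"p1": "north", "p2": "south", "p3": "east"}

rows = outer_join_with_indicator(dataset_a, dataset_b)

# (a) How many rows fell into each merge state?
counts = Counter(row["_merge"] for row in rows)

# (b) Which people are in A but not in B?
in_a_not_b = [row["id"] for row in rows if row["_merge"] == "left_only"]
print(dict(counts), in_a_not_b)
```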

I recognize this is not a problem that many CS people (with nice, single-source datasets) or people in the private sector (working with data from a single source, or from a SQL database that has inbuilt checks for these kinds of things) deal with much, but I think this is super common among social scientists.

nickeubank · May 25, 2018, 10:02pm

Great, will open over there!
