Two Feature Requests for `merge`

I’d like to propose two added functions to the merge function. I’m not sure if this needs to be in core language or somewhere like in DataFrames, but as long as merge is in core, seems reasonable to put here (though input welcome!).

In particular, I’d like to propose the following two options be offered as optional keywords:

  • validate: duplicates the functionality of validate keyword in the pandas merge function. Accepts "1:1", "1:m", "m:1", and "m:m", and raises an exception if the merge is not 1 to 1, 1 to many, many to 1, or many to many (respectively).
  • indicator: duplicates functionality of indicator keyword in pandas [merge] function. If True, adds column to returned object which records whether resulting row has data from both datasets, from the left_only, or the right_only (or if we’d prefer numerics for generality, 1, 2, and 3)

(Both are actually replications of behavior from Stata)

Personally, I find these exceedingly value when working with real world data, as there’s no place problems become more evident than in merges, and it gets exhausting writing code the replicates these functionalities every time I merge (especially the indicator command).

I don’t have anything super useful to say on the main topic. But thought I’d mention that functions operating on types in a package should be defined in that package. It doesn’t matter that merge is defined in Base - the DataFrames package can extend that function with new methods.

I only bring this up because I’m guessing this is news to you from your 2nd sentence, and this confused me for a little while coming from python, but it turns out to be one of the great things about julia!

2 Likes

Right! Great point, thanks. I guess I should say “…since it is defined in Base and could be useful outside DataFrames as well”. I mostly use it in dataframes (where I go to work with dirty real-world tabular data), but curious if more generally useful.

Probably more efficient to return this as two bitvectors, which would only be 2 bits (instead of 8? 16? 32? 64? to store a number (that also makes it fast/easy to do things like count how many rows were left_only, right_only, or mixed)

I think this is DataFrames-specific, so maybe you could open an issue (or a PR) for that package.

I don’t think that string arguments for selecting behavior is idiomatic for Julia. I would use symbols, or even a specific type, eg Validate(true, true) for many to many.

2 Likes

Note that join is much more flexible than merge for DataFrames.

1 Like

@nalimilan Do you think these options might make sense being integrated into join in DataFrames instead of merge? (cc: @bkamins)

If I understand your indicator functionality correctly, this is already available in DataFrames.join, along with a few other varieties. From the DataFrames docs:

If you’re explicitly setting your join type, how often do you expect to need to validate the result? Honest question, I just haven’t run into that situation.

  • DataFrame does not support merge; only merge! is supported which is kind of hcat but replacing duplicate columns;
  • as noted above join is a default function providing joining functionality; if you would find adding validate or indicator keyword arguments useful it would be best to make an issue in DataFrames.jl repository explaining the use cases and proposed functionality (actually I find that both could be useful in some cases but most of the time what join already has is probably enough);

Unfortunately, all the time. :wink:

I’m an empirical social scientist, and work with data from lots of sources (census, various government agencies, other researchers, etc.). Those data sources are often supposed to relate in certain ways (Everyone in dataset A should also be in B, but there should be people in B not in A), but they almost inevitably don’t. Just doing a Left or Right join would get only the people who fit the promised relationship, but I (a) want to know how many people didn’t merge right (so I usually do an Outer join, then check the merge-state, and (b) when things go wrong, I want to look at the errors (so it’s nice to quickly query the people who are in A but not in B, in the example above).

I recognize this is not a problem that many CS people (with nice, single-source datasets) or people in the private sector (working with data from a single source, or from a SQL database that has inbuilt checks for these kinds of things) deal with much, but I think this is super common among social scientists.

2 Likes

Great, will open over there!