Vcat does not append DataFrame rows as it did in 0.10

In DataFrames 0.10, we could use vcat() as a way to append differently-shaped DataFrames, much like R’s dplyr::bind_rows(). You could pass two DataFrames with differing columns or column order and it would append the rows in the right place. This prior behavior is described in a Stack Overflow question and this issue.

Having recently upgraded to DataFrames 0.11, I’m porting code and noticed that this vcat behavior is no longer true. MWE:

julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, a = 4); vcat(d1, d2)
ERROR: ArgumentError: column order of argument(s) 1 != column order of argument(s) 2

julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, d = 4); vcat(d1, d2)
ERROR: ArgumentError: column(s) d are missing from argument(s) 1, and column(s) a are missing from argument(s) 2

It looks intentional in the code. What is the proper use case now?

Edit: I’m using DataFrames 0.11.5, and on that tag you can see that the team is actually testing for the above errors in test/cat.jl, but they are no longer testing for them in master.

So… this implies that it will be fixed soon and someone changed their mind? I can’t find any issues describing this thought process.

The first error (order of columns) has been recently fixed on master, should be in the next release:
https://github.com/JuliaData/DataFrames.jl/pull/1366

Regarding the second one, it’s not clear how it should be supported. FWIW R’s rbind doesn’t allow it, only dplyr::bind_rows does. I guess we could allow this via a keyword argument. Cc: @bkamins

Ok, thanks for the quick reply and awesome coding.

I’d say that in R, dplyr usage is the dominant pattern. I may be projecting, but most people don’t use the base-level data.frame functionality unless they have to or it’s more efficient in an edge case (selective updates for example).

So those dplyr patterns are the ones to follow, IMO, because dplyr really came in and fixed the broken things about the base implementation.

In general I agree that dplyr is the best model, but we are also sometimes a bit stricter than R, and I think that can be fine if we provide a convenient enough way of doing what you need.

In the case at hand, I couldn’t find a rationale for the behavior of bind_rows in dplyr. It could make sense to throw an error and tell the caller to enable the less strict behavior explicitly. That could help catch bugs early, which can actually be more useful to avoid wasting time finding out where unexpected missing values come from when it’s just that a column was absent in one of the input data frames. That would be consistent with the fact that we don’t allow missing values in columns by default. Not sure.

That sounds rational.

Here’s the use case if it helps: I’m combining radio telescope data from multiple runs, one CSV for each run. But there are several versions of files, basically adding columns as new capture software features are enabled. I want to put all of this CSV mess into one big Feather database, maybe mmap’ed. So, my goal is to bind the rows from all generations of files, knowing that it’s a changing source format.

So in my case, an extra parameter like vcat(df1, df2, add_missing_columns = true) is totally fine. Or, I’ll write something to do this, but I don’t think my use case is unusual. Thanks again!
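A minimal sketch of the "write something to do this" route, assuming a recent DataFrames release (1.x indexing syntax, where `names` returns strings). The name `bindrows` is hypothetical and not part of DataFrames; the helper pads each data frame with the columns it lacks, filled with `missing`, and then concatenates.

```julia
using DataFrames

# Hypothetical helper, not part of DataFrames: give every input the union of
# all column names (absent columns are filled with `missing`), then vcat.
function bindrows(dfs::AbstractDataFrame...)
    allcols = union(names.(dfs)...)          # column names in first-seen order
    padded = map(dfs) do df
        out = copy(df)                       # leave the caller's data untouched
        for col in allcols
            if !(col in names(out))
                out[!, col] = fill(missing, nrow(out))
            end
        end
        out[:, allcols]                      # same column order for every input
    end
    return vcat(padded...)
end

d1 = DataFrame(a = 1, b = 2)
d2 = DataFrame(b = 3, d = 4)
combined = bindrows(d1, d2)                  # columns a, b, d; two rows
```

Copying every input keeps the sketch simple but doubles memory use, so for very large files a version that preallocates the output columns would likely be preferable.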

We have something in the works where you will actually be able to use Feather as if it were a database with Query.jl or DataFramesMeta.jl queries which only touch the data that you actually need. If you are interested check out this PR and this package which implements the back-end. I’m probably getting a little ahead of myself by advertising this, but we could really use people to test it and suggest improvements for real-world use cases, if that’s something you’re interested in doing. (The code is pretty much done, we just have some organizational issues to take care of.)

I definitely agree that the default behavior of vcat should throw an error in the presence of extra columns. I like the keyword argument idea as well.

I also agree that vcat should be strict by default, but adding a keyword argument would be a good option (although it is not accessible via the [df1; df2] syntax).

The only question is the following: implementing this functionality requires somewhat different code from what we currently have, so maybe a separate function would be preferable to a keyword argument, e.g. bindrows? (It is a loose suggestion; I do not have a strong preference.)

I think the default should definitely be to throw an error. It is not obvious that in every case you would want what OP is requesting. Supporting this behavior via keywords could be ok, but @bkamins has a point that it wouldn’t cover all situations.

I think what is becoming apparent (with this post, the linked PR, and other discussions) is that when playing with DataFrames people are using vcat but really looking for something more flexible and more like a traditional merge or join type of behavior (in the sense that vcat is like merging on all unique ids). Perhaps a new function should fill this gap?

That’s a good observation, especially combined with @bkamins’s remarks. Perhaps a new function with methods for concatenation as well as other types of “merge” behavior is in order? I’m not sure what that would look like. As I recall, pandas has dedicated concatenation functions, so it probably more closely resembles the current state of affairs in DataFrames, though in pandas’ case I’d expect it to annoyingly create new columns willy-nilly without throwing errors. I think having vcat default to errors, but another function with more general functionality and more leniency may be ideal (again, I haven’t given any real thought to what that would look like).

That sounds wise. I can see the problem with silently creating rows of missing by the operation. Another function could fill that bill.

It’s not a join in the CS sense, but it’s not far either. The bind word may or may not be the right term; R has a bad habit of using wrong words for things because the good names are taken, or they seemed cute at the time. We don’t have that problem yet.

@ExpandingMan that package and PR are very exciting! I am doing most of the processing with multithreaded map-reduce, but for slower ad-hoc ops we also want to have a copy of the full dataset available.

Stata calls this “append”, and has options for adding rows that are on top, but not bottom, bottom but not top, neither and both. It also creates missings silently, which is what missings are for, right? Other than that the operation will potentially change the type of some columns.

A column type says whether missing is allowed or not in that column. Could we use this to rule whether vcat is allowed to fill in missing columns?

Example

> A = DataFrame( a = [1] )
> B = DataFrame( a = [2], b = [3] )
> [A; B]
Error: column `b` doesn't allow missing values but the column is missing from one or more of the concatenated DataFrames. You can use `allowmissing!` to allow missing values.

To concatenate with missing columns one could then do:

> allowmissing!(B, :b)
> [A; B]
DataFrame( a = [1, 2], b = [missing, 3] )

This way the user doesn’t need to learn about any extra keyword arguments or new functions. The user gets an actionable error message when they try what is most straightforward. And it also works with [A; B] instead of explicitly calling vcat.
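For reference, a small sketch of what `allowmissing!` does to a column today, assuming a recent DataFrames release (1.x). It only shows the element-type change that the proposal above would rely on; the error message and the `[A; B]` result shown above are the proposed behavior, not something this snippet demonstrates.

```julia
using DataFrames

B = DataFrame(a = [2], b = [3])

before = eltype(B.b)      # plain integer column, no missing allowed
allowmissing!(B, :b)      # widen column b in place
after = eltype(B.b)       # now a Union that includes Missing
```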

There’s indeed a difficulty with vcat on data frames, which is that columns have both an order and names, so the analogy with matrices breaks when names are not in the same order. We discussed this for PR #1366.

Here the situation is a step further from matrices, since some columns are altogether absent. But it doesn’t sound completely absurd to use vcat for this, as it can be seen as a generalization of matrix vcat where we know which columns match thanks to their names. In the context of AxisArray or NamedArray, vcat could also allow this possibility via an argument. I’m not sure there’s a lot to gain by introducing a different function (but maybe I’m missing it). What kind of additional feature would we want to add to it? dplyr::bind_rows also accepts a list of data frames, but in Julia we’d better use an optimized reduce(vcat, dfs) method for that.
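As an illustration of the reduce(vcat, dfs) pattern mentioned above, here is a minimal sketch assuming a recent DataFrames release and data frames that all share the same columns. The `runs` data is made up for the example.

```julia
using DataFrames

# Three made-up per-run tables with identical columns.
runs = [DataFrame(run = fill(i, 2), signal = [0.1 * i, 0.2 * i]) for i in 1:3]

# Fold vcat over the collection rather than splatting vcat(runs...).
combined = reduce(vcat, runs)
```

Folding over a vector avoids splatting a potentially long list of arguments, which is the main reason to prefer it when the number of files is large.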

If we want a different function, join is probably not the right one. In SQL, this would rather be UNION ALL CORRESPONDING or OUTER UNION CORRESPONDING (which exist in SAS but not in many other implementations apparently). See discussion in dplyr.

I don’t think we should do that. Whether a column allows missing values does not indicate whether you expect it to be present in all concatenated data frames. And calling allowmissing! is neither clearer nor easier for users to find than a keyword argument to vcat.

Actually we already have append!, but it requires column names to match and be in the same order. So we could also add a keyword argument to it. But since it’s in-place that function cannot add missing values to columns which do not support them (it could replace column vectors in that case but that’s not what it does, at least currently). And of course there’s no append which would return a new data frame.

IMO that would be the best option, instead of a keyword for some other function.

But why? And what name would you suggest (that’s the hard part :wink: )?

When names(df1) == names(df2), this implies names(vcat(df1, df2)) == names(df1), which is a nice invariant. When this does not hold, I would prefer to call it something else.

OTOH, having names(_op(df1, df2)) == union(names(df1), names(df2)) is reasonable too, which has the above as a special case.
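A quick sketch of the two invariants, assuming a recent DataFrames release. The `_op` function above is left undefined on purpose; the snippet only computes the column set such an operation would be expected to return.

```julia
using DataFrames

df1 = DataFrame(a = 1, b = 2)
df2 = DataFrame(a = 3, b = 4)
df3 = DataFrame(b = 5, d = 6)

# Same names on both sides: vcat preserves them.
same = names(vcat(df1, df2)) == names(df1)

# Different names: the column set a union-style operation would have to produce.
wanted = union(names(df1), names(df3))
```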

Regarding the hard part :smile:, possibly a synonym for “merge”, eg amalgamate, but I can’t think of anything that would express intent clearly.

Requiring that the columns have the same name and order seems odd, and contrary to the “non-matrix” design intentions of dataframes. Wouldn’t it be best to just change the default behavior of append? What’s the reasoning for having such a specific function?

And is automatic promotion to missing a concern? Because silently generating missing values is what Missings is for, right?

Yes, it is definitely a concern, you don’t know what kind of AbstractVector a person has in a DataFrame and besides you would not want to silently call such a promotion on a 10^7 row dataframe. It is imperative that the methods for DataFrame reflect the underlying AbstractVector interface of its columns.

I think as a rule of thumb it would make sense for any functions that come from the AbstractArray interface to have strict rules and throw errors by default, possibly with keyword arguments to relax the default behavior. We can have separate relational database functions which are less strict, though I certainly think most of those should still err on the side of leaving containers unaltered wherever possible.

Definitely agree with this, so long as the errors are informative. Nothing worse than using a function a million times, and have it all of a sudden stop working with something like “No method defined for ” and have no idea what to do.

I am on the fence about this when designing an interface. I agree that making a suggestion (“no method defined, so go and define a method for this signature if you want that to work”) can be useful, especially as it indicates that this is part of the interface and not a bug. OTOH, this is basically a rehashing of the information the exception already conveys.