Vcat does not append DataFrame rows as it did in 0.10

I think @kevbonham is referring to a situation that sometimes occurs in Julia: a package is deliberately designed to throw an error in a specific circumstance but has no dedicated error check, so the call goes through a cascade of functions and ends at one where there is “No method defined” for something (though it could be a completely different error). This is fine, but if it is the sort of error that users are actually intended to see, it is nice when the error message tells you explicitly what you should be doing differently. (At least that’s how I interpreted the comment, and this is something I agree is very relevant in DataFrames.)


So something like the force keyword in Stata, which changes column types as needed, would be useful. However I still maintain that requiring the columns be in the same order is overly punitive.

It doesn’t sound like there need to be separate functions, though, right? Otherwise we end up with one set of functions for clean, large datasets and another for messy, small, relational ones.

No, this is a huge deal, seriously. You definitely don’t want to do this to your 10^7 row dataframes. There is a good chance you don’t want to do this if your columns are in fact wrapped pointers to some data buffer, or StaticArrays, or any of a number of other possible column types.

When it comes to actually using data and putting it into some optimization, machine learning, differential equations, integrations or whatever else, all this stuff matters, because at the end of the day everything is just going into some Array{Float64,N} or Array{Int64,N}.

(Edit: for some reason I was still thinking of types here, not ordering. :blush:)

What you have is one set of functions that respects types and is consistent with the behavior of AbstractArray in Base (to the extent possible) and another set of functions which will try its best to manipulate things the way you want even if that means using a little extra memory or CPU time. I don’t see any way around that. Otherwise dataframes become useful for relational database operations (for small datasets), but obstructive when actually utilizing the data they are storing.

Perhaps I don’t understand the internals of DataFrames well enough, but why is rearranging the order of columns so expensive? I would have thought that the order of columns is something separate from the layout of the data in memory.

Ugh, sorry I completely misunderstood you, for some reason I was still stuck on types. My apologies.

+1 for amalgamate! (just kidding)

Also, @pdeffebach, the latest PR for vcat on DataFrames allows columns to be in different orders. The append! function requires them to be in the same order still.

I think that in the particular context of this thread, a keyword to vcat may be appropriate. In general though, like you said @nalimilan, each step we take further away from matrices, the less that vcat seems like the right thing to be using. I’d like to think of vcat as some sort of optimized special case of something much more general. Perhaps the append function could be repurposed (which may be blasphemy to suggest)?


Let’s say we introduce a new vcat-like function (say, amalgamate ;)) that allows missing columns. I still don’t think it should promote the column type to allow missing.

I would expect both amalgamate and vcat to respect normal type promotion of vectors. So a column type in the amalgamated dataframe should be equal to the type one would get from concatenating the individual columns (or at the very least the element type should be the same).
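To make “normal type promotion of vectors” concrete, here is a small sketch using only Base. It assumes Julia 1.x, where missing lives in Base (the versions discussed in this thread needed Missings.jl), and the variable names are just for illustration:

```julia
# Assumption: Julia 1.x, where `missing` is part of Base.
ints     = [1, 2]                   # Vector{Int}
floats   = [3.0]                    # Vector{Float64}
optional = Union{Missing, Int}[4]   # allows missing, but holds none

# The element type of the result is the promoted type of the inputs.
t_float    = eltype(vcat(ints, floats))      # Float64
t_optional = eltype(vcat(ints, optional))    # Union{Missing, Int}
t_missing  = eltype(vcat(ints, [missing]))   # Union{Missing, Int}
```

Under the proposal above, a column in the concatenated dataframe would get exactly the element type that this kind of plain vector concatenation yields.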

This type promotion implies that a column name needs to be present in all dataframes unless one of the concatenated columns already allows missing. So for columns that don’t allow missing values amalgamate and vcat would behave exactly the same. But when the promoted column type allows missing values then amalgamate would allow a column to be missing and vcat would not.

I would be surprised if append! changed any column type (or added new columns). But if an existing column allows missing values I would expect append! to not require that column in appended dataframes.


So even if we had a keyword argument or separate function that would allow missing columns, I think those columns must already support missing values first. This ensures that a column that doesn’t allow missing values never silently gets a missing added anyway.


That would be consistent and rigorous, IMO. So to play that through with my example above,

julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, c = 4);
julia> amalgamate(d1, d2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) a are missing from argument(s) 2

julia> allowmissing!(d1); allowmissing!(d2);
julia> amalgamate(d1, d2)
2×3 DataFrames.DataFrame
│ Row │ a       │ b │ c       │
├─────┼─────────┼───┼─────────┤
│ 1   │ 1       │ 2 │ missing │
│ 2   │ missing │ 3 │ 4       │

Honestly the developer really should be checking for these things anyway, but the language design issue is to protect them from surprises.


What do you expect the value to be, if not for missing? If you are going to concatenate the datasets, I don’t think it can be anything other than missing. The strict behavior of throwing an error is already present in vcat, that’s why this thread exists.


It sounds like the solution is just a keyword argument for vcat or append that allows the user to say it’s okay to add a bunch of missings.
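As a point of comparison, and assuming a recent DataFrames release (1.x on Julia 1.x) rather than the 0.10/0.11 versions discussed in this thread: to my knowledge vcat there accepts a cols keyword that works roughly along these lines. The default stays strict, and cols = :union opts in to filling absent columns with missing. A minimal sketch under that assumption:

```julia
using DataFrames  # assumption: DataFrames 1.x, not the 0.11 API of this thread

d1 = DataFrame(a = 1, b = 2)
d2 = DataFrame(b = 3, c = 4)

# Default behavior is strict: differing sets of column names throw.
strict_failed = try
    vcat(d1, d2)
    false
catch
    true
end

# Opting in: columns absent from one input are filled with `missing`.
loose = vcat(d1, d2; cols = :union)
```

Column b is present in both inputs, so it keeps its plain integer element type; only a and c are widened to allow missing.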


Just to clarify: would the semantics of

using DataFrames
using Missings

function amalgamate(a, b) # not optimized, just for illustrating semantics
    _getcol(df, colname) = colname ∈ names(df) ? df[colname] :
         fill(missing, size(df, 1))
    DataFrame((colname => vcat(_getcol(a, colname), _getcol(b, colname))
               for colname in union(names(a), names(b)))...)
end

be a reasonable solution? It only expands types when it needs to.

Yes, I think we agree on these semantics.

I’m still not convinced that introducing a new function is a good idea. First, amalgamate doesn’t sound great, and that’s not a term used by any other software AFAIK. More importantly, if we introduce a new, less strict function, it is likely people will always use it rather than vcat, even when vcat would work. Then they won’t benefit from the check that all columns are present in both inputs, which is a very reasonable default and can catch bugs in most common cases.

Regarding append, that’s a separate issue, but as I said it would make sense to change it to reorder columns just like vcat did in 0.10 and does again on git master.

I totally agree :smile:, but — conditional on having a separate function — the way to improve on this is by suggesting a better name. Note that I am not pushing for amalgamate, I just picked it from the synonyms of “merge” to be able to experiment with the idea in code.

As you point out above, base R does not have this functionality, so it is not surprising that there is no name for it. dplyr::bind_rows looks equally accidental, so we may as well depart from it. It would be interesting to hear how Python handles this. But if the functionality is not present in other languages, or is not distinguished from the case with matching column names & order, we should be free to make up our own.

I think this is a reasonable argument. My objection is having size(..., 2) differ between the inputs and outputs of vcat — this is not explicitly mentioned in ?vcat, but ?cat does talk about it. Because of this, my preference would be for a separate function.
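For reference, this is the Base behavior the objection rests on, sketched with plain matrices (names are illustrative only). The exact exception type may differ between Julia versions, so the sketch only records that an error is thrown:

```julia
A = ones(2, 3)
B = ones(1, 3)
C = ones(1, 2)   # different number of columns

# Matching size(..., 2): the output has the same number of columns as the inputs.
stacked = vcat(A, B)

# Mismatched size(..., 2): Base refuses rather than padding.
threw = try
    vcat(A, C)
    false
catch
    true
end
```

A vcat on dataframes that adds columns would break the invariant size(vcat(A, B), 2) == size(A, 2) that holds here.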

I only used amalgamate in my example because it is such a bad name that I thought it was clear that it wouldn’t be used. I will be more careful about context going forward and not count on :slight_smile: to indicate such.

Honestly append is closer to correct but it’s sort of taken. It could be a derivative like append_all or vcat_all.

Those are all reasons why I’d be pretty content with leaving things strict, as they are, but I don’t really feel like I can tell everybody “No, your use case is wrong, don’t do that” (if it were up to me, which it isn’t).

Another alternative would be to leave vcat alone but to add a function that makes dataframes somehow more compatible, perhaps makesimilar!(df2, df1). This could, for example, ensure that df2 has columns names(df1) ∪ names(df2) and that the element types of df2 are made “compatible” with df1 wherever possible.

If things were kept the same, and no new functions were made, then what would be a workflow to enable the changing radio-telescope data problem described above? It’s not an unusual case.

Maybe iterate through the files, look at setdiff(names(d1), names(d2)) for each new file, and create new missing-enabled columns on d1 as needed, maybe log an info() that I added a column, then do a vcat(d1, d2). In reality I’d create a local or packaged function to do all of that, but it would work.
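A rough sketch of that workflow, assuming DataFrames 1.x syntax (df[!, col], names returning strings) rather than the 0.11 API used elsewhere in this thread. The helper name addmissingcols! is hypothetical, and two small dataframes stand in for the files:

```julia
using DataFrames  # assumption: DataFrames 1.x

# Hypothetical helper: give `target` every column of `other` that it lacks,
# filled with `missing`, and log each addition.
function addmissingcols!(target::DataFrame, other::DataFrame)
    for col in setdiff(names(other), names(target))
        @info "adding missing-enabled column $col"
        T = Union{Missing, eltype(other[!, col])}
        target[!, col] = Vector{T}(missing, nrow(target))
    end
    return target
end

d1 = DataFrame(a = 1, b = 2)
d2 = DataFrame(b = 3, c = 4)

addmissingcols!(d1, d2)   # d1 gains column c
addmissingcols!(d2, d1)   # d2 gains column a
d3 = vcat(d1, d2)         # strict vcat now succeeds; column order may differ
```

The strict vcat stays untouched here; all the leniency lives in the explicit preparation step, which is where the logging happens.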

The most common use case I can imagine for this is as an input to “strict” vcat, so it may make sense to just have one function.

In any case, one can abuse join to do this:

using DataFrames
df1 = DataFrame(a = 1:2)
df2 = DataFrame(b = 1:3)

n1 = size(df1, 1)
n2 = size(df2, 1)
df1[:extra] = 1:n1
df2[:extra] = (1:n2) .+ n1
df3 = join(df1, df2, on = :extra, kind = :outer)
df3[setdiff(names(df3), [:extra])]

so perhaps an extra kind argument would do the trick.

For the record, I’m not opposed to adding keywords to vcat, as long as the default behavior is strict and consistent with Base. If it were just me I’d think a new function or set of functions would be a better option, but @nalimilan makes a good point: it would be bad if vcat never got used, even in situations where it definitely should.


As @nalimilan pointed out:

But I do think that in order to allow missing columns one must first allow missing values.

I’m proposing that there are two separate things that can throw an error:

  • first, of course, is whether to allow any missing columns at all (normal vcat throws this kind of error whereas amalgamate wouldn’t)
  • secondly, the eltype must allow missing values before any missing value can be filled in (only applicable to amalgamate)

To continue @pasha’s example with allowmissing! this is the kind of behaviour I’m looking for:

julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, c = 4);
julia> allowmissing!(d1, :a)
julia> allowmissing!(d2, :c)
julia> eltype.(eachcol(amalgamate(d1, d2)))
3-element Array{DataType,1}:
 Union{Missing, Int64}
 Int64
 Union{Missing, Int64}

And if all columns allow missing the result should also allow missing in all columns:

julia> allowmissing!(d1)
julia> eltype.(eachcol(amalgamate(d1, d2)))
3-element Array{DataType,1}:
 Union{Missing, Int64}
 Union{Missing, Int64}
 Union{Missing, Int64}

So my answer to the use case

is that optional columns could be declared as allowing missing values before calling amalgamate, and columns that should always have a value should not need to allow potential missing values.

TLDR I think that the only difference between vcat and amalgamate should be that amalgamate interprets “allow missing values” (Union{Missing, T}) to imply “allow missing columns”.

(I’m not proposing we call it amalgamate either, I’m merely discussing the behaviour of a function that allows missing columns. Whether it’s implemented through a keyword argument or separate function is a separate discussion, although I also lean towards keyword argument on vcat)
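One way to sketch these two-stage semantics, assuming DataFrames 1.x and its cols = :union keyword on vcat (neither of which existed in the versions discussed here). The name amalgamate_strict is hypothetical and only serves the illustration:

```julia
using DataFrames  # assumption: DataFrames 1.x

# Hypothetical function: allow a column to be absent from one input only if
# its element type in the other input already allows `missing`.
function amalgamate_strict(a::DataFrame, b::DataFrame)
    for (i, df, other) in ((1, a, b), (2, b, a))
        for col in setdiff(names(other), names(df))
            # `col` exists only in `other`; filling it in for `df` requires
            # a column type that can hold `missing`.
            Missing <: eltype(other[!, col]) || throw(ArgumentError(
                "column $col is absent from argument $i and does not allow missing"))
        end
    end
    return vcat(a, b; cols = :union)
end

d1 = DataFrame(a = 1, b = 2)
d2 = DataFrame(b = 3, c = 4)

# Neither a nor c allows missing yet, so this is rejected.
rejected = try
    amalgamate_strict(d1, d2)
    false
catch
    true
end

allowmissing!(d1, :a)
allowmissing!(d2, :c)
result = amalgamate_strict(d1, d2)
coltypes = eltype.(eachcol(result))
```

Column b never needs to hold a missing value, so its element type is left alone, which matches the first eltype listing above.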


This is along the lines of what I had in mind when I said “merging on all unique ids” above. Perhaps vcat could remain strict and join starts being a bit more flexible. Although, I don’t know how optimized join is relative to vcat.