I think @kevbonham is referring to the situation which sometimes occurs in Julia in which a package is deliberately designed to throw an error in a specific circumstance, but does not have a dedicated error check per se, so instead the stack goes through a cascade of functions ending with one where there is “no method defined” for something (though it could be a completely different error). This is fine, but if it is the sort of error that users are actually intended to see, it is nice when the error message tells you explicitly what you should be doing differently. (At least that’s how I interpreted the comment, and this is something I agree is very relevant in DataFrames.)
So something like the `force` keyword in Stata, which changes column types as needed, would be useful. However, I still maintain that requiring the columns to be in the same order is overly punitive.
It doesn’t sound like there need to be separate functions, though, right? Otherwise we end up somewhere where we have one set of functions for clean, large datasets and another for messy, small, relational ones.
No, this is a huge deal, seriously. You definitely don’t want to do this to your 10^7-row dataframes. There is a good chance you don’t want to do this if your columns are in fact wrapped pointers to some data buffer, or `StaticArray`s, or any number of other possible column types.
When it comes to actually using data and putting it into some optimization, machine learning, differential equation, integration, or whatever else, all this stuff matters, because at the end of the day everything is just going into some `Array{Float64,N}` or `Array{Int64,N}`.
(Edit: for some reason I was still thinking of types here, not ordering. )
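A minimal illustration of that last point, with hypothetical column names; `Matrix(df)` is the standard DataFrames way to get a plain dense array out of a table:

```julia
using DataFrames

# A small, hypothetical table of Float64 columns.
df = DataFrame(x = [1.0, 2.0], y = [3.0, 4.0])

# Solvers and ML code ultimately consume a plain dense array:
M = Matrix(df)   # a 2×2 Matrix{Float64}, columns in table order
```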
What you have is one set of functions that respects types and is consistent with the behavior of `AbstractArray` in `Base` (to the extent possible), and another set of functions which will try its best to manipulate things the way you want, even if that means using a little extra memory or CPU time. I don’t see any way around that. Otherwise dataframes become useful for relational database operations (for small datasets), but obstructive when actually utilizing the data they are storing.
Perhaps I don’t understand the internals of DataFrames well enough, but why is rearranging the order of columns so expensive? I would have thought that the order of columns is something separate from the layout of the data in memory.
Ugh, sorry I completely misunderstood you, for some reason I was still stuck on types. My apologies.
+1 for `amalgamate`! (just kidding)
Also, @pdeffebach, the latest PR for `vcat` on DataFrames allows columns to be in different orders. The `append!` function still requires them to be in the same order.
I think that in the particular context of this thread, a keyword to `vcat` may be appropriate. In general though, like you said @nalimilan, each step we take further away from matrices, the less `vcat` seems like the right thing to be using. I’d like to think of `vcat` as some sort of optimized special case of something much more general. Perhaps the `append` function could be repurposed (which may be blasphemy to suggest)?
Let’s say we introduce a new `vcat`-like function (say, `amalgamate` ;)) that allows missing columns. I still don’t think it should promote the column type to allow missing.
I would expect both `amalgamate` and `vcat` to respect normal type promotion of vectors. So a column type in the amalgamated dataframe should be equal to the type one would get from concatenating the individual columns (or at the very least the element type should be the same).
This type promotion implies that a column name needs to be present in all dataframes unless one of the concatenated columns already allows missing. So for columns that don’t allow missing values, `amalgamate` and `vcat` would behave exactly the same. But when the promoted column type allows missing values, then `amalgamate` would allow a column to be missing and `vcat` would not.
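The promotion rule being described can be checked on plain vectors, without DataFrames at all; a quick sketch:

```julia
a = [1, 2]        # Vector{Int64}
b = [missing, 3]  # Vector{Union{Missing, Int64}}

# Concatenating promotes the element type; an amalgamated column
# should end up with the same eltype as this result.
c = vcat(a, b)
eltype(c)  # Union{Missing, Int64}
```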
I would be surprised if `append!` changed any column type (or added new columns). But if an existing column allows missing values, I would expect `append!` to not require that column in appended dataframes.
So even if we had a keyword argument or separate function that would allow missing columns, I think those columns must already support missing values first. This ensures that a column that doesn’t allow missing values never silently gets a `missing` added.
That would be consistent and rigorous, IMO. So to play that through with my example above,
```julia
julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, c = 4);

julia> amalgamate(d1, d2)
ERROR: ArgumentError: column(s) c are missing from argument(s) 1, and column(s) a are missing from argument(s) 2

julia> allowmissing!(d1); allowmissing!(d2);

julia> amalgamate(d1, d2)
2×3 DataFrames.DataFrame
│ Row │ a       │ b │ c       │
├─────┼─────────┼───┼─────────┤
│ 1   │ 1       │ 2 │ missing │
│ 2   │ missing │ 3 │ 4       │
```
Honestly the developer really should be checking for these things anyway, but the language design issue is to protect them from surprises.
What do you expect the value to be, if not `missing`? If you are going to concatenate the datasets, I don’t think it can be anything other than `missing`. The strict behavior of throwing an error is already present in `vcat`; that’s why this thread exists.
It sounds like the solution is just a keyword argument for `vcat` or `append` that allows the user to say it’s okay to add a bunch of missings.
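For reference, this is roughly what that keyword ended up looking like: recent DataFrames releases (this is version-dependent, so treat it as a sketch) accept a `cols` keyword on `vcat`, where `cols = :union` fills absent columns with `missing`:

```julia
using DataFrames

d1 = DataFrame(a = [1])
d2 = DataFrame(b = [2])

# Strict behavior: mismatched column sets throw an ArgumentError.
strict_fails = try
    vcat(d1, d2)
    false
catch err
    err isa ArgumentError
end

# Relaxed behavior: fill absent columns with missing.
relaxed = vcat(d1, d2; cols = :union)  # 2×2 with missings off the diagonal
```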
Just to clarify: would the semantics of

```julia
using DataFrames
using Missings

# not optimized, just for illustrating semantics
function amalgamate(a, b)
    _getcol(df, colname) = colname ∈ names(df) ? df[colname] :
        fill(missing, size(df, 1))
    DataFrame((colname => vcat(_getcol(a, colname), _getcol(b, colname))
               for colname in union(names(a), names(b)))...)
end
```

be a reasonable solution? It only expands types when it needs to.
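A quick usage check of those semantics; the definition is repeated here so the snippet is self-contained, restated with the `df[!, col]`/`nrow` indexing that current DataFrames versions expect (`amalgamate` remains a hypothetical name):

```julia
using DataFrames

# Same semantics as the sketch above, current-style indexing.
function amalgamate(a, b)
    getcol(df, n) = n in names(df) ? df[!, n] : fill(missing, nrow(df))
    DataFrame((n => vcat(getcol(a, n), getcol(b, n))
               for n in union(names(a), names(b)))...)
end

d1 = DataFrame(a = [1], b = [2])
d2 = DataFrame(b = [3], c = [4])
out = amalgamate(d1, d2)
# out has columns a, b, c, with `missing` filled in wherever a column was absent
```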
Yes, I think we agree on these semantics.
I’m still not convinced that introducing a new function is a good idea. First, `amalgamate` doesn’t sound great, and that’s not a term used by any other software AFAIK. More importantly, if we introduce a new, less strict function, it is likely people will always use it rather than `vcat`, even when `vcat` would work. Then they won’t benefit from the check that all columns are present in both inputs, which is a very reasonable default and can catch bugs in most common cases.
Regarding `append`, that’s a separate issue, but as I said it would make sense to change it to reorder columns just like `vcat` did in 0.10 and does again on git master.
I totally agree, but (conditional on having a separate function) the way to improve on this is by suggesting a better name. Note that I am not pushing for `amalgamate`; I just picked it from the synonyms of “merge” to be able to experiment with the idea in code.
As you point out above, base R does not have this functionality, so it is not surprising that there is no name for it. `dplyr::bind_rows` looks equally accidental, so we may as well depart from it. It would be interesting to hear how Python handles this. But if the functionality is not present, or not distinguished from the case with matching column names and order, in other languages, we should be free to make up our own.
I think this is a reasonable argument. My objection is having `size(..., 2)` differ between the inputs and outputs of `vcat`; this is not explicitly mentioned in `?vcat`, but `?cat` does talk about it. Because of this, my preference would be for a separate function.
I only used `amalgamate` in my example because it is such a bad name that I thought it was clear that it wouldn’t be used. I will be more careful about context going forward and not count on a “;)” to indicate such.
Honestly `append` is closer to correct, but it’s sort of taken. It could be a derivative like `append_all` or `vcat_all`.
Those are all reasons why I’d be pretty content with leaving things strict, as they are, but I don’t really feel like I can tell everybody “No, your use case is wrong, don’t do that” (if it were up to me, which it isn’t).
Another alternative would be to leave `vcat` alone but to add a function that makes dataframes somehow more compatible, perhaps `makesimilar!(df2, df1)`. This could, for example, ensure that `df2` has columns `names(df1) ∪ names(df2)` and that the element types of `df2` are made “compatible” with `df1` wherever possible.
If things were kept the same, and no new functions were made, then what would be a workflow to enable the changing radio-telescope data problem described above? It’s not an unusual case.
Maybe iterate through the files, look at `setdiff(names(d1), names(d2))` for each new file, and create new `missing`-enabled columns on `d1` as needed, maybe log an `info()` that I added a column, then do a `vcat(d1, d2)`. In reality I’d create a local or packaged function to do all of that, but it would work.
The most common use case I can imagine for this is as an input to “strict” `vcat`, so it may make sense to just have one function.
In any case, one can abuse `join` to do this:

```julia
using DataFrames
df1 = DataFrame(a = 1:2)
df2 = DataFrame(b = 1:3)
n1 = size(df1, 1)
n2 = size(df2, 1)
df1[:extra] = 1:n1
df2[:extra] = (1:n2) + n1
df3 = join(df1, df2, on = :extra, kind = :outer)
df3[setdiff(names(df3), [:extra])]
```

so perhaps an extra `kind` argument would do the trick.
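In later DataFrames versions the `join(...; kind = :outer)` spelling was replaced by `outerjoin`, so the same trick would look roughly like this (a sketch, API names per current DataFrames):

```julia
using DataFrames

df1 = DataFrame(a = 1:2)
df2 = DataFrame(b = 1:3)

# Disjoint synthetic keys so every row survives the outer join exactly once.
df1.extra = 1:nrow(df1)
df2.extra = nrow(df1) .+ (1:nrow(df2))

df3 = outerjoin(df1, df2, on = :extra)
select!(df3, Not(:extra))  # drop the helper key; columns a, b remain
```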
For the record, I’m not opposed to adding keywords to `vcat`, as long as the default behavior is strict and consistent with `Base`. If it were just me I’d think a new function or set of functions would be a better option, but @nalimilan makes a good point: it would be bad if `vcat` never got used, even in situations where it definitely should.
As @nalimilan pointed out:
But I do think that in order to allow missing columns one must first allow missing values.
I’m proposing that there are two separate things that can throw an error:
- first, of course, is whether to allow any missing columns at all (normal `vcat` throws this kind of error, whereas `amalgamate` wouldn’t)
- secondly, the `eltype` must allow missing values before any missing value can be filled in (only applicable to `amalgamate`)
To continue @pasha’s example with `allowmissing!`, this is the kind of behaviour I’m looking for:
```julia
julia> d1 = DataFrame(a = 1, b = 2); d2 = DataFrame(b = 3, c = 4);

julia> allowmissing!(d1, :a)

julia> allowmissing!(d2, :c)

julia> eltype.(eachcol(amalgamate(d1, d2)))
3-element Array{DataType,1}:
 Union{Missing, Int64}
 Int64
 Union{Missing, Int64}
```
And if all columns allow missing the result should also allow missing in all columns:
```julia
julia> allowmissing!(d1)

julia> eltype.(eachcol(amalgamate(d1, d2)))
3-element Array{DataType,1}:
 Union{Missing, Int64}
 Union{Missing, Int64}
 Union{Missing, Int64}
```
So my answer to the use case is that optional columns could be declared as allowing missing values before calling `amalgamate`, and columns that should always have a value should not need to allow potential missing values.
TL;DR: I think that the only difference between `vcat` and `amalgamate` should be that `amalgamate` interprets “allow missing values” (`Union{Missing, T}`) to imply “allow missing columns”.
(I’m not proposing we call it `amalgamate` either; I’m merely discussing the behaviour of a function that allows missing columns. Whether it’s implemented through a keyword argument or a separate function is a separate discussion, although I also lean towards a keyword argument on `vcat`.)
This is along the lines of what I had in mind when I said “merging on all unique ids” above. Perhaps `vcat` could remain strict and `join` could start being a bit more flexible. Although, I don’t know how optimized `join` is relative to `vcat`.