Vcat does not append DataFrame rows as it did in 0.10

A super inefficient but effective way to deal with this would be to gather all the little data frames, vcat them, then spread again.
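
For instance, a rough sketch of that workaround (made-up frames; stack/unstack play the role of gather/spread here):

using DataFrames

df1 = DataFrame(id = [1, 2], a = [10, 20])
df2 = DataFrame(id = [3, 4], a = [30, 40], b = [1.5, 2.5])

# gather each frame into long form so both have the same columns (:id, :variable, :value);
# note everything in :value gets promoted to a common element type, hence "super inefficient"
long = vcat(stack(df1, [:a]), stack(df2, [:a, :b]))

# spread back to wide form; rows that never had :b come back as missing
wide = unstack(long, :id, :variable, :value)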

I think that amalgamate should throw an error if it needs to expand the type.

I disagree; or more generally, I think that there should be a function (possibly under a different name) which just “does the right thing” by selecting the sufficiently wide container type. This is what cat & friends, map, and collect do, and it is very common in idiomatic Julia code.
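
For example, in Base:

vcat([1, 2], [3.0])                        # Vector{Float64}: element type is promoted, not an error
map(x -> x > 1 ? x : missing, [1, 2, 3])   # Vector{Union{Missing, Int64}}: widened only as needed
[1, 2, missing]                            # Vector{Union{Missing, Int64}}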


Agreed. I think what this thread is showing is the divide between database users and “scientific users” (however defined). Whereas I’m used to the idea that a column supports null values unless explicitly disallowed, others like the strictness of manually specifying that null is allowed (I presume because in some cases, e.g. where data are generated from experiments and the like, null is less common).

I’m a fan of using union for the “missing always allowed”/database behavior, as I think the name aligns with the segment of users expecting that behavior. How you deal with unioning two data frames with differing columns could be as strict as SQL (meaning you’d have to define the common structure on each side of the union) or as lenient as plyr’s rbind_all. I’m a “let everything slide” type of guy, but that’s because I prefer to use more hardware than to try and squeeze every last drop of performance out of a function.


My understanding is that the reason for the transition from Nullable{T} to Union{Missing, T} was that the “strict” approach became cumbersome to deal with, especially when programming complex data transformations.

My current (but continuously evolving) approach to programming in Julia is to mix mostly type-stable code with a modicum of type-unstable code, ideally at carefully selected points, not unlike expansion joints for bridges. It needs to happen, and with carefully organized code the performance cost is trivial.
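
For example (a toy sketch; the names here are made up): a function barrier keeps the unstable column lookup out of the hot kernel.

using DataFrames

# Pulling a column out of a DataFrame is type-unstable: the element type
# is only known at runtime.
get_column(df, name) = df[!, name]

# The kernel is compiled separately for each concrete column type it receives,
# so the reduction stays fast despite the unstable lookup above.
column_sum(df, name) = sum_kernel(get_column(df, name))
sum_kernel(col) = sum(col)

df = DataFrame(x = rand(1_000))
column_sum(df, :x)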

I don’t see a strong case for erroring when type expansions are required. I consider this a performance issue, to be dealt with (if necessary) once I am convinced that the code is correct. Until then, this is just a distraction, and I am likely to refactor the whole thing 5-10 times before I get there. For non-interactive use, we also have the appropriate test facilities for dealing with it, e.g. @inferred.
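
For example (a toy case, not from this thread):

using Test

stable(flag)   = flag ? 1 : 2      # always returns an Int
unstable(flag) = flag ? 1 : 1.0    # return type depends on a runtime value

@inferred stable(true)     # passes and returns 1
@inferred unstable(true)   # throws: inferred return type is Union{Float64, Int64}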

Yeah, I meant in the more general case. Julians seem to fall into two camps: people who love the strict typing and (per your later comment) people who like Julia for being faster/more elegant but aren’t chasing CPU cycles. I’m definitely the latter.

For what it’s worth, these issues go a little deeper than “chasing CPU cycles”. Any time you actually have to do something with data, it has to be put in a form that other programs understand. missing (and its equivalents in other languages) have no universal meaning; they all have to be removed somehow before data containing them is useful for just about anything you can imagine. I’ve expended a huge amount of time and energy preparing carelessly formatted datasets to be ingested into some algorithm or another. In many of those cases, all of that could have been avoided if the people in charge of the databases had thought through their choices more carefully. Every database operation performed without consideration for what it actually means for the underlying data makes life more difficult for those who have to figure out how to use that data later.

So no, I definitely don’t want the DataFrame API to just say “hm, not sure what you wanted here, I’ll just add some missings”, because I’m going to have to clean up that mess later one way or another. This is one of the reasons I like to get to a point where containers are not in the form AbstractArray{Union{T,Missing}} at all (in my workflow, not in general).

Similar things can be said about typing: you don’t want to go converting Float64 to Float32 for no reason, you don’t want a UInt64 to become an Int64, you don’t want data that has no conceivable reason to be a string to be a string (this is a big one), you don’t want to change the precision of DateTime (a loathsome data type from the outset) etc.

Despite this rant, I really am ok with having “less strict” functionality. Sometimes you just have to “do what you have to do”, I’m definitely sympathetic to that (after all, many of the issues I mentioned above are not really relevant to DataFrames.jl). However, we should segregate that functionality somewhat and try to make sure that in most cases it is not “default” and cannot arise from simply being careless.


Yes, after 20 years in industry, I’m quite aware :joy: But pandas, R, SAS, databases (and I’m sure other environments) all allow inline missings by default. It’s only in Julia that I see strictness touted as a virtue and automatic missings as a detraction.

I can’t speak for R and SAS, but for pandas and SQL this could certainly become a huge pain in the ass (pandas isn’t as bad in some cases because of NaN, but worse in others because it has comically many different types of missing).

I think what happened with Julia is that it was designed for scientific computing from the outset, so many (if not the majority of) people probably very much have the ultimate use of data in mind when they talk about DataFrames and databases and just want to limit the insanity.

I understand your position, but I think that there is a fundamental tension between the two approaches that just can’t be resolved. The only solution I see is having separate packages (or even ecosystems of packages) with different semantics, which are each more or less coherent internally.

I see DataFrames as catering to the users coming from R and similar languages, who expect DWIM semantics for exploratory data analysis. It is already more disciplined in some respects than the R ecosystem, but pushing this further would just make life difficult for this crowd.


That’s why I’ve said that I’m fairly satisfied with the status quo, and I’m ok with adding functions (or even keyword arguments) that are less strict, as long as the “default” behavior is strict (particularly when this involves functions from Base). I’m getting the sense that we all more or less agree with this viewpoint, no?

Honestly I’d probably be perfectly happy to “separate myself” and keep everything in hierarchical data structures, like my colleagues and I all did when I was in physics. This becomes extremely difficult when everybody around me wants absolutely everything in tabular format (it’s even been suggested to me that non-tabular formats are unholy abominations). Most of the data I’m getting was not formatted by me, so I rarely have much choice about what it looks like until I put in the massive amount of labor that is usually involved in making it useful.

Assuming that “strict” means not adding missing, I disagree about this, for the reasons outlined above, at least for DataFrames. At the same time, I would completely agree with you when it comes to <: AbstractArray types. That’s why I think that having a different function for each operation would be reasonable, even though there is an intersection in functionality for some cases.

I see DataFrames as catering to the users coming from R and similar languages, who expect DWIM semantics for exploratory data analysis. It is already more disciplined in some respects than the R ecosystem, but pushing this further would just make life difficult for this crowd.

I would agree with this. I would imagine that if DataFrames is too strict, people will just make every column in the data frame Union{T, Missing} as soon as it’s created.
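
Which is a one-liner, for instance (made-up frame):

using DataFrames

df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])
allowmissing!(df)   # every column’s element type becomes Union{Missing, T}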

I understand the CS vs applied gap - CS tends to want things to be pure and efficient computationally, hence the whole kerfuffle with missing and Nullable in DataFrames 0.10.

However, I do not understand the purported divide between “science” and “data” use. These discussions seem to imply that “Scientists” have perfect, unchanging datasets with no column names, like maybe a stress tensor derived from first principles. Ok, sometimes.

But that seems overly constrained - what question would I even ask here to understand this difference? In my mind… everything is data but clearly the “science” users have something subtly different in mind. And this should be on another thread…

No, ideally “science” users just have carefully formatted datasets for which the route from data stored in a database to data ingested by an algorithm is unambiguous. It doesn’t mean that data has to live in the dataset as a fundamental abstraction. What I’m asking is for DataFrames to make it easy to do this. If I do a whole bunch of data manipulations I want them to throw an error and tell me if I am inadvertently doing something that’s going to make it harder for me in the long run to use my data. If I have to override that, I’d like to be able to, but I don’t want any ugly surprises. There is no way around paying for your sins when you actually use your data, so I prefer to remain righteous in the first place, wherever possible.

By the way, I was never talking about science vs private industry in the first place. In my current role as a data scientist, I still have to get things in an ingestible form just like I did when I was doing physics. The difference is that now that’s immeasurably more difficult. So, in a way I care more now; most of this stuff was just a complete non-issue when I was doing physics (we didn’t use tabular formats anyway).

I get the sense that I’m now losing the argument however. Can somebody offer a concrete suggestion for how they’d like vcat to look in the “less strict” way? Are we talking about creating columns and adding missings by default in all cases?


Thanks for the clarification on “science”.

What are the open items to be addressed in your suggestion? I see:

  1. Whether to modify columns to allowmissing!
    • Changing type to Union{T, Missing}
    • Tricky bit in there if the column is not an Array
  2. Whether to change the actual type of the column
    • e.g. joining an Int with a Float64
  3. Sources with differing column order (fixed in new vcat)
  4. Determining a new column order

Is there more we have talked about?

Ok, for the record here is my position on these:

  1. allowmissing! should be called explicitly; vcat should not do this. I’d be happy with some variant of vcat(df1, df2, strict=false) to override this behavior (see the sketch below). If the column is not an Array, allowmissing! will try to do what it can, even if that means converting to a Vector. This is now acceptable because the user explicitly asked for it.
  2. Types should never be changed by default. Again, I’m open to either a keyword or another function. I suppose the keyword should do some kind of promotion? I don’t think I care that much because I would probably always opt to explicitly convert myself.
  3. vcat should be allowed to change column order. (I think that’s what it does right now?) This might be a little unfortunate if some subset of your dataframe is supposed to be a matrix, but I tend to think of DataFrames as a Dict{Symbol,<:AbstractVector} so I think people should be conscious of that if they think they have matrices.
  4. The simplest solution is probably just that the column order is the order of the first data frame, followed by the second, etc. (I realize that statement is confusing, but I think people know what I mean?) I think it’s a good idea to preserve the column order of at least one of the data frames to whatever extent possible.

I think I was getting myself horribly off-track, so I thought it might be helpful to succinctly summarize my views, for whatever that’s worth.
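
To make points 1 and 2 concrete, this is the kind of explicit workflow I have in mind under strict defaults (made-up frames):

using DataFrames, Missings

upper = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
lower = DataFrame(a = [11, 12], b = [14, 15], c = [17, 18])

# The user, not vcat, decides that an absent column is acceptable:
upper[!, :c] = missings(Int, nrow(upper))   # add :c to upper, filled with missing
allowmissing!(lower, :c)                    # widen lower.c to Union{Missing, Int} to match

vcat(upper, lower)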


Those 4 sound reasonable:

  1. There should still be a parameter to vcat (or whatever it’s named) to do this missing thing, like add_new_columns = true; it would throw if a new column didn’t allow missing.
  2. Agree, no actual type conversion - that’s a clear error.
  3. I don’t expect or care much about column order in a DataFrame.
  4. See 3.

vcat might need stricter behavior; the following is the behavior I would like append! to have for DataFrames.

using DataFrames, Missings
upper = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
#= 
│ Row │ a │ b │
├─────┼───┼───┤
│ 1   │ 1 │ 4 │
│ 2   │ 2 │ 5 │
│ 3   │ 3 │ 6 │
=#
lower = DataFrame(a = [11, 12, 13], b = [14, 15, missing], c = [17, 18, 19])
#=
│ Row │ a  │ b       │ c  │
├─────┼────┼─────────┼────┤
│ 1   │ 11 │ 14      │ 17 │
│ 2   │ 12 │ 15      │ 18 │
│ 3   │ 13 │ missing │ 19 │
=#

When I append upper and lower I would like to see:

 # Throws an error currently; I would like this to not throw an error
append!(upper, lower)
6×3 DataFrames.DataFrame
│ Row │ a  │ b       │ c        │
├─────┼────┼─────────┼──────────┤
│ 1   │ 1  │ 4       │ missing  │
│ 2   │ 2  │ 5       │ missing  │
│ 3   │ 3  │ 6       │ missing  │
│ 4   │ 11 │ 14      │ 17       │
│ 5   │ 12 │ 15      │ 18       │
│ 6   │ 13 │ missing │ 19       │




  1. Whether to modify columns to allowmissing!
    • When a column doesn’t exist in one data frame, fill it with missings for that section (see the sketch below).
    • When one column is of type Union{T, Missing} and the other is of type T, promote the columns as a whole to allow missings.
  2. vcat already promotes types: if x is Int64 and y is Float64, vcat promotes the concatenated vector to Float64, at least on 0.6.2. I would assume this should hold for DataFrames too.
  3. Column order shouldn’t matter; that’s what named indices are for.
  4. Determining a new column order: the column order of the upper one + the extras at the end?

Additionally, I feel like I will overwhelmingly want to propagate missings in day to day use, making me vote for strict being a keyword argument.
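
Something like this hypothetical helper (not an existing DataFrames function; the name is made up) would give me that behavior:

using DataFrames, Missings

# Append `lower` onto `upper`, creating any absent column as all-missing on the
# side that lacks it, and letting vcat promote the column element types.
function append_union(upper::DataFrame, lower::DataFrame)
    upper, lower = copy(upper), copy(lower)        # work on copies; return a new frame
    for name in setdiff(propertynames(lower), propertynames(upper))
        upper[!, name] = missings(nrow(upper))     # column only present in `lower`
    end
    for name in setdiff(propertynames(upper), propertynames(lower))
        lower[!, name] = missings(nrow(lower))     # column only present in `upper`
    end
    vcat(upper, lower)
end

append_union(upper, lower)   # with the upper/lower frames above: the 6×3 result shown earlier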


Would

vcat(fill("one", 3), fill(1.0, 3))

and

vcat(fill(1, 3), fill(1.0, 3))

violate this?