I’d like to point out that in addition to Errors, we also have Warnings, which should be enough to avoid surprises for users in many cases. (I also like the idea of a keyword argument `strict` that defaults to `false`.)
Yeah, to be honest, ideally I’d want DataFrames to be stricter than `Base` when it comes to type conversion and promotion. The chances of this screwing me up when manipulating DataFrames are just much, much higher than when I’m writing code to do normal things. A good example that’s definitely happened to me before: some bozo gives me a CSV (ugh) that for some reason parses a column that is supposed to be `Int` as `Float64`, then I concatenate it with data that came from somewhere else with the proper types, and boom, all of my (presumably categorical) data is now floats, and I have to be paranoid about converting it back (in case something happened to those floats and they get the wrong floor or whatever). An even uglier scenario is if something winds up as “disguised” strings and now I have a column with these horrible strings everywhere. It’s just no good for anybody; I can do without it.
I have a feeling I’m going to have a very hard time convincing people that typing in DataFrames should be stricter than in `Base`, but that would be my preference.
Currently warnings are incredibly slow (I think because they trigger tracebacks), so I’d prefer never to have any intentional warnings, though I suppose they might be better than the alternative in a few cases.
That promotion to `Any` is some bad mojo that I don’t want to see in my data.
I am afraid I don’t understand the example; if you have the original data in the first place, why do you have to convert back from the botched merge?
In any case, data validation and sanity checks cannot be substituted for by “strict” type handling; similarly to static types not being substitutes for unit tests.
It could be argued that errors are terminally slow.
Good question, beats me! I always get stuff that is a horrible mess and have to try to make it into something cogent as best I can. My example was that I had data from two different sources: one of them was relatively nice, one of them was awful, and I wound up with something that looks more like the awful one. That’s not uncommon where I work, sadly.
I suppose you are right, but writing unit tests for data is unbelievably time-consuming, and I like to do whatever I can to preserve sanity. If I’m in the initial phases of something where I get data in terrible formats, chances are close to 0 that I’ll have time to do unit tests on the data itself. Horrible I know, but this is life as a data scientist apparently.
Point taken. All I was saying is that it’s not so good to have deliberate warnings, i.e. warnings that you’d expect any users to see during “proper” use.
I am afraid I don’t understand what you mean here, can you elaborate?
I find it good practice to do some basic checks for nontrivial datasets after reading. Codebooks are not always in sync with reality, and it is good to catch stuff early (e.g. some disguised missing-value representations like `-1`, `"X"`, `99999`; I have seen these in the past 2 months).
So after reading data, I do a couple of

```julia
@assert eltype(df[:col1]) == Int
@assert all(0 .< df[:col2])
@assert Set(unique(df[:col3])) == Set(["the", "levels", "I", "expected"])
```

This saves a lot of grief in the medium run.
I second that… I don’t see a way to handle these issues at the language level. I have horror stories too (CSVs with categorical variables “P” and “F” are fun), but I think the user-programmer invoking `vcat` just has to:

- check incoming data’s types and use `vcat(d1, d2, strict=true)`, or
- not check incoming data, use `vcat(d1, d2, strict=false)`, and check the data after.
One idea to streamline this checking is to have a `DataFrames.ismatch(d1, d2)` function that validates that all columns have the same types. Maybe `typematch` or `columnmatch` or IDK.
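A minimal sketch of what such a check could look like (the name `columnmatch` and its behaviour are assumptions from this thread, not an existing DataFrames function; it uses 0.11-era single-argument column indexing like the `df[:col1]` asserts above):

```julia
using DataFrames

# Hypothetical helper: true iff both data frames have the same column
# names in the same order, with identical element types.
function columnmatch(d1::AbstractDataFrame, d2::AbstractDataFrame)
    names(d1) == names(d2) || return false
    all(eltype(d1[n]) == eltype(d2[n]) for n in names(d1))
end
```

A caller could then guard a concatenation with `columnmatch(d1, d2) || error("incompatible columns")` before invoking `vcat`.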
On 0.6.2 and DataFrames 0.11.5, I get:

```julia
julia> vcat("one", 1.0)
2-element Array{Any,1}:
 "one"
 1.0
```
The same applies with DataFrames. So if I have a nice numeric column and try to `vcat` it with a column from a CSV that left commas in its numbers, the merged data will be demoted to `Any`. That’s bad; I expected an error there. As mentioned above, it may just have to be solved by pre-checking or post-checking.
While deep down I know you are absolutely correct, I can’t shake the feeling that just having strict dataframe operations alone will save an enormous amount of blood, sweat and tears…
I understand what you are saying, it is just not clear why you think things like `vcat("one", 1.0)` should error (I am assuming that the suggestion for `DataFrame`s is by analogy).
Julia supports collections with abstract element types without any problems; they are part of the language. Not necessarily performant, but occasionally quite handy.
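For reference, this is just Base’s promotion at work: element types with a common promotion are promoted, and anything else falls back to `Any`:

```julia
# Numeric eltypes promote to a common concrete type;
# unrelated eltypes fall back to Any.
@assert eltype(vcat([1], [1.5])) == Float64
@assert eltype(vcat([1], ["a"])) == Any
@assert promote_type(Int, Float64) == Float64
@assert promote_type(Int, String) == Any
```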
Maybe `vcat("one", 1.0)` shouldn’t error, but it certainly causes problems when merging disparate DataFrames. I hear your point that `vcat` should be more of a raw concatenation, and that is useful sometimes. If so, then I think Julia could help as a language by giving DataFrames a safer function at a higher level, or a keyword to `vcat`.
Otherwise users everywhere will be writing lots of redundant code to make sure things didn’t break after a merge. You could have `vcat_safe(d1, d2)`, or equivalently `vcat(d1, d2, strict = true)`, or even `vcat(d1, d2, strict_types = true, strict_missing = false)` to cover cases (1) and (2) above separately. If it’s not part of DataFrames, then someone (maybe even I) will end up writing a module to do it.
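For illustration, a `vcat_safe` along these lines could be sketched as follows (the function name and error messages are hypothetical, not part of DataFrames; column indexing is 0.11-style as elsewhere in this thread):

```julia
using DataFrames

# Hypothetical wrapper: refuse to concatenate unless column names and
# eltypes line up exactly, so nothing can silently widen to Any.
function vcat_safe(d1::AbstractDataFrame, d2::AbstractDataFrame)
    names(d1) == names(d2) || error("column names differ")
    for n in names(d1)
        eltype(d1[n]) == eltype(d2[n]) ||
            error("column $n: eltypes $(eltype(d1[n])) and $(eltype(d2[n])) differ")
    end
    vcat(d1, d2)
end
```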
Your example above with the asserts brings up the issue of levels in CategoricalArray or PooledArray. I think R’s `dplyr::bind_rows` just dumps them down to strings, but maybe there could be an `append_levels` keyword here as well.
Another thing that would help all of this is having a more reasonable `describe` command. The current behavior spits out information for each column one at a time, rather than a table that summarizes everything. Having a more readable format would really help this type of debugging.
How do people feel about having `describe` return a DataFrame or similar tabular object, rather than simply dumping the results of `describe` for each column?
That sounds great, but unfortunately I think some special consideration will have to be given to making it readable on screen for it to also serve the most pedestrian purpose of `describe`. Otherwise we again may need either multiple functions or a keyword.
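A hedged sketch of the tabular idea (the helper name `describe_table` and its columns are hypothetical, not the actual `describe`; 0.11-era column indexing assumed): build one summary row per column and return it as a DataFrame, which then prints as a single table:

```julia
using DataFrames

# Hypothetical: a describe-like summary that returns a DataFrame,
# one row per column of the input.
function describe_table(df::AbstractDataFrame)
    DataFrame(column  = names(df),
              eltype  = [eltype(df[n]) for n in names(df)],
              nunique = [length(unique(df[n])) for n in names(df)])
end
```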
I have also found the undocumented `showcols` function to be extremely useful. There are a few little functions like that which badly need to be documented…
You’re right. This got me thinking and I’ve completely changed my mind on the strictness of `vcat`. I no longer think it should throw any errors at all, and we wouldn’t even need a `strict` keyword argument. `vcat` should just add `missing` and `Union{Missing,T}` wherever needed, similar to what it did before 0.11. The strict function I was looking for would be `append!`. Hear me out:
`Base.append!` is more strict than `Base.vcat`. Some examples:
```julia
julia> append!([1], [1.5])
ERROR: InexactError

julia> append!([1], ["a"])
ERROR: MethodError

julia> append!([1], [missing])
ERROR: MethodError

julia> vcat([1], [1.5])
2-element Array{Float64,1}:
 1.0
 1.5

julia> vcat([1], ["a"])
2-element Array{Any,1}:
 1
 "a"

julia> vcat([1], [missing])
2-element Array{Union{Int64, Missings.Missing},1}:
 1
 missing
```
`append!(a, b)` can’t change the type of `a` since it’s an in-place operation. The generalisation of this behaviour to `a::DataFrame` would, in my opinion, be to preserve the types of all columns. That is, `append!` would not change the type of any column, nor add any new columns, nor change the order.
The generalisation of `vcat`, on the other hand, would then be to just do whatever is needed to accommodate the arguments. Taken to the extreme, not even `vcat(DataFrame(a="b"), 123)` throws an error; it just falls back to `Array{Any}` if nothing else makes sense.
`append!(a, b)` could on the other hand have a keyword argument or two:

- `append!(a, b, allow_missing_columns=true)`: preserves column types in `a` but would allow some missing columns in `b`. This implies that the corresponding column type in `a` for a missing column must already support missing values.
- `append!(a, b, allow_new_columns=true)`: preserves column types in `a` for old columns but makes an exception to also allow new columns. Not sure how strict to be about `b`, but I think it would make sense to also respect their column types, i.e. any new columns must already have a container type that allows missing values.

What should the default values for `allow_missing_columns` and `allow_new_columns` be? Both false? Both true?
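The `allow_missing_columns=true` case can be illustrated with plain column vectors (the keyword itself is only a proposal): a column of `a` that `b` lacks would be padded with `missing`, which only works if the column’s eltype already admits it:

```julia
# Padding an existing column with `missing` for the rows coming from `b`.
# This requires the eltype to already be a Union with Missing.
col = Union{Int, Missing}[1, 2]
append!(col, fill(missing, 3))   # works: eltype admits missing
@assert length(col) == 5 && all(ismissing, col[3:end])

# With a plain Int column the same append! throws, which is exactly
# the strictness being proposed.
strict_col = [1, 2]
threw = try append!(strict_col, [missing]); false catch; true end
@assert threw
```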
(The discussion about `describe` is interesting, but doesn’t that belong in its own thread?)
Holy crap, it seems you’re right! Perhaps we’ve all been looking at the wrong function? I’m not totally sure this is the proper solution though, since

```julia
julia> append!(rand(2,2), rand(2)')
ERROR: MethodError
```

so it seems that `append!` was intended for `Vector`.
Another concern is that we probably need both mutating and non-mutating versions of both the “strict” and “non-strict” forms.
If nothing else, this shows that we should have all thoroughly read the `Base` `Array` documentation before making all these suggestions.
I’m not a fan of the idea that data frames are like matrices. I think they are more like a collection of named columns. Conceptually speaking, I’d say `append!` of two data frames is like an element-wise `append!.(x, y)` (note the dot “`.`”), where the elements would be the columns of `x` and `y`. The result is then also a collection of named columns, i.e. a data frame.
I mean, technically I guess we could overload `append!.(x::DataFrame, y::AbstractDataFrame)` instead, but I’m not sure if that would really help anyone. So I don’t think that the existing `append!(x::DataFrame, y::AbstractDataFrame)` is too far off from following the intents of `Base.append!`.
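The “collection of named columns” view can be sketched with plain vectors in a `Dict` standing in for a data frame (just an illustration of the semantics, not an implementation):

```julia
# Element-wise append!: each column of y is appended in place to the
# matching column of x, so every column keeps its eltype.
x = Dict(:a => [1, 2], :b => ["p", "q"])
y = Dict(:a => [3],    :b => ["r"])
for k in keys(x)
    append!(x[k], y[k])
end
@assert x[:a] == [1, 2, 3]
@assert x[:b] == ["p", "q", "r"]
```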
I think @gustafsson gave a good explanation of why `vcat` should automatically handle column names, order, and missings by default. For example, I sometimes need to aggregate DataFrame results from a batch analysis, but they may have different columns, so I usually do this in a loop:
```julia
d = DataFrame()
for i in loop
    d = vcat(d, newdataframe)
end
```
For better performance, a more appropriate method may be `append!`, so I guess `append!` needs some keyword arguments to turn the strict mode on/off.
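With plain vectors the performance difference is easy to see: repeated `vcat` copies the accumulated result on every iteration, while `append!` grows one container in place and preserves its eltype (a sketch of the pattern, not DataFrames-specific):

```julia
# In-place accumulation: one container, eltype fixed up front.
d = Int[]
for batch in ([1, 2], [3], [4, 5])
    append!(d, batch)   # no copy of the already-accumulated prefix
end
@assert d == [1, 2, 3, 4, 5] && eltype(d) == Int
```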