Assignment of a `missing` value fails in DataFrames 0.11.1


#1

The following simple assignment fails. I think this defeats the very purpose of using DataFrames, i.e., allowing missing as legit values. Shouldn’t we have Array{Union{Missing,T},1} as the default type to avoid this issue?

julia> using DataFrames

julia> df = DataFrame(A = 1:4, B = ["M","F","F","M"])

julia> typeof(df[:A])
Array{Int64,1}

julia> df[3,:A] = missing
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64...

#2

That’s because your column doesn’t support missing values. You need

julia> df = DataFrame(A = Union{Int, Missing}[1, 2, 3, 4], B = ["M","F","F","M"])
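The same distinction can be seen on plain vectors, without any data frame involved. The following is a minimal sketch that assumes a recent Julia (1.x), where `Missing` lives in Base; the variable names are only illustrative.

```julia
# Assumption: Julia 1.x, where `Missing` and `missing` are part of Base.
plain  = [1, 2, 3, 4]                        # Vector{Int}: cannot hold `missing`
typed  = Union{Int, Missing}[1, 2, 3, 4]     # typed array literal
ranged = collect(Union{Int, Missing}, 1:4)   # same element type, built from a range

typed[3]  = missing   # accepted, the element type allows it
ranged[3] = missing   # accepted as well

println(eltype(plain))
println(eltype(typed))
```

Either of the two `Union` vectors can then be passed as a column when building the data frame.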

#3

@quinnj is there any more elegant way to initialize dataframes? Something without this explicit Union type in front maybe?


#4

Not at this point. In the future there is the possibility that Int?[1, 2, 3, 4] will work, but that’s not certain yet as the T? syntax can also be useful for Union{T, Void}.

(But note that for real use cases, it’s uncommon to create data frames like this with only four rows AFAIK.)

Also, for more details about this, see the announcement.


#5

Could you clarify the reason for not having Union{T, Missing} be the default? Is it performance issues? If I were to, for example, read in a .csv where one column had no missing values, would DataFrames throw an error if I were to replace one of them with missing?


#6

Performance, but more importantly correctness. If your data is not supposed to contain missing values, it’s safer to use an array which does not allow for them, so that you can get an error early if for some reason code attempts to set a missing value.

Yes, when importing from CSV, columns with no missing values will by default not allow setting missing values. It remains to be seen whether in practice it could be a problem, but I doubt it. It’s easy to replace a column instead of recoding it in place when needed, and that allows catching bugs which would introduce new missing values in columns where all values were set before.


#7

Another problem with automatically using Missing is that with the syntax in the OP, A and B are constructed before they are handed to the DataFrame constructor, so if DataFrame wanted to change the element type it would have to copy all the data into a new array. I guess if DataFrame already copies its data then it might be feasible to widen to Union{T, Missing} in the process, but that seems like a lot for default behavior to do.
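The copying cost mentioned here can be illustrated on a bare vector. This is only a sketch, assuming Julia 1.x with `Missing` in Base; it says nothing about what the DataFrame constructor itself does.

```julia
# Assumption: Julia 1.x. Widening the element type allocates a new array.
a = [1, 2, 3, 4]

same    = convert(Vector{Int}, a)                   # no conversion needed
widened = convert(Vector{Union{Int, Missing}}, a)   # new array with a wider eltype

println(same === a)      # the very same array object is returned
println(widened === a)   # a different array: the data was copied

widened[1] = missing     # only the copy changes
println(a[1])            # the original still holds its first value
```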


#8

A surprising number of survey software packages encode “refusals”, “not applicable”, etc. responses as negative numbers. Would I have to create a new column just to use the command replace x = Missing if x < 0?

I need to download 0.7 to explore this further.

Edit: Here I am comparing DataFrames to Stata. For what it’s worth, Stata gets around the whole problem of missing by encoding missing numbers as +Inf. Literally any solution would be better than that, because any command related to x > Real will apply to Missing values as well.


#9

I like the current behavior, and this seems like a non-issue to me, for example

df[:col] = convert(Vector{Union{eltype(df[:col]),Missing}}, df[:col])

Seems pretty simple. In most SQL databases you need to specify whether nulls are allowed, so it’s hardly unprecedented.


#10

Well, in R you always create new columns, so it’s not completely unreasonable. You’d just do something like this:

df[:x] = ifelse.(df[:x] .< 0, missing, df[:x])

or if you want to hardcode the list of missing levels for safety:

using CategoricalArrays
df[:x] = recode(df[:x], -1 => missing, -2 => missing)

Of course, higher-level frameworks like Query or DataFramesMeta will allow you to do this with a nicer syntax.

EDIT: The recode solution currently does not work, as the returned array is allowed to contain only missing values. That could be improved. See https://github.com/JuliaData/CategoricalArrays.jl/pull/103.
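For plain vectors, the same sentinel recoding can be sketched with Base functions only. This assumes Julia 1.x, where `replace` on a collection returns a copy whose element type is widened to fit the new values; the sentinel codes -1 and -2 are just example values.

```julia
# Assumption: Julia 1.x. Sentinel codes -1 and -2 are illustrative only.
x = [3, -1, 5, -2, 4]

# Rule-based: every negative value becomes `missing`.
by_rule = ifelse.(x .< 0, missing, x)

# List-based: only the hardcoded sentinel codes become `missing`.
by_list = replace(x, -1 => missing, -2 => missing)

println(by_rule)
println(by_list)
println(x)   # the original vector is left untouched
```

Both calls build a new vector rather than recoding in place, which matches the "replace the column" approach described above.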


#11

On second thought, I can warm to this idea. It is definitely more reproducible to not alter the underlying data by introducing missings.

Is it possible to have subtypes of missing values?

struct Refuse <: Missing ?

Stata currently offers types that display as .d, .a, or .r. However, if you are working with an old dataset, it’s difficult to know what those one-letter missings mean. This is a problem I am currently having with Stata.


#12

For now there’s no AbstractMissing type, but any type can overload ismissing, which should be enough for most (if not all) purposes. I have considered creating a flavoured Missing type which would allow specifying the kind of missing values, but that’s lower in the priority list than getting the basics to work.
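Overloading ismissing for your own types could look like the sketch below. The types `Refused` and `NotApplicable` are hypothetical names made up for illustration, and the code assumes Julia 1.x. Functions that compare against `missing` directly, rather than calling `ismissing`, would presumably not treat these values as missing.

```julia
# Hypothetical marker types for different kinds of non-response.
struct Refused end
struct NotApplicable end

# Opt these types into the `ismissing` protocol.
Base.ismissing(::Refused) = true
Base.ismissing(::NotApplicable) = true

responses = Union{Int, Refused, NotApplicable}[4, Refused(), 2, NotApplicable(), 5]

n_missing = count(ismissing, responses)                  # number of flagged entries
valid     = [r for r in responses if !ismissing(r)]      # keep only real answers

println(n_missing)
println(valid)
```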


#13

It would be somewhat advanced, but it’s not entirely implausible for the compiler to become smart enough to recognize that you’re copying a vector to another one with the same memory layout and not keeping any references to the original, which would allow reusing the memory without any copying while retaining the clearer semantics.


#14

FWIW, I’d also be in favor of defaulting to Union{Int, Missing} rather than Int. [EDIT: or allowing on-the-fly conversions when one attempts to assign missing values.] I recognize there may be performance implications, but I’m not sure about the correctness argument. I guess my sense was that the aim of DataFrames was, in part, to accommodate real world data, which often has the possibility of being missing, even if there were no missing values in the original read-in. It seems reasonable that by default one allows missings.

I also do frequently set things to missing in columns that don’t originally have missing data. If, for example, I realize there’s a subpopulation for which I don’t think my survey data on a given response is valid (say I realized my enumerator didn’t understand the question he was supposed to ask, or I have some GPS coordinates that turn out to have been hand-entered wrong and are in the wrong country), I want to change them to missing.

And I also frequently have data that hard-codes missings as sentinel values like -1 that I later want to swap for missings.


#15

It depends on what data you are dealing with. In a lot of cases data is stored using NaN and empty strings anyway, in which case you’d have to set up some parsing to convert them to missing anyway. In other cases the data is effectively read-only. The point is, you never really know, so shouldn’t the default behavior be not to change the type of a column rather than to change it?

Also, for what it’s worth: I’ve been using DataFrames.jl for quite a while through several manifestations, and I find that I strongly prefer there to be no special handling of the columns. It’s just so much cleaner that way. And since the columns are just AbstractVectors, and not some special data structure (to me this is basically the best feature of DataFrames.jl), why not just decide what you want when you insert the column in the first place? If you guess wrong, it’s no big deal, you can just convert it.


#16

I should have noted that you can call allowmissing!(df, col) to make a column accept missing values. I have just filed a pull request so that you can also do allowmissing!(df), which will apply this change to all columns.


#17

That’s great, thanks for pointing that out!

Here’s a related question: how hard would it be to create a promotion rule for DataFrame columns that allowed, say, an Int to be promoted to a Union{Int, Missing} when someone tries to assign a missing to an entry? That seems like it would meet @ExplodingMan’s desire to avoid different default typing, while also addressing @mwsohn’s (and my) desire to make using missing seamless.


#18

That’s not really possible. When you do df[:x][1] = missing, you are calling setindex! on the vector, so DataFrame is not involved at all, and Julia (rightly) does not allow changing the type of a binding from inside a function which takes it as an argument. OTOH df[1, :x] = missing could replace the column with a new vector accepting missing values, but that would be very surprising.
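The difference between the two cases can be sketched with a plain Dict standing in for the data frame. The helper `setvalue!` is a hypothetical name invented for this illustration, not a DataFrames function, and the code assumes Julia 1.x.

```julia
# A Dict of named vectors stands in for a data frame (illustration only).
cols = Dict{Symbol, AbstractVector}(:x => [1, 2, 3])

# Case 1: assigning through the vector itself. The container is not involved,
# so the only possible outcome is a conversion error.
v = cols[:x]
err = try
    v[1] = missing
    nothing
catch e
    e
end
println(typeof(err))

# Case 2: a hypothetical container-level setter that swaps in a widened column.
function setvalue!(cols, name, i, value)
    col = cols[name]
    if !(value isa eltype(col))
        # Build a new vector that can hold the value, then rebind the column.
        col = convert(Vector{Union{eltype(col), typeof(value)}}, col)
        cols[name] = col
    end
    col[i] = value
    return cols
end

setvalue!(cols, :x, 1, missing)
println(eltype(cols[:x]))
println(v === cols[:x])   # the previously obtained vector is no longer the column
```

The last line hints at one way such behavior might surprise: any reference to the old column obtained earlier silently stops tracking the data frame.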


#19

Could you clarify what you mean by surprising?

From the user’s perspective, as long as missing values are propagated in a reasonable way, this wouldn’t be unexpected.


#20

I usually make all columns either allow missing values or have a single concrete type, depending on the operation. For example, for a first difference, which I know will produce missing values, I allow missing for all columns. For any transformation that drops the missing values, I keep the columns as a single type.
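As a rough sketch of that workflow on a single vector: the helper `firstdiff` below is a hypothetical name, written for a plain `Vector` under Julia 1.x, and is not part of DataFrames.

```julia
# Hypothetical helper: first difference that keeps the original length,
# so the first entry has no predecessor and is `missing`.
function firstdiff(x::Vector{T}) where {T}
    out = Vector{Union{T, Missing}}(missing, length(x))
    for i in 2:length(x)
        out[i] = x[i] - x[i - 1]
    end
    return out
end

prices = [10, 12, 15, 19]

d        = firstdiff(prices)         # element type allows `missing`
complete = collect(skipmissing(d))   # dropping missings restores a plain eltype

println(eltype(d))
println(eltype(complete))
```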