The new “missing” type introduced in DataFrames causes errors in all the Julia programs I wrote previously, and I would like to find a way to get rid of it.
For example, the competerank() function from StatsBase, which I use alongside the MultipleTesting package, is not compatible with the missing type.
The command r = competerank(pv)
now gives an error:
MethodError: no method matching competerank(::Array{Union{Float64, Missings.Missing},1})
I tried replacing it with: r = competerank(skipmissing(pv))
I’m not quite sure what’s being asked here, but I suggest that if your dataset really does contain fields of missing data that you define the appropriate functions for the Missing type. After all, that’s the whole reason why Missing exists.
Note that skipmissing returns an iterator, not an array. If you want an array you can do collect(skipmissing(pv)). If you want to replace the missing values (you seemed to imply that you’d want to replace them with floats) you can do Missings.replace(pv, x), but be aware that replace also returns an iterator so if you want an array you again have to do collect.
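The difference between the lazy iterator and a materialised array can be sketched as follows. This is a minimal illustration assuming Julia 1.x, where skipmissing and coalesce live in Base; the pv values are made up, and broadcasting coalesce is used here as a Base-only stand-in for Missings.replace followed by collect.

```julia
# Hypothetical p-value column with one missing entry.
pv = [0.01, missing, 0.20, 0.03]

# skipmissing is lazy: it wraps pv without copying and is not an array.
itr = skipmissing(pv)

# collect materialises a plain Vector{Float64} with the missing entries dropped.
pv_dropped = collect(itr)

# To keep the original length and substitute a value instead,
# broadcast coalesce over the column (1.0 is an arbitrary replacement).
pv_filled = coalesce.(pv, 1.0)
```

Note that dropping changes the length of the vector, while filling keeps positions aligned with the other columns of the table.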
It’s too bad that competerank doesn’t accept any iterable, but that limitation comes from sortperm. I guess it would be possible to write a version which automatically collects into an array for convenience. Could you file an issue against StatsBase?
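A convenience wrapper of the kind suggested above might look roughly like this. The name apply_collected is hypothetical and is not part of StatsBase; sortperm from Base stands in for competerank so the sketch needs no packages.

```julia
# Hypothetical helper: apply an array-only function `f` to any iterable,
# dropping missing values and collecting into a Vector first.
apply_collected(f, itr) = f(collect(skipmissing(itr)))

pv = [0.20, missing, 0.01, 0.03]

# sortperm needs an indexable array, so the lazy iterator alone is not enough;
# the helper collects before calling it.
order = apply_collected(sortperm, pv)
```

The returned indices refer to the collected vector, not to the original column, so they cannot be used to index pv directly.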
Note that this is not special to missing, it also occurred with NA AFAICT.
With Julia 0.5 and the corresponding version of DataFrames, I could read my data into a DataFrame and then directly use a column of the table (pv in my example) in statistical functions such as competerank().
I did a quick test following the suggestion of @ExpandingMan, and it seems that compatibility is restored by this line:
pv = collect(skipmissing(pv))
at least for the functions
competerank()
adjust()
There may be a slowdown, but my priority is to keep my old programs working. Thanks!
You should find that for the most part, with the latest version of dataframes you can again “use directly a column of the table”, but some functions simply don’t make sense in the presence of missing, and their behavior must be defined in these cases.
I would be interested to know what the philosophy of StatsBase is toward Missing. Is lifting considered to be a default behavior?
I see, this was actually possible because DataArray lied about its element type and pretended it could not contain missing values. That situation had to be fixed: functions which support missing values now have to opt in to accept such arrays, or the user needs to remove missing values manually.
At this point the philosophy is that you need to skip missing values manually before calling StatsBase functions. I guess some functions could accept missing values and return missing if they find one, but that’s not terribly useful so that’s low priority.
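The opt-in behaviour described above comes down to method dispatch on the element type. Here is a minimal sketch, assuming Julia 1.x; total and total_skipping are hypothetical functions written for illustration, not StatsBase APIs.

```julia
# Hypothetical routine restricted to real-valued vectors,
# as many statistical functions are.
total(x::AbstractVector{<:Real}) = sum(x)

# Hypothetical variant that opts in to missing values by skipping them itself.
total_skipping(x::AbstractVector{<:Union{Missing, Real}}) = sum(skipmissing(x))

pv = [0.25, missing, 0.5]   # element type is Union{Missing, Float64}

# Ask whether a matching method exists, without triggering a MethodError.
accepts_raw   = applicable(total, pv)
accepts_clean = applicable(total, collect(skipmissing(pv)))

# The opt-in variant accepts the column as is.
s = total_skipping(pv)
```

Because the column's element type now includes Missing, the strict signature no longer matches it. Either the caller strips the missing values first, or the function widens its signature and decides what to do with them.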