Plotting options for DataFrames with NullableArray

question

#1

I have been transitioning to Nullables in my data analysis code, and only belatedly realized that neither Gadfly nor Plots.jl works with current master DataFrames. Is there a plotting package that is usable at the moment?



#2

Here is a workaround:

"""
Remove missing values from a dataframe, and make sure that all the
columns are plain vanilla Vectors, not NullableVectors.

Horrible hack, until plotting libraries catch up.
"""
function sanitize_df(df)
    df = df[complete_cases(df),:]
    colnames = names(df)
    _column(name) = [get(x) for x in df[name]]
    ## relying on https://github.com/JuliaStats/DataFrames.jl/issues/1119
    ## in the spirit of https://xkcd.com/1172/
    DataFrame(map(_column, colnames), colnames)
end

#3

Why do you think that is a horrible hack? StatPlots has always assumed that DataFrames have complete cases (the alternative might be to convert NAs/Nulls to NaN when the eltype can be promoted to Float64), and in fact your suggestion is pretty much what @nalimilan recommends here: https://github.com/JuliaStats/DataFrames.jl/issues/1148
EDIT: I see now. It is ‘horrible’ because you rely on this DataFrame constructor that avoids Nullables: https://github.com/JuliaStats/DataFrames.jl/blob/bf0bda80a1c4e24f0fdc0a547883ea6e285de53d/src/dataframe/dataframe.jl#L51


#4

I guess another issue with this approach is that some more or less irrelevant column, one that is not used in the plot, could have plenty of missing values. You don’t want to lose data because of that, but you also don’t want to think about which columns are involved every time you make a plot.
At least as far as StatPlots is concerned, it seems like one could do better. The macro that replaces a dataframe and a list of symbols with the respective columns could also discard all rows where at least one of the involved columns has a NULL, and replace each symbol with the respective column converted to an array.
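
A minimal sketch of that row-filtering idea, assuming Julia 0.5/0.6, where Nullable, isnull and get live in Base. The helper name plot_columns is hypothetical (it is not part of StatPlots or DataFrames), and a Dict of column vectors stands in for the DataFrame so the example is self-contained; it only relies on df[name] returning a column whose elements are Nullables.

```julia
# Assumes Julia 0.5/0.6 (Nullable, isnull, get are in Base).
# `plot_columns` is a hypothetical helper, not an existing StatPlots function.

"""
Return plain Vectors for the columns listed in `colnames`, keeping only the
rows in which every one of those columns is non-null. Columns that are not
listed are never inspected, so their missing values cost no rows.
"""
function plot_columns(df, colnames::Vector{Symbol})
    cols = [df[name] for name in colnames]
    nrows = length(cols[1])
    # A row is kept only if none of the involved columns is null there.
    keep = [!any(isnull(col[i]) for col in cols) for i in 1:nrows]
    # Unwrap the surviving entries; get cannot fail because nulls were dropped.
    return [[get(col[i]) for i in 1:nrows if keep[i]] for col in cols]
end

# Stand-in for a DataFrame: :z is entirely null but is not used in the plot.
df = Dict(:x => [Nullable(1.0), Nullable(2.0), Nullable(3.0)],
          :y => [Nullable(4.0), Nullable{Float64}(), Nullable(6.0)],
          :z => [Nullable{Float64}(), Nullable{Float64}(), Nullable{Float64}()])

x, y = plot_columns(df, [:x, :y])
```

Here only the second row is dropped, because :y is null there; calling complete_cases on the whole table would have discarded every row because of :z.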

The whole groupapply stuff is probably a bit trickier, but I’m waiting for this DataFrames situation to stabilize to understand what needs to be done. There too, a simple solution would be to add a few lines of code at the beginning that discard rows where a relevant column has a NULL. My only worry is that, even when starting with a dataframe of arrays, some operations (say by, subselecting the DataFrame, or adding columns) could start producing NullableArrays again. To be honest though, I really hope that there will be a way to simply avoid dealing with NullableArrays when you don’t have missing data.


#5

Yes, exactly, my suggestion ATM is to use that recipe to discard all non-complete cases of the variables referenced inside the call, and then replace the variables with normal arrays. The groupapply recipe should be able to use the same function to handle the df.


#6

I have a couple of things to say:

  • For StatPlots, the proper solution is likely to have a “type recipe” for NullableArray, which does a consistent conversion of some sort to a type that is plottable. Then a DataFrame can be decomposed into a bunch of NullableArrays, and the type recipe can handle further conversions.
  • Removing rows that contain nulls is likely not ideal. There are plenty of times when missing data can (and should) have an effect on the visualization. The “correct” way to deal with this is likely to map nulls to NaN for numeric data, and to add native handling of Nullables when processing discrete inputs within Plots (which internally would add NaN to the numeric mapping discrete --> continuous).
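
The numeric half of that mapping can be sketched as follows, assuming Julia 0.5/0.6 (Nullable, isnull and get in Base) and a column whose elements are Nullables of some real number type. The name to_plottable is hypothetical; this only shows the conversion a type recipe might perform, not how such a recipe is registered with Plots.

```julia
# Assumes Julia 0.5/0.6 (Nullable, isnull, get are in Base).
# `to_plottable` is a hypothetical name, not an existing Plots/StatPlots function.

"""
Convert a collection of Nullable numbers to a Vector{Float64}, mapping each
null to NaN so that the gap stays visible to the plotting code instead of
the row being dropped.
"""
to_plottable(v) = Float64[isnull(x) ? NaN : Float64(get(x)) for x in v]

# An integer column with one missing entry.
v = [Nullable(1), Nullable{Int}(), Nullable(3)]
yvals = to_plottable(v)
```

The result has the same length as the input, so it stays aligned with the other columns; the discrete case would still need separate handling inside Plots.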

Of course, getting the no-null case working should be the first priority, so any hack is fine.


#7

  • Why not just update the current type recipe for DataFrames, replacing collect(df[:x]) calls with get calls (plus the other handling described here)? Then there is no need for the ‘horrible’ hack of rebuilding the DataFrame.
  • Yes, exactly, that is what I meant by converting to NaN for types that can be promoted to Float64. I probably didn’t make it clear enough, nor did I realize that discrete types could be replaced by NaN inside Plots.

#8

@Tamas_Papp please check out and comment on the Nullable_DataFrames branch of StatPlots. It should work for all cases except when the DataFrame column passed to plot is not numeric AND contains Nulls.


#9

EDIT: missed the latest commit. Now it works fine.


#10

Cool, that is the current recommendation for plotting DataFrames with Nullables with Plots, then.