Plotting options for DataFrames with NullableArray

Tamas_Papp · January 19, 2017, 10:18am

I have been transitioning to Nullables in my data analysis code, then belatedly realized that neither Gadfly nor Plots.jl works with the current master of DataFrames. Is there a plotting package that is usable at the moment?
https://github.com/GiovineItalia/Gadfly.jl/issues/909
https://github.com/JuliaPlots/StatPlots.jl/issues/33

Tamas_Papp · January 19, 2017, 10:48am

Here is a workaround:

"""
Remove missing values from a dataframe, and make sure that all the
columns are plain vanilla Vectors, not NullableVectors.

Horrible hack, until plotting libraries catch up.
"""
function sanitize_df(df)
    df = df[complete_cases(df),:]
    colnames = names(df)
    _column(name) = [get(x) for x in df[name]]
    ## relying on https://github.com/JuliaStats/DataFrames.jl/issues/1119
    ## in the spirit of https://xkcd.com/1172/
    DataFrame(map(_column, colnames), colnames)
end

mkborregaard · January 19, 2017, 4:09pm

Why do you think that is a horrible hack? StatPlots has always assumed that DataFrames have complete cases (the alternative might be to convert NAs/Nulls to NaN when the eltype can be promoted to Float64), and in fact your suggestion is pretty much what is recommended by @nalimilan here: https://github.com/JuliaStats/DataFrames.jl/issues/1148
EDIT: I see now. It is ‘horrible’ because you rely on this DataFrame constructor that avoids Nullables https://github.com/JuliaStats/DataFrames.jl/blob/bf0bda80a1c4e24f0fdc0a547883ea6e285de53d/src/dataframe/dataframe.jl#L51

piever · January 19, 2017, 4:44pm

I guess another issue with this approach is that some more or less irrelevant column, which is not used in the plot, could have plenty of missing values. You don’t want to lose data because of that but also you don’t want to think about which columns are involved every time you do a plot.
At least as far as StatPlots is concerned, it seems like one could do better. The macro that replaces a dataframe and a list of symbols with the respective columns could also discard all rows where at least one of the involved columns has a null, and replace each symbol with the respective column converted to an array.

The whole groupapply stuff is probably a bit trickier, but I’m waiting for this DataFrames situation to stabilize to understand what needs to be done. There too, a simple solution would be to add a few lines of code at the beginning that discard rows where one of the relevant columns has a null. My only worry is that, even when starting with a dataframe of arrays, some operations (say the by function, subselecting the DataFrame, or adding columns) could start producing NullableArrays again. To be honest though, I really hope that there will be a way to simply avoid dealing with NullableArrays when you don’t have missing data.
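The row filtering described above can be sketched without DataFrames at all. This is only an illustration, assuming Julia 0.5/0.6, where Nullable, isnull and get are part of Base; complete_rows and plain_columns are hypothetical helper names, not functions from StatPlots or DataFrames.

```julia
# Assumes Julia 0.5/0.6 (Nullable, isnull and get are in Base).
# `complete_rows` and `plain_columns` are hypothetical names for illustration.

# Indices of the rows in which none of the given columns holds a null.
function complete_rows(cols...)
    n = length(cols[1])
    [i for i in 1:n if !any(c -> isnull(c[i]), cols)]
end

# Unwrap the selected rows of each column into a plain Vector.
plain_columns(rows, cols...) = [[get(c[i]) for i in rows] for c in cols]

x = [Nullable(1.0), Nullable{Float64}(), Nullable(3.0)]
y = [Nullable(10.0), Nullable(20.0), Nullable(30.0)]
z = [Nullable{Float64}(), Nullable{Float64}(), Nullable(0.5)]  # not used in the plot

# Only x and y are checked, so the nulls in z do not cost any rows.
rows = complete_rows(x, y)
xs, ys = plain_columns(rows, x, y)
```

Only the columns that are actually passed in decide which rows are dropped, so an unrelated column full of nulls (z here) has no influence on the result.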

mkborregaard · January 19, 2017, 4:50pm

Yes, exactly, my suggestion ATM is to use that recipe to discard all non-complete cases of those variables that are referenced inside the call and then replace the variables with normal arrays. The group_apply recipe should be able to use the same function to handle the df.

tbreloff · January 19, 2017, 5:06pm

I have a couple of things to say:

For StatPlots, the proper solution is likely to have a “type recipe” for NullableArray, which does a consistent conversion of some sort to a type that is plottable. Then a DataFrame can be decomposed into a bunch of NullableArrays, and the type recipe can handle further conversions.
Removing rows that contain nulls is likely not ideal. There are plenty of times when missing data can (and should) have an effect on the visualization. The “correct” way to deal with this is likely to map to NaN for numeric data, and to add native handling of Nullables when processing discrete inputs within Plots (which internally would add NaN to the numeric mapping discrete -> continuous).

Of course, getting the no-null case working should be the first priority, so any hack is fine.
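For numeric data, the mapping to NaN could look roughly like the following. This is a minimal sketch, assuming Julia 0.5/0.6 with Nullable in Base; nulls_to_nan is a hypothetical name and not an existing Plots or StatPlots function, and the sketch says nothing about how a type recipe would call it.

```julia
# Assumes Julia 0.5/0.6 (Nullable, isnull and get are in Base).
# `nulls_to_nan` is a hypothetical helper, not part of Plots or StatPlots.

# Convert a vector of numeric Nullables to a plain Vector{Float64},
# putting NaN wherever the value is null.
nulls_to_nan(col) = Float64[isnull(x) ? NaN : Float64(get(x)) for x in col]

col = [Nullable(1), Nullable{Int}(), Nullable(3)]
v = nulls_to_nan(col)
```

The null entry stays in its position as NaN, so the row is kept instead of being dropped. Integer columns are promoted to Float64, since NaN has no integer representation.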

mkborregaard · January 19, 2017, 6:29pm

Why not just update the current type recipe for DataFrames, replacing the collect(df[:x]) calls with get calls, along with the other handling described here? Then there is no need for the ‘horrible’ hack of rebuilding the DataFrame.
Yes, exactly, that is what I meant by converting to NaN for types that can be promoted to Float64. I probably didn’t make it clear enough, and I hadn’t realized that discrete types could be replaced by NaN inside Plots.

mkborregaard · January 19, 2017, 10:20pm

@Tamas_Papp please check out and comment on the Nullable_DataFrames branch of StatPlots. It should work for all cases except when the DataFrame column passed to plot is not numeric AND contains nulls.

Tamas_Papp · January 20, 2017, 11:52am

EDIT: missed the latest commit. Now it works fine.

mkborregaard · January 20, 2017, 8:08pm

Cool, that is the current recommendation for plotting DataFrames with Nullables with Plots, then.
