Nullables - why? and how?

Apologies for what might be a daft question, but I’ve lost touch with the developments in the Julia data wrangling ecosystem over the last year or so and found myself a little puzzled when trying to use DataFrames today.

Going through some online resources, it seemed to me that CSV.jl is the preferred I/O method for DataFrames these days, so I went ahead and did:

using CSV, DataFrames

df = CSV.read("mydata.csv")

After which I found myself with a DataFrame populated by objects of varying interesting types, including Nullable{WeakRefString{UInt8}}, Nullable{Float64}, and others.

I have since read up on Nullable.jl and WeakRefStrings.jl and have some understanding of the motivation behind these types, but I’m still asking myself why this is the default behaviour and how one is actually meant to work with it.

Reading through the latest DataFrames.jl docs and the DataFrames section of the Introducing Julia wikibook, I can’t find the Nullable (or WeakRefString) type discussed, and hence struggle to understand what the intended workflow is given that a lot of operations don’t seem to be defined on these types.

What version of DataFrames are you using? You may need to update it. Prior to v0.11 you can use readtable to get an older-style DataFrame that uses DataArrays as columns.

You may find this useful:

I’m on v0.10.1. I did run Pkg.update(), but it seems that some other packages I have installed are holding DataFrames at 0.10.1, as per the release announcement linked by Tamas.

I’ve worked out with the help of other forum posts that I can pass weakrefstrings=false and nullable=false to get “normal” data in my DataFrame after reading from CSV, but that doesn’t really answer my question. Presumably there’s a reason why these options are the default behaviour; I’m just struggling to understand the benefits at this point.

Tamas, thanks for linking the release announcement, which explains a bit of the background. It does not, however, mention Nullable (unless there’s a connection between Nullable and NA / missing which I’m missing!). Is there a reason why this isn’t discussed at all in the docs?

Nullable is the earlier attempt that has been phased out. Search the forums for history.

I would suggest that you start using v0.11.2 of DataFrames, which is much nicer. See the topic I linked for removing packages that hold it at v0.10.1.
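For context, here is a minimal sketch of how the missing-based representation behaves, as opposed to Nullable wrappers. It assumes Julia 1.0 or later, where missing lives in Base; on the Julia version current for DataFrames v0.11 it was provided by the Missings.jl package instead. The variable names are made up for illustration.

```julia
# Assumes Julia 1.0 or later, where `missing` is part of Base.

# A column with a missing entry is an ordinary Vector whose element type
# is a Union of the value type and Missing; values are not wrapped.
x = [1.5, missing, 3.0]
coltype = eltype(x)

# Arithmetic propagates missing rather than throwing an error...
raw_total = sum(x)

# ...so skip the missing entries explicitly when a number is wanted.
total = sum(skipmissing(x))

# Replace missing entries with a default value.
filled = coalesce.(x, 0.0)

# Count the missing entries.
nmissing = count(ismissing, x)
```

The practical difference from Nullable is that non-missing entries are plain Float64 values, so ordinary functions work on them without unwrapping.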

…or use readtable until you’re able to move to DataFrames 0.11 (but be careful to read the docs for the version you are actually using).

You can also use https://github.com/davidanthoff/CSVFiles.jl. It will return the “right” missing values in a DataFrame no matter which version of DataFrames.jl you are on. If you are on DataFrames v0.10, it will use the old-style DataArray for missing values; on DataFrames v0.11, it will create a DataFrame that uses the new Missings story. If you load data into some other structure, it will also give you the right missing values story for that structure (for example, DataValue in an IndexedTable).