DataFrames for non-null data


#1

Is there a “DataFrames” for data that you know will not be null? I have used both DataTables and DataFrames and like their syntactic sugar, but I find myself fussing with nullable arrays or Missing/missing issues too much.

I am familiar with the dropna functionality, but is there a way to set a flag once (on data import, for instance) and move on?


#2

It isn’t clear to me what you mean. For example, each column of a DataFrame has its own type, which is essentially Vector{T}. If a column contains only floating-point numbers, its type will be Vector{Float64}. If it also contains missing values, the type will be Vector{Union{Float64, Missing}}. If instead of missing it contains NA, the type will be DataVector{Float64}, which is very similar to the previous case. Note that this is all at the column level, not the DataFrame level. Additionally, the element type is generally inferred from the data, so if no missing or NA values ever appear in the data or in your code, the columns should be correctly typed anyway.

Are you just asking how to convert a column without missing values to a concretely typed object (e.g., Vector{Union{Float64, Missing}} to Vector{Float64})?
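If that is the question, one approach is a minimal sketch like the following, assuming a DataFrames version that provides `disallowmissing` (it originates in Missings.jl) and the `df.x` column accessor:

```julia
using DataFrames

# A column typed Union{Float64, Missing} even though it holds no missing values
df = DataFrame(x = Vector{Union{Float64, Missing}}([1.0, 2.0, 3.0]))

# disallowmissing returns a concretely typed copy (here Vector{Float64});
# it throws an error if an actual missing value is present
df.x = disallowmissing(df.x)

eltype(df.x)  # Float64
```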


#3

If your file contains no missing data, you can set an option like nullable=false in CSV.jl when reading it.
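As a sketch (the file path is hypothetical; the nullable keyword existed in CSV.jl releases from this thread’s era and has since been removed, as current CSV.jl infers non-missing columns automatically):

```julia
using CSV, DataFrames

# Hypothetical file with no missing values
path = tempname() * ".csv"
write(path, "a,b\n1.0,2.0\n3.0,4.0\n")

# On CSV.jl versions from this thread's era the call was:
#   df = CSV.read(path, nullable=false)
# Current CSV.jl drops the keyword and infers concretely typed columns:
df = CSV.read(path, DataFrame)

eltype(df.a)  # Float64, not Union{Float64, Missing}
```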


#4

Note that DataFrames used to convert columns to DataArray automatically in versions before 0.11, so make sure you aren’t using an old version. In particular, remove DataTables, or DataFrames will remain stuck at 0.10.1. See How to upgrade from DataFrames 0.10.1 to 0.11.3?
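With the Julia 0.6-era package manager, the upgrade described in that link amounts to something like the following sketch (exact steps depend on your package state):

```julia
# Julia 0.6-era Pkg API (assumed from the linked upgrade discussion)
Pkg.rm("DataTables")      # DataTables pins DataFrames at 0.10.1
Pkg.update()              # lets DataFrames move to 0.11+
Pkg.status("DataFrames")  # confirm the installed version
```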


#5

Thanks for the great replies.

I ran into these problems after I did a package update for the first time in a few months. I will check out @nalimilan’s posted link to make sure I have DataFrames updated properly.

When I read in a file, the DataFrame uses the Union{Float64, Missing} element type even though there is no missing data. Since I observed this behavior with no missing data present, it looked (to me at least) like this was DataFrames’ new default. UPDATE: I observed this behavior with readtable, not CSV.read(). CSV.read appears to work as described by @tbeason.

When I get back to work on Monday, I will try setting the nullable=false flag in CSV.read to see whether that is enough for the parser to choose the correct type.

UPDATE 2:
I set the nullable flag in CSV.read and everything is working fine.

Thanks again everyone.