Which should I use, nothing or NA for DataFrames?


#1

I’m using Julia 1.0.3 and have loaded a CSV file, but found that some fields are not of the appropriate type.

Therefore, I’ve changed some columns of integer data type into strings, for example:

map(x -> ismissing(x) ? NA : convert(String, x), df[:Column1])

But when I tried to parse strings into Float64 and, for the sake of consistency, change the default nothing into NA using

map(x-> (v = tryparse(Float64,x); v == nothing ? NA : v), csv[:recency])

I get the error UndefVarError: NA not defined

However, if I stick to nothing, I feel uncomfortable knowing that my column is of type Array{Union{Nothing, Float64},1}, a mix of 2 data types. I fear that the mixture of data types may lead to issues further down in my programme. At the same time, I am unable to change nothing to NA.

Any advice?


#2

Neither, use missing:
https://docs.julialang.org/en/v1/manual/missing/

The fear that a mixture of data types will cause issues is unwarranted: small unions such as Union{Missing, Float64} are now supported efficiently.
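A minimal sketch of what the parsing step from the question might look like with missing instead of NA. The vector raw is a hypothetical stand-in for a CSV column and parse_or_missing is a helper name made up for illustration; only Base functions are used.

```julia
# Hypothetical stand-in for a column of strings read from a CSV file.
raw = ["1.5", "abc", missing, "42"]

# tryparse returns nothing when parsing fails; something() picks the first
# argument that is not nothing, so a failed parse becomes missing.
# Entries that are already missing are passed through unchanged.
parse_or_missing(x) = ismissing(x) ? missing : something(tryparse(Float64, x), missing)

parsed = map(parse_or_missing, raw)

# skipmissing lets reductions ignore the missing entries.
total = sum(skipmissing(parsed))
```

The resulting vector has element type Union{Missing, Float64}, which is exactly the kind of small union referred to above.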


#3

There is also a very nice blog post describing the reasoning behind missing in Julia: https://julialang.org/blog/2018/06/missing


#4

Thanks @Tamas_Papp, for directly answering the question and also addressing the concern about data type incompatibility.


#5

Thanks ValdarT, this is indeed a good summary. If anyone’s interested, some of the key points are that:

1. missing is analogous to NULL in SQL and NA in R
2. missing is similar to its predecessor NA (in Julia)
3. it makes it easy to generate SQL requests in Julia and interoperate with R
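To illustrate the NULL-like behaviour mentioned in the first point, here is a short sketch using only Base functions; the variable names are arbitrary.

```julia
# Arithmetic and comparisons with missing propagate, much like NULL in SQL.
a = missing + 1                 # missing
b = missing == 1                # missing, not false

# Three-valued logic: the result is known when the other operand decides it.
c = true | missing              # true
d = false & missing             # false

# coalesce returns the first non-missing argument, like SQL COALESCE.
e = coalesce(missing, 0.0)      # 0.0

# Use ismissing or isequal to test for missing; == would itself return missing.
f = isequal(missing, missing)   # true
```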