Dealing with different concepts of "missingness"

danielw2904 · February 10, 2021, 4:31pm

I’ve come across different types of missingness that occur when loading and preprocessing data.

NaN
Nothing
Missing

e.g.

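using DataFrames, Statistics, JSON3

df = groupby(DataFrame(g = [1,1,2], v = [1,2,3]), :g) |>
          x -> combine(x, :v => var)
isnan(df.v_var[2])            # true; note that df.v_var[2] == NaN is false
json = JSON3.read("""{"Nothing":null}""")
isnothing(json["Nothing"])    # true
ismissing(1 + missing)        # true; 1 + missing == missing itself returns missing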

This causes problems when processing the data: for example, Arrow.jl does not play nice with the Nothing type, and coalesce does not work with NaN. I wonder how you are dealing with these different types?

Thanks

pdeffebach · February 10, 2021, 4:41pm

Don’t use nothing with data. nothing is for functions that don’t return anything, or for undefined returns, e.g. failed regex matches.
Use missing with data. missing means, “here is a value, we just don’t know what it is”
NaN is a Float64, it’s a number. It’s not quite right to say you “don’t know what it is”, because it’s really the result of some kind of “bad” numerical calculation, e.g. 0/0.
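
A small sketch of how the three values behave in Base Julia may make the distinction concrete. The variable names are only for illustration.

```julia
# NaN is an ordinary Float64 produced by an undefined floating-point operation
x = 0 / 0
nan_detected = isnan(x)          # true
nan_equals_itself = x == x       # false: NaN never compares equal, so test with isnan

# missing propagates through arithmetic and is detected with ismissing
m = 1 + missing
missing_detected = ismissing(m)  # true

# nothing signals "no result", for example a regex that fails to match
r = match(r"\d+", "abc")
nothing_detected = isnothing(r)  # true
```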

coalesce is just for missings. The function something is for nothing values. Looking at it now this is not the best naming scheme, I guess.
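
As a quick illustration of that split, here is a sketch using only Base functions. Each function skips its own sentinel and passes the other one through unchanged.

```julia
# coalesce returns the first argument that is not `missing`
a = coalesce(missing, 1)       # 1

# something returns the first argument that is not `nothing`
b = something(nothing, 2)      # 2

# Neither function treats the other sentinel specially
c = coalesce(nothing, 1)       # nothing
d = something(missing, 2)      # missing

# Broadcasting coalesce fills the missings in a vector
filled = coalesce.([1, missing, 3], 0)
```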

I don’t think there exists an equivalent function for NaN.
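
If such a function is wanted, one possible sketch is below. The name nancoalesce is hypothetical and not part of Base; it simply mirrors how coalesce treats missing. For plain vectors, replace also works, because it matches values with isequal and isequal(NaN, NaN) is true.

```julia
# Hypothetical helper, not part of Base: return the first argument that is not NaN.
# With no non-NaN argument left, fall back to NaN (as coalesce() falls back to missing).
nancoalesce() = NaN
nancoalesce(x, rest...) = (x isa AbstractFloat && isnan(x)) ? nancoalesce(rest...) : x

v = [1.0, NaN, 3.0]
filled = nancoalesce.(v, 0.0)         # NaN entries replaced by 0.0

# Base alternative for whole collections
replaced = replace(v, NaN => 0.0)
```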

danielw2904 · February 10, 2021, 4:59pm

Thanks for the clarification! I’ve been using replace quite a lot since it seems to work with all of them. But it’s kind of annoying to deal with all of them depending on the package/function used. I already wrote a small function that deals with nothing, but I guess I have to add NaN.

pdeffebach · February 10, 2021, 5:03pm

Just so you know, with that function you wrote, you aren’t actually avoiding any copying.

You could just do

df[!, c] = replace(df[!, c], nothing => missing)

danielw2904 · February 10, 2021, 5:12pm

Good to know! I thought that this was an in-place operation. Is there a way to do it without copying or is your suggestion the way?

pdeffebach · February 10, 2021, 5:16pm

No, there isn’t a way to change the type of a vector to allow missings without copying. Maybe in the future that could be an optimization, but your convert call is where the copying happens, I think.
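
A minimal sketch of that point with plain vectors instead of a data frame column: widening the element type yields a new array, while converting to the type the vector already has returns the same object.

```julia
v = [1, 2, 3]                                  # Vector{Int}, cannot hold missing

# Widening the element type allocates a new array
w = convert(Vector{Union{Missing, Int}}, v)
w[1] = missing                                 # allowed in w, v is unaffected

# Converting to the existing type is a no-op that returns the same array
same = convert(Vector{Int}, v) === v
```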

lostella · February 10, 2021, 5:31pm

There could be cases where fields are optional, and nothing should be used, right? “I know that there is no value here”

pdeffebach · February 10, 2021, 5:39pm

Yes. But in the context of data I’m not sure that’s a very common scenario to be in. You never see NULL when working with data frames in R, for example.

danielw2904 · February 10, 2021, 6:42pm

I would agree. Coming from R, I have never seen NULL used in tabular data. But maybe my use case is specific in that I think in terms of a model matrix, so it does not really matter why a value is missing. In any case, the row will be removed or I would like to replace it with some value.

danielw2904 · February 10, 2021, 6:54pm

The problem is that one still has to check whether there are nothings in the column; otherwise missings are allowed implicitly, even when there was nothing to replace:

julia> typeof(replace([1,2,3], nothing => missing))
Array{Union{Missing, Int64},1}

pdeffebach · February 10, 2021, 6:57pm

Yeah, you still need that check.
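
One way to write that check is to test whether the element type can hold nothing at all before calling replace. This is only a sketch; the helper name nothing_to_missing is made up for illustration.

```julia
# Hypothetical helper: convert `nothing` to `missing` only when the element
# type could actually contain `nothing`; otherwise return the input untouched.
function nothing_to_missing(v::AbstractVector)
    Nothing <: eltype(v) || return v
    return replace(v, nothing => missing)
end

plain = [1, 2, 3]
unchanged = nothing_to_missing(plain)              # same object, still Vector{Int}
converted = nothing_to_missing([1, nothing, 3])    # `nothing` becomes `missing`
```

For a data frame column, the same idea would be applied as df[!, c] = nothing_to_missing(df[!, c]).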

danielw2904 · February 10, 2021, 7:00pm

My lazy version of that is

if eltype(df[!, col]) isa Union     
...
end

The more involved version to read the types of a union into a vector

function readtypes(U::Union, types = DataType[])
    push!(types, getfield(U, :a))
    if isa(getfield(U, :b), DataType)
        push!(types, getfield(U, :b))
        return types
    else
        readtypes(U.b, types)
    end
end
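
As an aside, Base has an unexported helper, Base.uniontypes, that appears to do the same unpacking. Since it is internal, it may change between Julia versions, so treat this as a sketch. For the simple question "can this column hold nothing?", a subtype check is enough.

```julia
T = Union{Missing, Nothing, Int}

# Internal, unexported helper: lists the component types of a Union
members = Base.uniontypes(T)

# Subtype checks answer the common questions without unpacking the Union
has_nothing = Nothing <: T
has_missing = Missing <: T
has_string = String <: T
```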

Nathan_Boyer · February 10, 2021, 7:07pm

I asked a similar question on Stack Overflow that received some helpful answers you may find useful.

JeffreySarnoff · February 10, 2021, 7:10pm

If your goal is just to strip away any missing, nothing, and NaN entries …

notmissing = Base.Fix2(!==, missing)
notnothing = Base.Fix2(!==, nothing)
notnan(x) = true
notnan(x::Base.IEEEFloat) = !isnan(x)
isavailable(x) = notmissing(x) && notnothing(x) && notnan(x)
clean(data) = filter(isavailable, data)

If you want the indices of each missing, nothing, and NaN

function unavailable(data)
  idxs = Int[]  # concretely typed instead of Vector{Any}
  for (i,x) in enumerate(data)
    !isavailable(x) && push!(idxs, i)
  end
  return idxs
end
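
A shorter variant of the index lookup, sketched here with findall and a negated predicate. The predicate is redefined in compact form so the snippet runs on its own; the sample data is invented.

```julia
# Compact restatement of the predicate above, so this snippet is self-contained
isavailable(x) = !(x === missing || x === nothing || (x isa AbstractFloat && isnan(x)))

data = [1.0, missing, NaN, nothing, 2.5]   # invented sample data

bad_idxs = findall(!isavailable, data)     # indices of missing, NaN and nothing
good = filter(isavailable, data)           # the remaining usable values
```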
