Dealing with different concepts of "missingness"

I’ve come across different types of missingness that occur when loading and preprocessing data.

  1. NaN
  2. Nothing
  3. Missing

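For example:

using DataFrames, Statistics, JSON3

df = groupby(DataFrame(g = [1,1,2], v = [1,2,3]), :g) |>
          x -> combine(x, :v => var)
isnan(df.v_var[2])          # true; NaN == NaN is false, so use isnan
json = JSON3.read("""{"Nothing":null}""")
isnothing(json["Nothing"])  # true
ismissing(1 + missing)      # true; comparing with == would itself return missing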

This causes some problems when processing the data: for example, Arrow.jl does not play nicely with the Nothing type, and coalesce does not work with NaN. I wonder how you are dealing with these different types?

Thanks

  1. Don’t use nothing with data. nothing is for functions that don’t return anything, or for undefined returns, e.g. failed regex matches.
  2. Use missing with data. missing means “there is a value here, we just don’t know what it is”.
  3. NaN is a floating-point number (a Float64 by default). It’s not quite right to say you “don’t know what it is”, because it’s really the result of some kind of “bad” numerical calculation, e.g. 0/0.
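
To make the distinction concrete, here is a minimal sketch using only Base functions. The variable names m, x, and y are just for illustration.

```julia
# nothing: a function had no result to return (here, a failed regex match)
m = match(r"\d+", "abc")
isnothing(m)              # true

# missing: a value exists but is unknown; it propagates through operations
x = 1 + missing
ismissing(x)              # true
ismissing(missing == 1)   # true: comparisons propagate as well

# NaN: an ordinary float produced by an undefined calculation
y = 0 / 0
isnan(y)                  # true
y isa Float64             # true
NaN == NaN                # false, so test with isnan rather than ==
```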

coalesce is just for missings. The function something is for nothing values. Looking at it now this is not the best naming scheme, I guess.

I don’t think there exists an equivalent function for NaN.
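
As a rough sketch of how the two functions differ, and of one possible workaround for NaN: the helper nan_to_missing below is a hypothetical name, not something Base provides. It turns NaN into missing first so that coalesce can take over.

```julia
# coalesce returns the first argument that is not `missing`
a = coalesce(missing, 1)      # 1
# something returns the first argument that is not `nothing`
b = something(nothing, 1)     # 1

# Hypothetical helper (not part of Base): map NaN to missing, pass anything else through
nan_to_missing(x) = x isa AbstractFloat && isnan(x) ? missing : x

v = [1.0, NaN, 3.0]
cleaned = coalesce.(nan_to_missing.(v), 0.0)   # [1.0, 0.0, 3.0]
```

If you only ever need a fixed replacement, replace(v, NaN => 0.0) also works, since replace matches values with isequal and isequal(NaN, NaN) is true.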

Thanks for the clarification! I’ve been using replace quite a lot since it seems to work with all of them. But it’s kind of annoying to have to handle each of them differently depending on the package or function used. I already wrote a small function that deals with nothing, but I guess I have to add NaN as well.

Just so you know, with that function you wrote, you aren’t actually avoiding any copying.

You could just do

df[!, c] = replace(df[!, c], nothing => missing)

Good to know! I thought that this was an in-place operation. Is there a way to do it without copying or is your suggestion the way?

No, there isn’t a way to change the element type of a vector to allow missings without copying. Maybe in the future that could be an optimization, but your convert call is where the copying happens, I think.
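
A small sketch of that point, using a plain vector instead of a data frame column (the names v, w, and u are illustrative). replace returns a new array whose element type admits missing. The in-place replace! only works when the element type already allows missing.

```julia
v = Union{Nothing, Int}[1, nothing, 3]

# replace allocates a new vector; v itself is left untouched
w = replace(v, nothing => missing)
w === v                  # false
Missing <: eltype(w)     # true
Missing <: eltype(v)     # false: v cannot hold missing, so it cannot be updated in place

# If the element type already allows missing, replace! can work in place
u = Union{Missing, Nothing, Int}[1, nothing, 3]
replace!(u, nothing => missing)
```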

There could be cases where fields are optional, and nothing should be used, right? “I know that there is no value here”

Yes. But in the context of data I’m not sure that’s a very common scenario to be in. You never see NULL when working with data frames in R, for example.

I would agree. Coming from R, I have never seen NULL used in tabular data. But maybe my use case is specific in that I think in terms of a model matrix, so it does not really matter why a value is missing. In any case, either the row will be removed or I would like to replace the value with something else.

The problem is that one still has to check whether there are nothings in the column; otherwise missings are allowed implicitly, even when there was nothing to replace:

julia> typeof(replace([1,2,3], nothing => missing))
Array{Union{Missing, Int64},1}

Yeah, you still need that check.

My lazy version of that is

if eltype(df[!, col]) isa Union     
...
end

The more involved version to read the types of a union into a vector

function readtypes(U::Union, types = DataType[])
    push!(types, getfield(U, :a))
    if isa(getfield(U, :b), DataType)
        push!(types, getfield(U, :b))
        return types
    else
        readtypes(U.b, types)
    end
end
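
As an alternative sketch, the same questions can be asked of the element type directly with subtype checks. Note that Base.uniontypes is an unexported internal function, so treat its use here as an assumption that may not hold across Julia versions. The column col is made up for the example.

```julia
col = Union{Missing, Nothing, Float64}[1.0, missing, nothing]
T = eltype(col)

T isa Union          # true, but also true for e.g. Union{Int, Float64}
Nothing <: T         # true: the column may contain nothing
Missing <: T         # true: the column may contain missing

# Internal helper that lists the members of a Union; order is not guaranteed
members = Base.uniontypes(T)
```

The subtype checks are more specific than T isa Union, which only says that the element type is some union.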

I asked a similar question on Stack Overflow that received some helpful answers you may find useful.

If your goal is just to strip away any missing, nothing, and NaN entries …

notmissing = Base.Fix2(!==, missing)
notnothing = Base.Fix2(!==, nothing)
notnan(x) = true
notnan(x::Base.IEEEFloat) = !isnan(x)
isavailable(x) = notmissing(x) && notnothing(x) && notnan(x)
clean(data) = filter(isavailable, data)

If you want the indices of each missing, nothing, and NaN

function unavailable(data)
  idxs = Int[]  # typed vector rather than Vector{Any}
  for (i,x) in enumerate(data)
    !isavailable(x) && push!(idxs, i)
  end
  return idxs
end
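
A more compact variant, sketched under the assumption that a single predicate is enough: is_usable is a hypothetical name that restates the same check in one line so the example is self-contained, and findall replaces the explicit loop.

```julia
# Hypothetical one-line predicate: false for missing, nothing, and floating-point NaN
is_usable(x) = !(ismissing(x) || isnothing(x) || (x isa AbstractFloat && isnan(x)))

data = Any[1.0, missing, NaN, nothing, 5]

bad = findall(!is_usable, data)    # indices of the unusable entries: [2, 3, 4]
kept = filter(is_usable, data)     # entries that remain: 1.0 and 5
```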