I’ve come across different types of missingness that occur when loading and preprocessing data:
NaN
Nothing
Missing
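e.g.
using DataFrames, Statistics, JSON3
# the variance of a single-element group is NaN
df = groupby(DataFrame(g = [1,1,2], v = [1,2,3]), :g) |>
x -> combine(x, :v => var)
isnan(df.v_var[2])  # NaN == NaN is false, so test with isnan
json = JSON3.read("""{"Nothing":null}""")
isnothing(json["Nothing"])
ismissing(1 + missing)  # `1 + missing == missing` evaluates to missing, not true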
This causes problems when processing the data, e.g. Arrow.jl does not play nicely with the Nothing type, and coalesce does not work with NaN. I wonder how you are dealing with these different types?
Thanks
- Don’t use nothing with data. nothing is for functions that don’t return anything, or for undefined returns, e.g. failed regex matches.
- Use missing with data. missing means “here is a value, we just don’t know what it is”.
- NaN is a Float64, it’s a number. It’s not quite right to say you “don’t know what it is”, because it’s really the result of some kind of “bad” numerical calculation, e.g. 0/0.
coalesce is just for missings. The function something is for nothing values. Looking at it now, this is not the best naming scheme, I guess.
I don’t think there exists an equivalent function for NaN.
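To make the distinction concrete, here is a minimal sketch using only Base functions. The helper `nan_to_missing` is a hypothetical name introduced for illustration, not something from Base or a package; it is just one way to fold NaN into missing so that coalesce can handle it.

```julia
# coalesce returns the first argument that is not `missing`
a = coalesce(missing, 1)

# something returns the first argument that is not `nothing`
b = something(nothing, 2)

# coalesce leaves NaN alone, because NaN is an ordinary floating-point value
c = coalesce(NaN, 0.0)

# Hypothetical helper (not part of Base): map NaN to missing,
# pass every other value through unchanged
nan_to_missing(x) = x
nan_to_missing(x::AbstractFloat) = isnan(x) ? missing : x

# Once NaN has become missing, coalesce can supply a default
d = coalesce(nan_to_missing(NaN), 0.0)

# Broadcasting over a vector widens the element type to allow missing
v = nan_to_missing.([1.0, NaN, 3.0])
```

Whether NaN should be treated as missing at all depends on the data; the sketch only shows that the conversion has to be explicit.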
Thanks for the clarification! I’ve been using replace quite a lot since it seems to work with all of them. But it’s kind of annoying to deal with all of them depending on the package/function used. I already wrote a small function that deals with nothing, but I guess I have to add NaN.
Just so you know, with that function you wrote, you aren’t actually avoiding any copying.
You could just do
df[!, c] = replace(df[!, c], nothing => missing)
Good to know! I thought that this was an in-place operation. Is there a way to do it without copying or is your suggestion the way?
No, there isn’t a way to change the type of a vector to allow missings without copying. Maybe in the future that could be an optimization, but your convert call is where the copying happens, I think.
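A small sketch of the copying behaviour, using plain vectors rather than a data frame column (the variable names are made up for illustration). Widening the element type produces a new vector, whereas replace! can work in place only when the element type already allows missing.

```julia
v = [1, 2, 3]                                  # Vector{Int}, cannot hold missing

# Converting to a wider element type allocates a new vector
w = convert(Vector{Union{Missing, Int}}, v)
w[1] = 0                                       # modifies w only; v is untouched

# If the element type already admits both nothing and missing,
# replace! can swap the values in place without changing the type
u = Union{Nothing, Missing, Int}[1, nothing, 3]
replace!(u, nothing => missing)
```

The in-place variant leaves Nothing in the element type, so a later narrowing step would still copy.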
There could be cases where fields are optional, and nothing should be used, right? “I know that there is no value here”
Yes. But in the context of data I’m not sure that’s a very common scenario to be in. You never see NULL when working with data frames in R, for example.
I would agree. Coming from R, I have never seen NULL used in tabular data. But maybe my use case is specific in that I think in terms of a model matrix, such that it does not really matter why a value is missing. In any case the row will be removed or I would like to replace it with some value.
The problem is that one still has to check whether there are nothings in the column, otherwise missings are allowed implicitly:
julia> typeof(replace([1,2,3], nothing => missing))
Array{Union{Missing, Int64},1}
Yeah, you still need that check.
My lazy version of that is
if eltype(df[!, col]) isa Union
...
end
The more involved version to read the types of a union into a vector
function readtypes(U::Union, types = DataType[])
    push!(types, getfield(U, :a))
    if isa(getfield(U, :b), DataType)
        push!(types, getfield(U, :b))
        return types
    else
        readtypes(U.b, types)
    end
end
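If the only question is whether a column can contain nothing, a subtype check on the element type may be enough. The sketch below uses plain vectors; `has_nothing_type` and `nothing_to_missing` are hypothetical helper names, not functions from DataFrames.jl or Base.

```julia
a = [1, 2, 3]
b = Union{Nothing, Int}[1, nothing, 3]

# True only when the element type admits `nothing`
# (this includes columns with element type Any)
has_nothing_type(col) = Nothing <: eltype(col)

# Hypothetical helper: leave the column untouched unless it can hold nothing,
# so that missing is not allowed implicitly on clean columns
function nothing_to_missing(col)
    has_nothing_type(col) || return col
    return replace(col, nothing => missing)
end

a2 = nothing_to_missing(a)   # same vector, element type unchanged
b2 = nothing_to_missing(b)   # new vector with missing in place of nothing
```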
I asked a similar question on Stack Overflow that received some helpful answers you may find useful.
If your goal is just to strip away any missing, nothing, and NaN entries …
notmissing = Base.Fix2(!==, missing)
notnothing = Base.Fix2(!==, nothing)
notnan(x) = true
notnan(x::Base.IEEEFloat) = !isnan(x)
isavailable(x) = notmissing(x) && notnothing(x) && notnan(x)
clean(data) = filter(isavailable, data)
If you want the indices of each missing, nothing, and NaN:
function unavailable(data)
    idxs = Int[]  # typed vector instead of Vector{Any}
    for (i, x) in enumerate(data)
        !isavailable(x) && push!(idxs, i)
    end
    return idxs
end
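A more compact variant of the same idea could use findall and filter from Base with a single predicate. The name `isunavailable` is made up for this sketch, and it assumes Julia 1.1 or later for isnothing.

```julia
# Hypothetical predicate: true for missing, nothing, and floating-point NaN
isunavailable(x) = ismissing(x) || isnothing(x) || (x isa AbstractFloat && isnan(x))

data = [1.0, missing, NaN, nothing, 5.0]

# Indices of the unavailable entries
bad = findall(isunavailable, data)

# Entries that remain after dropping them (the element type is not narrowed)
good = filter(!isunavailable, data)
```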