[RFC] SentinelMissings.jl


#1

Short:

I have written a small package (30 LOC) that introduces a number type that lets you treat “sentinel values” as missings in Julia: https://github.com/meggart/SentinelMissings.jl (not yet registered)

Long:

I really like the new Julia solution to missing values, which optimizes Union{T,Missing}, but in some cases it has limitations. For example, the file formats I mostly deal with (NetCDF files or Zarr datasets) follow the convention of marking missing values with a sentinel value that is defined in the file’s attributes.

Since the element type of such an array no longer matches a C number type, one cannot simply pass a pointer to a Julia Array{Union{Missing,Float64}} to a C routine that expects a double* to write data into. In this case I really want to avoid copying the array, because the arrays in the file can be quite large and one might run out of memory.
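To make the pointer problem concrete, here is a minimal sketch. The sentinel -9999.0 is an arbitrary value chosen for illustration and is not taken from any particular file format.

```julia
# An array that uses the built-in missing representation.
xm = Union{Missing,Float64}[1.0, missing, 3.0]

# The element type is not a plain bits type, so it does not map onto a C double.
union_is_bits = isbitstype(eltype(xm))

# Assumed sentinel, for illustration only.
sentinel = -9999.0

# Replacing missings with the sentinel allocates a second array (a copy).
xc = coalesce.(xm, sentinel)

# An array that stores the sentinel directly is a plain Vector{Float64},
# and its pointer is what a C routine expecting a double* would receive.
xs = Float64[1.0, sentinel, 3.0]
p = pointer(xs)
```

The copy in the middle step is exactly what becomes expensive for large on-disk arrays.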

Other limitations I came across are that an Array{Union{Missing,Float64}} cannot be Mmapped or passed to Blosc for compression, etc.

To make dealing with this easier I have written this small package. I would like to ask whether there are already other attempts at implementing this functionality, whether you know of another package where it might fit, or whether you find this useful at all.

A typical workflow would be:

using SentinelMissings

x = [1 2 3;
  4 5 6;
  -1 -1 10]
xs = as_sentinel(x,-1)
3×3 reinterpret(SentinelMissings.SentinelMissing{Int64,-1}, ::Array{Int64,2}):
       1        2   3
       4        5   6
 missing  missing  10

Note that this does not copy the array. Operating on the reinterpreted version behaves as if
the values inside were missings, and it works quite well with Array{Union{T,Missing}} types, e.g.:

a = [8.0 4.0 missing]
xs .= xs .+ a
3×3 reinterpret(SentinelMissings.SentinelMissing{Int64,-1}, ::Array{Int64,2}):
       9        6  missing
      12        9  missing
 missing  missing  missing

while the memory is still shared with x, which could be an Mmapped array or an
Array you share with a C library.

x
3×3 Array{Int64,2}:
  9   6  -1
 12   9  -1
 -1  -1  -1
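For readers wondering how a zero-copy view like this can work, here is a minimal sketch of the general idea. It is not the package's actual code: the type name SentinelValue and the helper unwrap are hypothetical, and the promotion and conversion rules that the real package needs for arithmetic are left out.

```julia
# Hypothetical element type: same memory layout as T, with the sentinel
# stored as a type parameter rather than in the data.
struct SentinelValue{T,SV}
    x::T
end

# A value counts as missing when it equals the sentinel.
Base.ismissing(s::SentinelValue{T,SV}) where {T,SV} = s.x == SV

# Hypothetical helper: turn the wrapped value into either T or a real missing.
unwrap(s::SentinelValue) = ismissing(s) ? missing : s.x

raw = [1, 2, -1, 4]

# reinterpret creates a view over the same memory, not a copy.
rv = reinterpret(SentinelValue{Int,-1}, raw)
unwrapped = unwrap.(rv)

# Writing the sentinel into the raw array is visible through the view.
raw[2] = -1
second_is_missing = ismissing(rv[2])
```

Because the wrapper has a single field of type T, it occupies the same number of bytes as T, which is what allows reinterpret to lay it over the existing buffer.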

#2

Cool!


#3

nifty


#4

Excellent idea


#5

Looks great. The ability to do this was an explicit design goal of the missing value stuff, so glad to see people using it.


#6

Thanks for the replies. It’s good to know that this was an explicit design goal. As a follow-up: currently it is not possible to use e.g. skipmissing in combination with this type:

julia> using SentinelMissings

julia> a = as_sentinel([1,2,-1,4],-1);

julia> sum(skipmissing(a))
missing

I think the reason is that this line https://github.com/JuliaLang/julia/blob/master/base/missing.jl#L191 checks for object identity (while item === missing), so it only skips “true missings” and not values that should behave like missings but aren’t. Would there be a disadvantage to replacing this check (and the ones in other places like mapreduce) with ismissing(item)? That would make the missing interface more generic.
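The difference between the two checks can be shown with a small stand-alone sketch. FakeMissing is a hypothetical stand-in type, not part of SentinelMissings.jl or Base.

```julia
# Hypothetical type that declares itself missing via the ismissing predicate.
struct FakeMissing end
Base.ismissing(::FakeMissing) = true

items = Any[1, FakeMissing(), missing, 4]

# Identity check: only the singleton `missing` is dropped.
by_identity = [x for x in items if x !== missing]

# Predicate check: anything that reports itself as missing is dropped.
by_predicate = [x for x in items if !ismissing(x)]

total = sum(by_predicate)
```

With the identity check the FakeMissing element survives the filter, which mirrors what happens to sentinel values inside skipmissing.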

I could start a PR about this, but I don’t know if this would have negative performance effects for other use cases.


#7

Yes, unfortunately ismissing can currently be less efficient (see this issue). I don’t remember whether that’s the case for that particular function, though. You could provide a custom iterate method for Base.SkipMissing{<:Base.ReinterpretArray{<:SentinelMissings.SentinelMissing}}.

But I wonder whether the approach chosen by the package is the best one. Wouldn’t it be simpler to have a SentinelMissingArray object which would wrap another array, and replace the sentinel with missing in getindex? That would be simpler from the user’s point of view.
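A rough sketch of what such a wrapper could look like is below. The type name SentinelArray and its fields are hypothetical, and the sketch only covers scalar indexing.

```julia
# Hypothetical wrapper: presents a plain array as an array of Union{T,Missing}.
struct SentinelArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{Union{T,Missing},N}
    parent::A
    sentinel::T
end

Base.size(a::SentinelArray) = size(a.parent)

# Reads translate the sentinel into a real missing.
# isequal is used so that a NaN sentinel would also match.
function Base.getindex(a::SentinelArray{T,N}, I::Vararg{Int,N}) where {T,N}
    v = a.parent[I...]
    return isequal(v, a.sentinel) ? missing : v
end

# Writes translate missing back into the sentinel, so the parent stays a plain array.
function Base.setindex!(a::SentinelArray{T,N}, v, I::Vararg{Int,N}) where {T,N}
    a.parent[I...] = v === missing ? a.sentinel : v
    return a
end

raw = [1, 2, -1, 4]
sa = SentinelArray(raw, -1)

# Elements are real missings, so the stock skipmissing works unchanged.
total = sum(skipmissing(sa))

# Assigning missing writes the sentinel into the shared parent array.
sa[1] = missing
```

Since getindex hands out true missing values, nothing in Base needs to change, at the cost that the sentinel semantics only exist inside the array wrapper.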


#8

I use similar functionality for raster datasets here: https://github.com/mkborregaard/VerySimpleRasters.jl/blob/master/src/operations.jl#L41 . vsr.mat is an mmapped Matrix. I wonder whether it would be useful to use this package there, or whether that would be overkill?


#9

Why not just return an actual missing from the iterator?


#10

I thought about this as well, and I think for my use cases this would be the best approach. Implementing the missings this way was in some ways an experiment in how far one can push Julia’s type promotion system, and to me it felt more correct because I don’t see a reason why missings represented through sentinel values should only be able to live inside arrays.

Probably I should really change the implementation to what you suggested, which is basically the same approach that @mkborregaard suggested in his post.


#11

The main reason is that I don’t control the iterator. The arrays are plain ReinterpretArrays, and they return whatever is inside them when you iterate over them, which is a SentinelMissing. The reason the operations above work is that I implemented some type promotion and conversion rules that make them work well with Union{T,Missing}. In general this works well; there are just a few places that prevent you from using the full functionality of base/missing.jl.