Why are missing values not ignored by default?

My thoughts on this topic are the following. When I write Julia code I am typically in one of two modes:

  • developer, writing production code or a package;
  • explorer, doing quick and dirty data analysis.

I believe that the current design was made with the needs of the “developer” user in mind. I really do not mind having to write more verbose code if it guarantees that two years later, when the code is read, it is 100% clear what design decisions the developer made when writing it (in the context of this discussion: how the developer wanted missing values to be handled).

In my opinion the “explorer” mode is currently inconvenient, especially for newcomers (but even for me, e.g. when some functions expect AbstractVector{<:Real} and skipmissing does not return such an object).
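To make the AbstractVector{<:Real} point concrete, here is a minimal sketch. The function total_abs is hypothetical, made up only to stand in for any function with a strict signature:

```julia
# Hypothetical function with a strict signature (an assumption for illustration).
total_abs(v::AbstractVector{<:Real}) = sum(abs, v)

x = [1.0, missing, -3.0]

# skipmissing returns a lazy iterator, not an AbstractVector.
skipmissing(x) isa AbstractVector    # false

# So the following call would throw a MethodError:
# total_abs(skipmissing(x))

# One workaround is to materialize the iterator first.
y = collect(skipmissing(x))          # Vector{Float64}
total_abs(y)                         # 4.0
```

The collect step works, but it is exactly the kind of extra ceremony that gets in the way during quick interactive analysis.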

There are several ways the “explorer” mode could be made more convenient. Two major ones are:

  • adding functionality to meta-packages, where you can change the defaults or e.g. wrap the code in some macro that would substitute the functions called;
  • having a new package that would provide the convenience functions (with a separate namespace, e.g. smean, ssum) if someone wanted to use them.

It is clear that design-wise the smean, ssum etc. functions are not clean. I would probably avoid using them in production code. However, for interactive work they are convenient. Also, I think that the cost of creating such a package is low, and no one would be forced to use it.
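As a rough sketch of how little code such a package would need: the names smean and ssum come from this post, but the definitions below are only my assumption of what they could look like, not an existing package.

```julia
using Statistics

# Hypothetical convenience wrappers (assumed definitions, not an existing API):
# each one skips missing values before delegating to the standard function.
smean(x) = mean(skipmissing(x))
ssum(x) = sum(skipmissing(x))

x = [1, missing, 3]

mean(x)     # missing (the default propagates missing values)
smean(x)    # 2.0
ssum(x)     # 4
```

Because the skipping is part of the function name, a line like smean(x) keeps its meaning wherever it is copied.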

The benefit of a package (over macro-based solutions) is that code snippets would be more reusable. If you cut out a line like smean(x) from a larger body of code, there is no risk of mistaking it for mean(x). This matters especially because, as mentioned above, annotating code with macros could change its behavior. An example recently discussed (run on Julia 1.9.2), in which x contains no missing values, yet wrapping it in skipmissing changes the result:

julia> using Random, Statistics

julia> Random.seed!(1234);

julia> x = rand(Float16, 10^6);

julia> mean(x)
NaN16

julia> mean(skipmissing(x))
Float16(0.0)

julia> x = rand(Float32, 10^6);

julia> mean(x)
0.5001906f0

julia> mean(skipmissing(x))
0.50017625f0

In summary: seeing how much discussion this raises, and judging that the effort of creating a separate package providing the s* functions is relatively low, I think it does no harm to have it. Anyone who does not like seeing smean can just ignore the package and not use it.

Such a package does not even need to be curated initially. Someone could simply start developing it. After some time the community would see whether it got adoption, and if so, it could be moved to e.g. JuliaStats. If not, it would be a low-cost failed experiment (we have had many such packages in the past, and I do not think it is a problem).
