Why are missing values not ignored by default?

bkamins · November 30, 2023, 9:58am

My thoughts on this topic are the following. When I write Julia code I am typically in one of two modes:

developer, writing production code or a package;
explorer, doing quick and dirty data analysis.

I believe that the current design was made with the needs of the “developer” user in mind. I really do not mind having to write more verbose code if it guarantees that two years later, when the code is read, it is 100% clear what design decisions the developer made when writing it (in the context of this discussion: how the developer wanted missing values to be handled).

In my opinion the “explorer” mode is currently inconvenient, especially for newcomers (but even for me, e.g. when a function expects an AbstractVector{<:Real} and skipmissing does not return such an object).
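To make the AbstractVector{<:Real} point concrete, here is a minimal sketch. The function `center` is a hypothetical stand-in for any function with such a signature; it is not taken from a real package.

```julia
using Statistics

# Hypothetical function that only accepts vectors of real numbers
center(v::AbstractVector{<:Real}) = v .- mean(v)

x = [1.0, missing, 3.0]

# skipmissing returns a lazy iterator wrapper, not an AbstractVector,
# so center(skipmissing(x)) would throw a MethodError
is_vector = skipmissing(x) isa AbstractVector

# One workaround: materialize the non-missing values first
y = collect(skipmissing(x))   # Vector{Float64}
centered = center(y)
```

The `collect` step works, but it is exactly the kind of extra ceremony that gets in the way during interactive work.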

There are several ways the “explorer” mode could be made more convenient. The two major ones are:

adding functionality to meta-packages, where you can change the defaults or e.g. wrap the code in some macro that would substitute the functions called;
having a new package that would provide the convenience functions (under separate names, e.g. smean, ssum) for anyone who wanted to use them.
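For the first option, a macro that substitutes function calls could look roughly like the sketch below. The names `@skipm` and `rewrite_missing` are hypothetical and this is not the API of any existing meta-package; it only handles one-argument calls to `mean` and `sum`.

```julia
using Statistics

# Walk an expression tree; anything that is not an Expr passes through unchanged
rewrite_missing(ex) = ex

function rewrite_missing(ex::Expr)
    args = map(rewrite_missing, ex.args)
    # Wrap the argument of one-argument mean/sum calls in skipmissing
    if ex.head === :call && length(args) == 2 &&
       (args[1] === :mean || args[1] === :sum)
        return Expr(:call, args[1], Expr(:call, :skipmissing, args[2]))
    end
    return Expr(ex.head, args...)
end

# Hypothetical macro: the code inside it is rewritten before it runs
macro skipm(ex)
    return esc(rewrite_missing(ex))
end

x = [1, missing, 3]
a = mean(x)          # propagates missing
b = @skipm mean(x)   # the same text, mean(x), now skips missing values
```

Note that the text `mean(x)` means two different things depending on whether it sits inside the macro.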

It is clear that design-wise the smean, ssum, etc. functions are not clean. I would probably avoid using them in production code. However, for interactive work they are convenient. Also, I think the cost of creating such a package is low, and no one who does not want to use it would be forced to.
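A minimal sketch of what such functions might look like, assuming they simply wrap their argument in skipmissing (the exact definitions are my assumption; only the names smean and ssum come from the discussion):

```julia
using Statistics

# Assumed definitions: skip missing values, then delegate to the standard function
smean(x) = mean(skipmissing(x))
ssum(x) = sum(skipmissing(x))

x = [1, missing, 3]

m_default = mean(x)   # missing propagates by default
m_skip = smean(x)     # mean of the non-missing values
s_skip = ssum(x)      # sum of the non-missing values
```

Because the behavior is carried by the function name, a line like smean(x) keeps its meaning wherever it is pasted.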

The benefit of a package (over macro-based solutions) is that code snippets would be more reusable. If you cut out a line like smean(x) from a larger body of code, there is no risk of mistaking it for mean(x). This matters especially because, as mentioned above, annotating code with macros could change its behavior. An example that was discussed recently (run on Julia 1.9.2):

julia> using Random, Statistics

julia> Random.seed!(1234);

julia> x = rand(Float16, 10^6);

julia> mean(x)
NaN16

julia> mean(skipmissing(x))
Float16(0.0)

julia> x = rand(Float32, 10^6);

julia> mean(x)
0.5001906f0

julia> mean(skipmissing(x))
0.50017625f0

In summary: given how much discussion this raises, and given that the effort of creating a separate package providing the s* functions is relatively low, I think it does no harm to have it. Anyone who does not like seeing smean can simply ignore the package and not use it.

Such a package does not even need to be curated initially. Someone could just start developing it. After some time the community would see whether it got adoption, and if so it could be moved to e.g. JuliaStats. If not, it would be a low-cost failed experiment (we have had many such packages in the past and I do not think that is a problem).
