Why are missing values not ignored by default?

Yes, and lambdas if you need something a little more complicated. And if you need to reuse them over and over, just bind them using let:

using Statistics  # for mean

let skmean = mean ∘ skipmissing
    # ...
    foo = skmean(bar)
    baz = skmean(quux)
end

However, it has already been shown that this can change the underlying algorithm and accumulate more roundoff error, which is a considerable issue!

I do think there’s a flavor in this discussion of “the common workflow was built around the norms of Python, so why doesn’t Julia preserve those norms?”, when in fact Julia deliberately changed some of those norms because of design decisions that make sense for people coming from the physical sciences and Lisp, and it’s not necessarily optimal to keep using the workflow that was good in Python or R. The full spectrum of available alternatives is not necessarily obvious, particularly if you haven’t programmed in Lisp or Haskell or something else with a more functional flavor.

3 Likes

guys I use plenty of Julia, I’m not just some tourist from Python

I’m aware of the tools that Julia has to offer.

is it really so hard to believe that this use case is just not as ergonomic as it could be in Julia?

Honestly, the use case just doesn’t arise for me, and I’m wondering why, in a completely innocent way that is not trying to be accusatory or dismissive. I honestly don’t KNOW why this non-ergonomic case hasn’t been something I’ve been annoyed by. So far the only thing I can think of is that I process data such that I don’t generate “missing, not applicable” values, and I almost always have some imputation scheme surrounding “missing, value not known” values. I also tend to munge data with DataFramesMeta and to work primarily with individual data points rather than summary statistics…

I am honestly not accusing you of anything here, and I urge you to think of my musings as a guy talking aloud trying to figure out what to think of the question. :male_detective:

Suppose I were trying to write a set of course notes. Could I write some notes which would tell a new user how to do things in a way which will feel ergonomic? That’s my goal.

You might say “but I don’t want to be told to do things differently” and that’s FINE, but if someone doesn’t have a workflow already, could we describe one that avoids the non-ergonomic issues? I honestly believe others are on the same track as me too.

Perhaps I should branch this off on a separate issue… Then I’m not hijacking your issue? I’ll figure that out

Ok done I’ve got a separate thread linked above.

3 Likes

Imagine you have a dataset containing name, age, gender, salary, etc. The information is voluntary, and people decide randomly whether or not to disclose some of it. There’s nothing you can do about the missing observations, and you’re aware of those missing values. Moreover, whatever the reason, imputation is not what you want, nor what people typically do in your field of work.

You load the dataset, you’re aware of these missing values, and there’s nothing you can do about it. And you’ll have to work for 8 hours with this dataset. During those 8 hours, every two lines you have to think again about missing values, because Julia is telling you “did you realize you have missing values?” and you need to answer “yes, Julia, I don’t care about those missing values, please ignore them”.

That’s when you say “Man, I wish I could have the option of declaring these missing values at the very beginning as a special type of value that could be ignored”. If you work for one hour, it’s not a problem. Now, imagine how you feel after Julia interrupts you with that question after 6 hours… In fact, my original post was a comment after I had to deal with this feature for 8 hours.
On top of this, the very next day you work on the same dataset and have to re-read your code, which now contains lines and words added everywhere only to handle those missing values, hindering readability when you have 3,000 lines of code.

Overall, the whole point is not whether you can use imputation nor whether there are ways to easily handle missing values. The point is that for this type of work, missing values hinder your work, diverting your attention for 8 hours from the analysis you’re conducting.

Nor is it about the importance of missing values as such. For some types of work, I want exactly the current behavior: in some contexts I don’t want to ignore missing values and need to consider them carefully. For instance, in my type of work, NaNs almost always indicate that I’m doing something wrong.

That’s the best I can explain it! Hahah, if you still can’t see where we’re coming from, the only way to understand it is by experiencing it yourself. And that would only happen if you do a similar type of work (I guess it’s not a coincidence that the people complaining about this work in economics/finance).

9 Likes

So I won’t get into my personal ideas about that, because I have a lot of very strongly held opinions there, but this is more or less for another day and another context.

So, what does it take to do this? I’m thinking out loud:

using DataFrames  # for ncol and df[!, i]

struct Ignorable end
const ignore = Ignorable()

isignorable(x) = x isa Ignorable

function miss2ignore!(df)
    for i in 1:ncol(df)
        df[!, i] = replace(df[!, i], missing => ignore)
    end
    return df
end
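For what it’s worth, a quick usage sketch of the above (the toy DataFrame here is purely illustrative):

using DataFrames

df = DataFrame(a = [1.0, missing, 3.0], b = [missing, 2.0, 4.0])
miss2ignore!(df)   # every missing in every column is now the `ignore` sentinel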

Ok, so now it’s pretty easy to just replace all the missing with ignore. I guess the next part is rewriting summary stats functions? They’d look like:

import Statistics  # extend the existing function rather than shadow it

function Statistics.mean(x::Vector{Union{Ignorable,T}}) where T
    xig = filter(!isignorable, x)
    return Statistics.mean(xig)
end
...

This pattern is basically the same for every unary summary function, so it’d be good to have a macro @declareignore1 that writes this code for each element of an array; then you could say

@declareignore1 [:mean,:median,:sum,:prod...]
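Just to make that idea concrete, here is one rough sketch of what such a macro could look like (the name @declareignore1, like the whole Ignorable approach, is hypothetical; it also assumes the listed functions were brought into scope with import, so that the generated methods extend them rather than shadow them):

macro declareignore1(fnames)
    defs = map(fnames.args) do f
        fname = f isa QuoteNode ? f.value : f   # accept :mean or mean
        quote
            function $(esc(fname))(x::AbstractVector{Union{Ignorable,T}}) where T
                $(esc(fname))(filter(!isignorable, x))
            end
        end
    end
    return Expr(:block, defs...)
end

@declareignore1 [:mean, :median, :sum]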

For two-argument summary functions like cor, a different macro would presumably be needed, to filter out pairs where one or both of the entries are ignorable. That also seems pretty straightforward: you might get away with preallocating a pair of vectors, iterating through while skipping over the ignorable entries, then resize!-ing the vectors and running the function:

function Statistics.cor(x::Vector{Union{Ignorable,T}}, y::Vector{Union{Ignorable,T}}) where T
    xx = Vector{T}(undef, length(x))
    yy = Vector{T}(undef, length(y))
    i = 1
    for j in eachindex(x)
        if isignorable(x[j]) || isignorable(y[j])
            continue
        else
            xx[i] = x[j]
            yy[i] = y[j]
            i += 1
        end
    end
    resize!(xx, i - 1)
    resize!(yy, i - 1)
    return Statistics.cor(xx, yy)
end

So basically write a macro that writes that … and then declare your cor and cov and whatever. (You could also do this somewhat functionally: create a function that does the filtering, then define cor and cov as just calling the filtering function and applying the usual cor and cov, etc.)
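As a rough sketch of that functional variant (still assuming the hypothetical Ignorable/isignorable definitions above; dropignorable is just a made-up name):

import Statistics

# Drop positions where either vector holds an Ignorable, then use identity.()
# to narrow the element types so the cleaned vectors no longer involve Ignorable.
function dropignorable(x::AbstractVector, y::AbstractVector)
    keep = map((a, b) -> !(isignorable(a) || isignorable(b)), x, y)
    return identity.(x[keep]), identity.(y[keep])
end

Statistics.cor(x::Vector{Union{Ignorable,T}}, y::Vector{Union{Ignorable,T}}) where {T} =
    Statistics.cor(dropignorable(x, y)...)
Statistics.cov(x::Vector{Union{Ignorable,T}}, y::Vector{Union{Ignorable,T}}) where {T} =
    Statistics.cov(dropignorable(x, y)...)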

I think after about 50-100 lines of code in either a small package or just a utility script you can include, you’ve got all the tools you’d normally need?

I guess now your big problem will come with regular arithmetic and comparisons and such?

One can have missing values when using multiple time series, which are then aligned in ‘asof’ joins; there’s nothing wrong with the sensors but the missing values are created by differences in the sampling timestamps.

However, I find it easiest to use DuckDB for as much of the workflow as possible and only materialize the final results in memory as a dataframe at the end, and its aggregate functions ignore nulls by default. With the modern expressions like asof joins, the complex structures you can create within a dataframe cell, and the new dot notation, I think it’s getting easier to write as much as possible in a language-neutral format and push more data processing out of memory.

DuckDB.execute(con, "from df 
                     select x1.mean() filter(x1 > 0.5), 
                            corr(x1, x2)")
1 Like

I don’t think people are dismissing your use case; it’s just that all of the proposed automatic handlings of missing do not fit their use cases, so obviously they’d be against making those behaviors the default, which is what the title of the thread asks about. They would be just as justified in arguing for making their own ways of handling missings the default for their own convenience, but they have elected not to dismiss other ways as “not data science”.

You even offered an example of both in the same method: it does imputation for a=missing but propagates b=missing. It’s hard to justify making that the default instead of ignoring b=missing, or ignoring both, or imputing true (“can’t refute it ever”) instead of false (“can’t prove it ever”).

It seems like you have a problem with the dataset rather than with the operations or the missing semantics, and it would be simpler to modify the dataset than to make new versions of operations and missings. Is there really nothing you can do up front, instead of boilerplate every couple of lines? dropmissing already does listwise deletion, even lazily and within selected columns. Maybe we could do something similar for pairwise deletion with some column-wise skipmissing setting (that is, skipmissing is automatically applied on indexing). groupby and subset already have a skipmissing setting to handle missings in their particular ways. As for imputation, it’s hard for me to imagine such a setting because what value you insert, at what phase of what operation, involves too many factors. For the ishigheeq examples at least, a higher-order function can avoid multiple method definitions:

julia> falsefirstmissing(op, a, b) = !ismissing(a) && op(a, b)
falsefirstmissing (generic function with 1 method)

julia> const ffm = falsefirstmissing
falsefirstmissing (generic function with 1 method)

julia> ffm.(>=, [missing, 1, 1], [1, 1, missing])
3-element Vector{Union{Missing, Bool}}:
 false
  true
      missing

julia> [missing, 1, 1] .>= [1, 1, missing] # compared to this
3-element Vector{Union{Missing, Bool}}:
     missing
 true
     missing
1 Like

There is also my Imputation module in BetaML that imputes (optionally multiple times) using random forests or any other model with a m = Model(); fit!(m,x,y); predict(m,x) API.

“Automatic Imputation” then becomes as simple as X_imputed = fit!(RFImputer(),X_with_missing_values)

2 Likes

I think that’s very fair to ask, especially since Julia has NaNs that behave exactly like that:

julia> !(1 ≤ NaN)
true

julia> NaN ≤ 1
false

I think it’s more natural for missing to propagate in this case but as you say that is a design choice.

1 Like

At this point I think it would really help to have a few concrete comparisons of Julia vs other languages, with runnable code. Then we can experiment with actual solutions.

So far I think all we have is:

This would be well served by the proposed short skipping functions for the common cases, e.g. skcor here (and this example doesn’t look good for polars when the user wants to propagate missings).
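For concreteness, one possible shape for such a skipping helper (skcor is just the name suggested above, and this sketch assumes pairwise deletion is the behavior you want):

using Statistics

function skcor(x::AbstractVector, y::AbstractVector)
    keep = map((a, b) -> !(ismissing(a) || ismissing(b)), x, y)
    # identity.() narrows the element type so cor sees plain numbers
    return cor(identity.(x[keep]), identity.(y[keep]))
end

skcor([1, 2, missing, 4], [2, missing, 6, 8])   # correlates the remaining (1,2) and (4,8) pairs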

1 Like

My thoughts on this topic are the following. When I write Julia code I am in one of two modes typically:

  • developer, writing production code or a package;
  • explorer, doing quick and dirty data analysis.

I believe that the current design was made with the needs of the “developer” user in mind. I really do not mind having to write more verbose code that guarantees that, 2 years later when the code is read, it is 100% clear what design decisions the developer made when writing it (in the context of this discussion: how the developer wanted missing values to be handled).

In my opinion the “explorer” mode is currently inconvenient, especially for newcomers (but even for me, when e.g. some functions expect AbstractVector{<:Real} and skipmissing does not return such an object).

There are several options for how the “explorer” mode could be made more convenient. Two major ones are:

  • adding functionality to meta-packages, where you can change the defaults or e.g. wrap the code in some macro that would substitute the functions called;
  • having a new package that would provide the convenience functions (with separate namespace, e.g. smean, ssum) if someone wanted to use them.

It is clear that design-wise the smean, ssum etc. functions are not clean. I would probably avoid using them in production code. However, for interactive work they are convenient. Also, I think the cost of creating such a package is low, and no one would be forced to use it if they do not want to.
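To be explicit about what such convenience functions would amount to, a minimal sketch (smean/ssum are the names floated above):

using Statistics

smean(x) = mean(skipmissing(x))
ssum(x)  = sum(skipmissing(x))

smean([1, missing, 3])   # 2.0
ssum([1, missing, 3])    # 4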

The benefit of a package (as opposed to macro-based solutions) is that code snippets would be more reusable. If you cut a line like smean(x) out of a larger body of code there is no risk of thinking it was mean(x). Especially since, as mentioned above, annotating code with macros could change its behavior. An example recently discussed (run on Julia 1.9.2):

julia> using Random, Statistics

julia> Random.seed!(1234);

julia> x = rand(Float16, 10^6);

julia> mean(x)
NaN16

julia> mean(skipmissing(x))
Float16(0.0)

julia> x = rand(Float32, 10^6);

julia> mean(x)
0.5001906f0

julia> mean(skipmissing(x))
0.50017625f0

In summary: seeing how much discussion this raises, and judging that the effort of creating a separate package providing the s* functions is relatively low, I think it does no harm to have one. If someone does not like seeing smean, they can simply ignore the package and not use it.

Such a package does not even need to be curated initially. Someone could just start developing it. After some time the community would see whether it got adoption and, if so, it could be moved to e.g. JuliaStats. If not, it would be a low-cost failed experiment (we have had many such packages in the past and I do not think it is a problem).

20 Likes

Should point out that comparisons not propagating NaN are part of the IEEE 754 floating-point standard, not something particular to Julia.

julia> mycall(f, args...) = f(args...);

julia> mycall.((>, <=, >=, <, ==, !=), NaN, 1.0)
(false, false, false, false, false, true)

julia> mycall.((>, <=, >=, <, ==, !=), 1.0, NaN)
(false, false, false, false, false, true)

julia> mycall.((>, <=, >=, <, ==, !=), NaN, NaN)
(false, false, false, false, false, true)

The logic is precisely the aforementioned unordered relations being considered false; in other words, NaN is incomparable to any value. != is not an actual exception. With comparable elements, either >= or <= must be true, and != just negates == (i.e. both >= and <=); when NaN is involved, >= and <= both being false doesn’t change what != means. Practically, x != x retains the same ability as x == x to implement isnan(x).

Evidently, these semantics are NOT shared by the propagating missing. This makes sense because NaN extends the totally ordered floating points to a partial order, but missing can extend sets with no partial order at all; relations like >= or <= won’t exist, let alone be false. Could you implement missing to be incomparable to totally ordered types? Sure, but then you’d sacrifice a consistent propagation trait only for type unions to fall short of sets in a rigorous mathematical sense anyway. For example, >= and <= of floating points always give Bool, but they throw MethodErrors for Union{Float64, Missing, Expr} because of the Expr alone, so that type union cannot behave like a partially ordered set even if missing never propagated. The IEEE 754-standard partial order can be another reason that NaN is still useful in a world with missing, on top of performance and more robust type stability; missing was never meant to replace NaN.
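A quick REPL illustration of that contrast (standard behavior, nothing new):

julia> NaN >= 1.0    # IEEE 754: an unordered comparison is simply false
false

julia> missing >= 1  # missing propagates instead
missing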

2 Likes

sure it can exist in a perfectly rigorous way

a partial order on a set X is just a list of pairs (a,b) ∈ S ⊂ X, which one might represent with an indicator function: ≤(a, b) --> ((a,b) ∈ S)

the additional restriction that a total order gives you is that for any a, b ∈ X, either (a,b) ∈ S or (b,a) ∈ S.

1 Like

Correction: (a,b) ∈ S ⊆ X × X (× meaning Cartesian product), and a total order means every pair in X × X is comparable. Partial orders include total orders.

No, there are sets with no partial orders. This is somewhat the default for >=, <=, >, < in Julia, because custom types don’t have a fallback isless implementation. Despite those throwing MethodErrors within such types, they still propagate missing. Julia isn’t rigorous set theory, and there’s no benefit to force it to be.

2 Likes

correction accepted :slight_smile:

there’s no benefit to force it to be.

Right, I understand. All I am saying is that I would really like us to agree that x < missing = false is not “wrong”, but simply a reasonable choice that could have been made (but was not, due to other considerations).

I think I have said this quite a few times and been told (not by you) somewhat emphatically that x < missing = missing is simply the only correct option, and anyone who desires otherwise is being sloppy.

I personally must disagree due to missing being intended to work with any other type, including those lacking any implementation of partial order. After all, if typeof(x) does not implement comparisons, it is unreasonable for a comparison to fall back to false as if unordered relations exist. missing would also no longer have a consistent behavior, which other dissenters expressed appreciation for.

NaN on the other hand works with a floating point type, and it’s an instance of that type, not some outsider type like Missing. There it makes sense to implement partial order for each concrete subtype of AbstractFloat. This seems like the reasonable choice since the implication so far was comparison with real numbers. Maybe NaN usage can be more streamlined there.

No definition is wrong :slight_smile: But I would say that defining x ≤ missing = false is not useful.

Let’s say that x ≤ missing = false.

  1. Then a ≤ a does not hold for a=missing so we lose reflexivity.

  2. Even in partially-ordered sets, either a ≤ b or b ≤ a (or both) must hold, assuming a and b are comparable. But when a=missing, a ≤ b is false and b ≤ a is false.

Relation would not be a partial order anymore, would it? Could you comment on that?

no

assuming a and b are comparable

this “assumption” makes it now a total order, not a partial order.

Could you comment on that?

Yes, and I have, already, multiple times. Please see condition 4 in the wiki definition of a total order, the condition which is absent from the definition of a partial order. A partial order is not defined as a boolean function; it is defined as a collection of pairs (and in code it is convenient to represent this collection of pairs using an indicator, i.e. boolean, function). But for incomparable elements this indicator will be false.

Also note that if missing ≤ missing is false then it could at best be called a strict partial order, not a non-strict one; for a non-strict (reflexive) partial order, missing ≤ missing would have to return true.

Nothing personal, but I think my continued participation in this thread will not be productive :slight_smile:

1 Like

Good catch. I suppose floating points really are not partially ordered by <=, because NaN <= NaN is false.

That’s what incomparable pairs are. A set with a partial order doesn’t need every pair to be comparable; that is, the partial order can be a strict subset of the Cartesian product of the set with itself.

This is strong connectedness, the additional condition for a partial order to also be a total order.

2 Likes

I see it now! I mixed up the programming and mathematical points of view. Thank you and @Benny for pointing that out.

1 Like