How does StatsBase.skewness work?

I did not say “people won’t adopt Julia unless X”. I was just sharing my opinion on the subject. Thanks.


Hi Tamas,
I would like to come back to your argument (for me, the missing-values topic is settled in principle)…

… you only convince the development community of Julia. You can only convince users who want to model mathematical problems or, like me, perform data analysis if Julia is easy to use, fast and stable. But if you tell them that, to deal with missing values, they have to call this and that function in addition, their motivation is certainly not that high. Let me give you an example from another area (sorry for that): the car industry no longer advertises torque or overhead camshafts, but driving comfort and safety.

But I don’t want to belabor the point; I just want to give suggestions.
Cheers,
Guenter

I am not sure how you came to this conclusion — unfortunately, I don’t have reliable data on this, and extrapolating from my own preferences is not a substitute.

That said, even if there are users who would prefer a DWIM-style interface that makes some choices for you, I think it is totally fine for Julia to appeal to the people who prefer to make these choices explicit. You can’t please everyone.

Regarding missing values, handling them automatically can be particularly insidious outside of simple cases. Consider the covariate matrix in a regression (without intercepts):

X = [1 2;
     3 4;
     5 missing;
     missing 10]

“Dropping missing automatically” here in OLS would presumably drop rows 3 and 4. What should

mapslices(mean, X; dims = 1)
mapslices(skewness, X; dims = 1)

do, drop one element from each column automatically? That would make it inconsistent with other calculations. The small price you pay for being explicit about how you handle missing values saves you from a lot of potential bugs.
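To make the inconsistency concrete, here is a minimal sketch using only the Statistics standard library. The helper names `colmean_skip` and `complete_rows` are hypothetical (not from any package), and `mean` stands in for `skewness` to avoid a StatsBase dependency.

```julia
using Statistics

X = [1 2;
     3 4;
     5 missing;
     missing 10]

# Column-wise: each column drops only its own missing entries.
colmean_skip(X) = [mean(skipmissing(col)) for col in eachcol(X)]

# Row-wise: keep only the rows that contain no missing entry at all,
# which is presumably what an OLS fit would do.
complete_rows(X) = X[[!any(ismissing, row) for row in eachrow(X)], :]

by_column = colmean_skip(X)                               # means of (1, 3, 5) and (2, 4, 10)
by_row    = [mean(col) for col in eachcol(complete_rows(X))]  # means of (1, 3) and (2, 4)
```

The two strategies use different subsets of the data and give different answers, so the choice has to be made somewhere; making it explicit keeps it visible in the code.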


Alternatively, someone is free to write a package with more DWIM, and users who like Julia but prefer that behavior are welcome to use it. I can certainly see the value, and might even want that sort of behavior in some circumstances. But I definitely think the default behavior should require explicitness.


I guess one could create a wrapper type, SkipMissingArray <: AbstractArray. While skipmissing returns an iterator, skipmissingarray returns a SkipMissingArray.

It would solve the weighted average issue, see https://discourse.julialang.org/t/how-to-calculate-a-weighted-mean-with-missing-observations/1928, because one could define a particular method mean(x::SkipMissingArray, w::Weights(<:SkipMissingArray)).

One issue is that functions tend to output Vector{Union{Missing, Array}} so users would always need to convert every created vector into a SkipMissingArray. It would be very cumbersome. I’m not sure one can write a package that acts as if the default behavior of Julia was to automatically skip missings.

One problem with that approach is that skipmissing is very memory efficient because you can’t index a SkipMissing object. (You can’t do skipmissing(x)[4] for example). This is a trade-off that makes sense because as stated above, the vast majority of functions don’t really need an array input. Actually creating an object that has all the behaviors required for an Array likely requires looking at the original object in a less efficient way.

The issue you linked to isn’t quite the same. It is more about making sure the iterators are synced, without necessarily allowing indexing of each one.
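As a rough illustration of what keeping values and weights in sync means, here is a minimal sketch in plain Julia. The function name `weighted_mean_skipmissing` is hypothetical and is not part of StatsBase; weights are passed as an ordinary vector rather than a Weights object.

```julia
# Hypothetical helper: walk values and weights together, so that a missing
# value drops its own weight as well. No indexing of a skipped view is needed.
function weighted_mean_skipmissing(x::AbstractVector, w::AbstractVector{<:Real})
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    num = 0.0
    den = 0.0
    for (xi, wi) in zip(x, w)
        ismissing(xi) && continue   # skip the value and its weight together
        num += wi * xi
        den += wi
    end
    return num / den
end

wm = weighted_mean_skipmissing([1.0, missing, 3.0], [1, 5, 3])
# (1*1 + 3*3) / (1 + 3) = 2.5
```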


In cases where the skipmissing iterator is enough, it would just use it, i.e.:

mean(x::SkipMissingArray) = mean(skipmissing(x))

The key point is that the result of skipmissingarray is an AbstractArray, so that, from the user’s point of view, it still behaves like a normal array, e.g. one can have a DataFrame of SkipMissingArrays, etc.
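A minimal sketch of such a wrapper might look as follows. SkipMissingArray and skipmissingarray do not exist in Base or in any package I know of; the definitions below are assumptions made purely for illustration, and only `mean` is specialized.

```julia
using Statistics

# Hypothetical wrapper: same size and indices as the wrapped array.
struct SkipMissingArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    parent::A
end

skipmissingarray(x::AbstractArray) = SkipMissingArray(x)

# Minimal AbstractArray interface, forwarded to the wrapped array.
Base.size(x::SkipMissingArray) = size(x.parent)
Base.getindex(x::SkipMissingArray, i::Int...) = x.parent[i...]

# Reductions opt in to skipping by delegating to the existing iterator.
Statistics.mean(x::SkipMissingArray) = mean(skipmissing(x.parent))

x = skipmissingarray([1, missing, 3])
n = length(x)   # 3: the wrapper keeps the length of the underlying array
m = mean(x)     # 2.0: the mean of 1 and 3
```

Every other reduction (skewness, var, and so on) would need its own method in the same style, which is part of the cost of this design.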

That would only be possible if all columns had the same number of missing values, or they wouldn’t have the same length.

Anyway, skipmissing already returns a SkipMissing object for which special methods can be defined (this is already used a lot for reductions in Base). As I noted in the thread about the weighted mean, this could perfectly well be used there.


The length of a SkipMissingArray would be exactly the same as that of the underlying array. Think of it as a wrapper that basically applies the option skipmissing = true by default to mean, skewness, etc.
I’m not sure it is the solution to all the problems, but it may be one way to implement automatic skipmissing through external packages.

I had thought about that before, and I don’t think that can work. The AbstractArray interface assumes that the length and the number of indices are consistent with each other. I guess you could return indices including missing values, and only skip missing values during iteration, but many functions do for i in eachindex(X); X[i]... rather than for x in X; x..., in which case you will have to either return missing or throw an error for missing values.
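The mismatch between the two loop styles can be reproduced with a small sketch. The type `SkipOnIterate` and the two summing functions are hypothetical names made up for this illustration; the wrapper keeps all indices but skips missing values during iteration only.

```julia
# Hypothetical wrapper: indices include the missing entries,
# while iteration skips them by delegating to skipmissing.
struct SkipOnIterate{T,A<:AbstractVector{T}} <: AbstractVector{T}
    parent::A
end

Base.size(x::SkipOnIterate) = size(x.parent)
Base.IndexStyle(::Type{<:SkipOnIterate}) = IndexLinear()
Base.getindex(x::SkipOnIterate, i::Int) = x.parent[i]
Base.iterate(x::SkipOnIterate) = iterate(skipmissing(x.parent))
Base.iterate(x::SkipOnIterate, state) = iterate(skipmissing(x.parent), state)

# Style 1: for x in X
function sum_by_iteration(X)
    s = 0
    for x in X
        s += x
    end
    return s
end

# Style 2: for i in eachindex(X)
function sum_by_index(X)
    s = 0
    for i in eachindex(X)
        s += X[i]
    end
    return s
end

x = SkipOnIterate([1, missing, 3])
a = sum_by_iteration(x)   # 4: the missing entry is never visited
b = sum_by_index(x)       # missing: X[2] returns missing and propagates
```

The same object reports a length of 3 but yields only two elements when iterated, which is the kind of inconsistency the AbstractArray interface does not expect.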
