How does StatsBase.skewness work?


#1

Hello,

My question is about how to apply the skewness function. I admit that I am not (yet) a Julia expert, so please bear with me.

I have a DataFrame Rotwein and when I make the following function call …

julia> StatsBase.skewness(Rotwein[:residual_sugar])

… I get the following error message:

julia> StatsBase.skewness(Rotwein[:residual_sugar])
ERROR: MethodError: no method matching skewness(::Array{Union{Missing, Float64},1})
Closest candidates are:
  skewness(::Distributions.DiscreteUniform) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\univariate\discrete\discreteuniform.jl:58
  skewness(::Distributions.Hypergeometric) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\univariate\discrete\hypergeometric.jl:61
  skewness(::Distributions.EmpiricalUnivariateDistribution) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\empirical.jl:46
  ...

Okay, doesn’t seem to work with the DataFrame type. Now I convert this vector of the DataFrame into the vector M1:

julia> M1
1599-element Array{Union{Missing, Float64},1}:
  7.4
  7.8
  7.8
 11.2
  7.4
  7.4
  7.9
  7.3
  7.8
  7.5
  ⋮
  6.3
  5.4
  6.3
  6.8
  6.2
  5.9
  6.3
  5.9
  6.0

and then call the function again:

julia> StatsBase.skewness(M1)
ERROR: MethodError: no method matching skewness(::Array{Union{Missing, Float64},1})
Closest candidates are:
  skewness(::Distributions.DiscreteUniform) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\univariate\discrete\discreteuniform.jl:58
  skewness(::Distributions.Hypergeometric) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\univariate\discrete\hypergeometric.jl:61
  skewness(::Distributions.EmpiricalUnivariateDistribution) at C:\Users\guent\.juliapro\packages\Distributions\WHjOk\src\empirical.jl:46

And I get the same error message. I don’t understand what is wrong. Isn’t it enough to simply pass a vector, as in R?

> library(e1071)
> skewness(Rotwein$residual.sugar)
[1] 4.53214

Does anyone have a clue for me?
Thank you,
Guenter


#2

The skewness function is defined essentially on arrays of real numbers (AbstractArray{<:Real}), but in your case you have a Vector{Union{Missing, Float64}}, meaning your vector might contain missing (null) values. You can ignore missing values with skewness(skipmissing(M1)), or you could treat them in a particular way, for example replacing them with 0 via skewness(map(x->coalesce(x, 0.0), M1)). In general, Julia requires users to state explicitly how missing values should be handled rather than assuming the user wants them handled in one way or another. That’s of course always up for debate; for example, perhaps skewness should just call skipmissing itself by default, but that would be something to open an issue about in StatsBase.jl.


#3

I think skewness(skipmissing(M1)) won’t work, you need skewness(collect(skipmissing(M1))).
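
To make the difference between the two calls concrete, here is a minimal sketch using only Base Julia. The vector v is a made-up stand-in for the Rotwein column, not data from this thread. It shows that skipmissing returns a lazy iterator rather than an array, which would explain why an array-only method rejects it, and that collect turns the result into a plain Vector{Float64}.

```julia
# Made-up stand-in for Rotwein[:residual_sugar]; the values are illustrative only.
v = Union{Missing, Float64}[1.9, 2.6, missing, 1.8]

# skipmissing wraps v lazily: nothing is copied, and the result is not an array.
itr = skipmissing(v)
is_array = itr isa AbstractArray

# collect materialises the non-missing values into a Vector{Float64}.
dense = collect(itr)

# Alternative from post #2: replace missing values instead of dropping them.
zero_filled = map(x -> coalesce(x, 0.0), v)

println(is_array)
println(typeof(dense), " with ", length(dense), " elements")
println(zero_filled)
```

Dropping and zero-filling give different vectors (three values versus four), so they will generally give different skewness values as well; which one is appropriate depends on what a missing entry means in the data.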


#4

Thanks for pointing out that “Missing values” was displayed. The whole “red wine” dataset contains no missing values.

The hint from nalimilan was helpful (see there), only that doesn’t make sense either, because no values are missing…!?
I’m still confused about the function behavior, but with this hint it works. :wink:

Thank you,
Guenter


#5

Thanks, that’s how it worked, even with DataFrames! Only I don’t understand it because there are no missing values marked in the vector…! But here is the working example:

julia> M1 = collect(skipmissing(Rotwein[:residual_sugar]))
1599-element Array{Float64,1}:
 1.9
 2.6
 2.3
 1.9
 1.9
 1.8
 1.6
 1.2
 2.0

As expected, there are still 1599 observations. But now the function call works:

julia> StatsBase.skewness(M1)
4.536394788805635

It also works directly on the DataFrame column:

julia> StatsBase.skewness(collect(skipmissing(Rotwein[:residual_sugar])))
4.536394788805635

R does not detect missing values where there are none. As I said, I am a little surprised.

Thank you,
Guenter


#6

I don’t think this would be a good idea. The next request would be that it also calls skipnothing, then filter(!isnan), etc. I think it is much better to expect valid input from the user, but at the same time make it easy to satisfy this requirement. skewness(skipmissing(...)) is much more informative to read because I know precisely what is going on.


#7

The function doesn’t know there are no missing values; it goes by the type of the vector, which you can see in the error message. I’m not sure where that type signature came from, but if the column really has no missing values, you can replace it with a pure Float64 vector.
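
A short sketch of that point, with a made-up vector: the element type allows missing even when no value is actually missing, and dispatch only sees the type. One way to narrow the type in Base Julia is convert, which throws an error if a missing value is present. (The function disallowmissing from Missings.jl, re-exported by DataFrames, serves the same purpose.)

```julia
# Made-up data: no missing values, but the element type still permits them.
v = Union{Missing, Float64}[7.4, 7.8, 11.2]

println(eltype(v))            # the element type that dispatch sees
println(any(ismissing, v))    # whether a missing value is actually present

# Narrow the element type; this throws an error if v contains a missing value.
w = convert(Vector{Float64}, v)
println(eltype(w))
```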


#8

Perhaps we should think about the idea of ignoring missing values in the argument? What contribution do missing values have to a distribution? It is more important to have a sufficient number of observations to estimate the skewness. Then a hint that too few observations are available would be more helpful. Or not?


#9

This question can have very complex answers, depending on how the values ended up missing; this is probably beyond the scope of StatsBase. I was simply making the point that automatically applying skipmissing

  1. is tantamount to assuming MCAR (missing completely at random), implicitly making a choice that should be made by the user,
  2. also masks bugs in the caller’s code (if missing values were unexpected),
  3. opens the door to requests for other kinds of input sanitization.

#10

Maybe, but that’s a decision one should make explicitly. If one has missing values, the distribution might be quite different than what you would think based on the observed values.
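
A toy illustration with invented numbers: if the largest observation happens to be the one that is missing, the skewness of the observed values differs markedly from that of the full data. The helper moment_skewness is a hypothetical stand-in written for this sketch (third central moment divided by the 1.5 power of the second, both with 1/n weights), not the StatsBase implementation.

```julia
using Statistics

# Hypothetical helper for this sketch: moment-based skewness with 1/n weights.
function moment_skewness(v::AbstractVector{<:Real})
    m = mean(v)
    d = v .- m
    return mean(d .^ 3) / mean(d .^ 2)^1.5
end

full     = [1.0, 2.0, 3.0, 10.0]                            # what was really there
observed = Union{Missing, Float64}[1.0, 2.0, 3.0, missing]  # the large value went missing

s_full = moment_skewness(full)
s_obs  = moment_skewness(collect(skipmissing(observed)))

println("full data:     ", s_full)   # right-skewed, so positive
println("observed only: ", s_obs)    # symmetric around 2, so zero
```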


#11

I would love for skipmissing = true to be the default. The way I see it, if a user is concerned with missing values, she can use containers that restrict the presence of missing values. I disagree that one default is more explicit than the other.


#12

It is a good idea to provide such functions with an attribute like “missing = true/false”. Then the user also has control over the function. I can also understand the hint from Tamas, but in the end the functions should be user-friendly and not too nested.

That may be, but it is more a problem of the number of observations. If I extend a vector (a DataFrame I leave out) that contains enough observations by missing values, I certainly don’t change the distribution. If I then exclude the missing values by collect(skipmissing(…)), why not use it as a function attribute? But thanks again for your answers, I just wanted to give a suggestion.


#13

You mean a keyword argument? Lazy iterators like skipmissing are much more idiomatic in Julia, allowing composition and transformations.
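
For instance, because skipmissing yields an ordinary iterator, it composes with any reduction that accepts iterators. A brief sketch with a made-up vector, using only Base and the Statistics standard library:

```julia
using Statistics

v = Union{Missing, Float64}[1.0, missing, 3.0]   # made-up data

total   = sum(skipmissing(v))        # plain reduction over the non-missing values
average = mean(skipmissing(v))       # Statistics.mean accepts iterators
sumsq   = sum(abs2, skipmissing(v))  # transformation applied on the fly
nmiss   = count(ismissing, v)        # how many values are being skipped

println((total, average, sumsq, nmiss))
```

None of these calls needs its own keyword argument for missing values; the same wrapper works for all of them.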


#14

For the data analyst, dealing with missing values is a daily business. Anything that makes this easier is welcome.


#15

One thing that might be useful in your case:

dropmissing(Rotwein, disallowmissing = true)

which will drop all rows that contain missing - I understand in your case this doesn’t affect any columns, so the effect will essentially be to coerce all Union{T, Missing} columns to just being of type T.

You can also deal with missing upfront when reading in your data, e.g.

CSV.read("Rotwein.csv", allowmissing = :none)

This returns only “pure” columns rather than unions. Note that it only works if there actually aren’t any missing values in your file; if some columns do contain missing values, you might want to go for allowmissing = :auto.

Coming from Python, R, Stata I agree that Julia’s approach to missing seemed rather pedantic and cumbersome at first, but I think I’ve now gotten around to this way of thinking more explicitly about the problem (and trying to purge all Unions where I don’t need them!)


#16

allowmissing = :none or allowmissing = :auto is certainly a good idea for clarity, convenience and performance if you know there are no missing values. We use allowmissing = :always by default because parsing CSV files currently fails if missing values are not encountered during the type detection phase. This might change once CSV.jl is able to handle this situation better.

For cases where there are missing values and you want to skip them, Julia just behaves like R, except that skipmissing is a function while in R na.rm (or useNA, or use="na.or.complete"…) is an argument. That can be inconvenient when you deal with missing values all the time (I do), but so far we haven’t found any solution which would both be safe by default, but would allow opting-in for skipping missing values by default. See also this blog post for some context and rationale.


#17

Thanks for the clues, nilshg & nalimilan! I will play around with the hints a little to get to know the possibilities. Of course I don’t want to lose control over the missing values, because it is always important to know whether and how many missing values are present.
I also believe, as of today, that the missing values should be read in via CSV.jl, because postprocessing in Julia leads to more knowledge about the data set.

Thanks also to all who responded, it was/is an interesting and substantial discussion!


#18

In terms of the adoption of Julia in the data analysis community, I can imagine a ton of people being put off by all the current problems with missing values (this issue with skewness, the related issue with weighted functions).
In contrast, I can’t imagine anyone not picking up Julia because it automatically skips missing values.


#19

I think the community has become pretty immune to the “people won’t adopt Julia unless you do what I want” arguments; especially as quite a few parts of various APIs were hammered out patiently through multiple (breaking) iterations instead of just picking what Matlab/R/Python does, and Julia’s adoption has kept steadily increasing.

If you want to convince people about a particular API choice, I would recommend you base your arguments on technical merits.


#20

BTW, this is an area where improvements can be made without too much work and even without a deep experience with Julia development. One basically needs to go over all problematic functions and ensure they accept any iterator instead of just arrays. That’s a great task for a first pull request.
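
As a rough sketch of what such a change could look like, here is a moment-based skewness that accepts any re-iterable collection instead of only arrays. The name iter_skewness and the two-pass algorithm are assumptions made for illustration; this is not the StatsBase code.

```julia
# Hypothetical iterator-friendly skewness (1/n moments); not the StatsBase implementation.
# Assumes itr can be iterated twice (arrays, skipmissing wrappers, generators over collections).
function iter_skewness(itr)
    # First pass: count the observations and compute the mean.
    n = 0
    s = 0.0
    for x in itr
        n += 1
        s += x
    end
    n > 0 || throw(ArgumentError("need at least one observation"))
    m = s / n

    # Second pass: accumulate the second and third central moments.
    m2 = 0.0
    m3 = 0.0
    for x in itr
        d = x - m
        m2 += d^2
        m3 += d^3
    end
    return (m3 / n) / (m2 / n)^1.5
end

v = Union{Missing, Float64}[1.0, missing, 2.0, 3.0]   # made-up data
println(iter_skewness(skipmissing(v)))                # no collect needed
println(iter_skewness(x for x in (1, 2, 3, 10)))      # generators work too
```

Because the argument is untyped, the function works with skipmissing(v) directly; the trade-off is that a one-shot iterator that cannot be restarted would not survive the second pass.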