skipmissing not working in cor function

Hi everyone,

I am trying to compute a correlation in Julia, but one row of my data has missing values. (My code is below.)

When I wrap the call in skipmissing, the result is skipmissing(missing); without it, the result is just missing.

I tried the correlation function with a complete DataFrame and everything works properly.

Any help would be appreciated.

using DataFrames, Pkg, CSV, Gadfly, HypothesisTests, Statistics, Missings

df = CSV.read("/Users/home/Documents/MP blog 2021/Data/Eliminatorias/CONMEBOL/Conmebol_partidos_2022.csv", DataFrame, normalizenames=true)


skipmissing(cor(df.Remates_arco, df.Pases))


You are applying the skipmissing too late:

You are applying cor first, which returns missing because your columns contain missing values. Then you apply skipmissing, but by that point all you have is a single missing value. Hence what you are actually computing is skipmissing(missing).
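To make the order of operations concrete, here is a minimal sketch with made-up vectors (x and y are illustrative, not the original data). It shows that cor already returns a single missing before any outer wrapper gets a chance to act:

```julia
using Statistics

# Made-up data standing in for the two DataFrame columns.
x = [1.0, 2.0, missing, 4.0]
y = [2.0, 4.0, 6.0, 9.0]

# A single missing entry propagates through the whole computation,
# so the result is one `missing` value rather than a number.
r = cor(x, y)
println(ismissing(r))
```

Wrapping r in skipmissing afterwards cannot recover a number, because the missing entries were never removed from the inputs.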

If both columns have their missing values in the same rows, I suppose you could do:

cor( skipmissing(df.Remates_arco), skipmissing(df.Pases) )

I am not sure if that works because I don’t use DataFrames a lot, but looking at its documentation, you could also do:

newdf = dropmissing(df)
cor(newdf.Remates_arco, newdf.Pases)
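One caveat: dropmissing(df) removes every row that has a missing value in any column, which may discard more rows than necessary. A possible variant is to pass the columns of interest as the second argument. The sketch below uses a small made-up table (the values and the Otra_columna column are hypothetical, not from the original CSV):

```julia
using DataFrames, Statistics

# Hypothetical stand-in for the CSV data; all values are made up.
df = DataFrame(
    Remates_arco = [3, 5, missing, 4, 6],
    Pases        = [310, 420, 380, 350, 450],
    Otra_columna = [missing, 1, 2, 3, 4],  # missing in an unrelated column
)

# Drop only the rows where one of the two columns of interest is missing.
newdf = dropmissing(df, [:Remates_arco, :Pases])

# Rows stay aligned because whole rows are removed, not individual entries.
r = cor(newdf.Remates_arco, newdf.Pases)
println(nrow(newdf), " rows used, cor = ", r)
```

Because entire rows are dropped, the two columns keep the same length and each remaining value is still paired with its original partner.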

First, you are applying skipmissing too late.

But even if you were to correct it, cor wouldn’t work. This is a long-standing annoyance.

Calling skipmissing on each column separately is not a good idea either, since the observations are not guaranteed to stay matched.
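A small illustration of the matching problem, using made-up vectors whose missing values sit in different positions:

```julia
# Made-up data: the missing values are in different positions.
x = [1.0, missing, 3.0, 4.0]
y = [10.0, 20.0, missing, 40.0]

# Skipping missings independently shifts the remaining values.
sx = collect(skipmissing(x))  # [1.0, 3.0, 4.0]
sy = collect(skipmissing(y))  # [10.0, 20.0, 40.0]

# Position 2 now pairs 3.0 with 20.0, which were never observed together.
mismatched = collect(zip(sx, sy))

# The complete observations are only those where both values are present.
complete = [(a, b) for (a, b) in zip(x, y) if !ismissing(a) && !ismissing(b)]
println(mismatched)
println(complete)
```

Here the two filtered vectors happen to have the same length, so nothing would flag the problem, yet the middle pair combines values from different observations.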

We don’t have a good solution for this at the moment. Missings.jl (which is re-exported by DataFrames) provides skipmissings.

julia> using Missings, Statistics

julia> x = [rand() < .2 ? missing : rand() for i in 1:10];

julia> y = [rand() < .2 ? missing : rand() for i in 1:10];

julia> sx, sy = collect.(skipmissings(x, y));

julia> cor(sx, sy)
-0.32257867573052007

But skipmissings is not guaranteed to exist in the future. It’s deliberately documented as such even though Missings.jl is past 1.0.


You can do it without the Missings package:

pos = (.!ismissing.(x)) .& (.!ismissing.(y))
cor(x[pos], y[pos])

It would be more difficult to create a replacement for skipmissings if we had multiple variables instead of just two.
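For more than two variables, one possible approach is to build a single mask across all vectors. The helper below, complete_cases, is a hypothetical name introduced only for this sketch (it is not part of Base, Statistics, or Missings), and the data are made up:

```julia
using Statistics

# Hypothetical helper: keep only the positions where no vector has a missing value.
function complete_cases(vectors::AbstractVector...)
    n = length(first(vectors))
    all(v -> length(v) == n, vectors) ||
        throw(DimensionMismatch("all vectors must have the same length"))
    keep = trues(n)
    for v in vectors
        keep .&= .!ismissing.(v)  # clear positions where this vector is missing
    end
    return map(v -> v[keep], vectors)  # one filtered vector per input
end

# Made-up data with missing values in different positions.
x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 4.0, missing, 8.0, 10.0]
z = [1.0, 0.0, 2.0, missing, 3.0]

cx, cy, cz = complete_cases(x, y, z)
println(cor(cx, cy))
```

The same mask is applied to every vector, so the filtered vectors stay aligned observation by observation.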


This one suffers from the same problem mentioned by @pdeffebach: it doesn’t guarantee that the observations will match.

My apologies. As far as I can see your method would work. I misread it.