Skipmissing not working in cor function

jsaraviadrago · November 10, 2021, 11:01pm

Hi everyone,

I am trying to compute a correlation in Julia, but the data contain one line of missing values. (My code is below.)

When I use skipmissing, the result is skipmissing(missing); if I leave it out, the result is missing.

I tried the correlation function with a complete DataFrame and everything works properly.

Any help would be appreciated.

using DataFrames, Pkg, CSV, Gadfly, HypothesisTests, Statistics, Missings

df = CSV.read("/Users/home/Documents/MP blog 2021/Data/Eliminatorias/CONMEBOL/Conmebol_partidos_2022.csv", DataFrame, normalizenames=true)


skipmissing(cor(df.Remates_arco, df.Pases))


aramirezreyes · November 10, 2021, 11:13pm

You are applying the skipmissing too late:

You are first applying cor, which returns missing because your columns contain missing values. Then you are applying skipmissing, but the only thing you have at that point is a single missing value. Hence what you are actually computing is skipmissing(missing).
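A minimal sketch of that order of operations, using made-up vectors rather than the columns from the CSV (the names x, y, r and s are only for illustration):

```julia
using Statistics

# Made-up data: one entry of x is missing.
x = [1.0, missing, 3.0, 4.0]
y = [2.0, 1.0, 5.0, 7.0]

# The missing entry propagates through the computation, so r is `missing`.
r = cor(x, y)

# Wrapping the result afterwards only wraps that single `missing` value;
# it cannot recover a correlation from it.
s = skipmissing(r)
```

The missing values have to be removed from the inputs before cor is called, not from its result.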

If both columns have the same missing values, I suppose you could do:

cor( skipmissing(df.Remates_arco), skipmissing(df.Pases) )

I am not sure if that works because I don’t use DataFrames a lot, but looking at its documentation, you could also do:

newdf = dropmissing(df)
cor(newdf.Remates_arco, newdf.Pases)
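As a hedged sketch of the dropmissing route: the toy DataFrame below stands in for the CSV, its values are made up, and the Otro column is a hypothetical extra column added only to show the difference. dropmissing(df) drops a row if any column is missing, while passing the two column names restricts the check to the columns being correlated.

```julia
using DataFrames, Statistics

# Toy data standing in for the CSV; all values are made up for illustration.
df = DataFrame(Remates_arco = [3, missing, 5, 4, 6],
               Pases        = [300, 410, missing, 380, 450],
               Otro         = [missing, 1, 2, 3, 4])  # hypothetical extra column

# Drop rows only where one of the two columns of interest is missing.
# Whole rows are removed, so the remaining observations stay matched.
newdf = dropmissing(df, [:Remates_arco, :Pases])

r = cor(newdf.Remates_arco, newdf.Pases)
```

Here rows 2 and 3 are dropped, while row 1 is kept even though Otro is missing there.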

pdeffebach · November 11, 2021, 12:24am

First, you are applying skipmissing too late.

But even if you were to correct it, cor wouldn’t work. This is a long-standing annoyance.

Calling skipmissing on each column separately, as suggested above, is not a good idea either, since the observations are not guaranteed to be matched.
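To illustrate the matching problem, here is a small sketch with made-up vectors (x, y, sx, sy and keep are names chosen for this example). Each vector has its missing value in a different position, so skipping them independently shifts the remaining values against each other:

```julia
# Made-up data: the missing values sit in different positions.
x = [1.0, missing, 3.0, 4.0]
y = [1.0, 2.0, missing, 4.0]

# Skipping each vector on its own keeps three values from each...
sx = collect(skipmissing(x))   # values from positions 1, 3, 4
sy = collect(skipmissing(y))   # values from positions 1, 2, 4

# ...so the second pair would be (3.0, 2.0), which was never observed together.

# Only positions 1 and 4 are complete in both vectors.
keep = .!ismissing.(x) .& .!ismissing.(y)
```

If the two vectors had different numbers of missing values, the skipped versions would not even have the same length.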

We don’t have a good solution for this at the moment. Missings.jl (which is re-exported by DataFrames) provides skipmissings.

julia> using Missings, Statistics

julia> x = [rand() < .2 ? missing : rand() for i in 1:10];

julia> y = [rand() < .2 ? missing : rand() for i in 1:10];

julia> sx, sy = collect.(skipmissings(x, y));

julia> cor(sx, sy)
-0.32257867573052007

But skipmissings is not guaranteed to exist in the future. It’s deliberately documented as such even though Missings.jl is past 1.0.

Juan · November 11, 2021, 1:54am

You can do it without the Missings package:

pos = (.!ismissing.(x)) .& (.!ismissing.(y))
cor(x[pos], y[pos])

It would be more difficult to create a replacement for skipmissings if we had multiple variables instead of just two.
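For what it is worth, one possible sketch of the same mask idea for several vectors is below. The helper name complete_cases is hypothetical (it is not part of Base, Statistics or Missings), and it assumes all vectors have the same length:

```julia
using Statistics

# Hypothetical helper: keep only the positions where every vector is non-missing.
# Assumes all vectors have the same length.
function complete_cases(vs::AbstractVector...)
    keep = trues(length(first(vs)))
    for v in vs
        keep .&= .!ismissing.(v)   # a position survives only if present in every vector
    end
    return map(v -> v[keep], vs)   # one filtered vector per input, still aligned
end

# Made-up data with missing values in different positions.
x = [1.0, missing, 3.0, 4.0, 5.0]
y = [2.0, 1.0, missing, 8.0, 10.0]
z = [1.0, 2.0, 3.0, missing, 5.0]

cx, cy, cz = complete_cases(x, y, z)
r = cor(cx, cy)
```

Only positions 1 and 5 are complete in all three vectors, so each filtered vector has two entries.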

aramirezreyes · November 11, 2021, 2:00am

This one suffers from the same problem mentioned by @pdeffebach (it doesn’t guarantee that the observations will match).

aramirezreyes · November 11, 2021, 4:57pm

My apologies. As far as I can see your method would work. I misread it.
