First impression of DataFrames.jl

Hi,

I have been trying out Julia these days, and so far so good. I am really excited about multiple dispatch and type annotations. I have a few questions regarding DataFrames.

Currently I am using the prostate cancer dataset from here. It is a 97x11 dataset. In pandas, you can call the corr method on the dataframe itself to generate a correlation matrix of all columns. Is there a similar method that I can call to get that on the Julia side?

Thank you!

Hi! Thank you for using DataFrames.jl. What type and shape of output would you expect? If you want a Matrix then just do:

using Statistics
cor(Matrix(your_data_frame))

but calculating correlations can get quite involved, so depending on the details the specific answer might be different (in particular: do you have missing values, and how do you want to handle them?).

EDIT: what type of correlation do you want to calculate (I assumed Pearson correlation coefficient)?
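To make the missing-values question concrete, here is a minimal sketch of one possible strategy, listwise deletion: drop every row that contains a missing value, then compute the Pearson correlation matrix on what is left. The data frame and its column names (a, b, c) are made up for illustration, and the sketch assumes a recent DataFrames.jl release; other strategies, such as pairwise deletion, would give different numbers.

```julia
using DataFrames, Statistics

# Hypothetical data: column b contains one missing value
df = DataFrame(a = [1.0, 2.0, 3.0, 4.0],
               b = [2.0, missing, 6.0, 8.5],
               c = [4.0, 3.0, 2.5, 1.0])

# Listwise deletion: keep only the rows without any missing value
complete = dropmissing(df)

# Pearson correlation matrix of the remaining rows (columns in df order)
C = cor(Matrix(complete))
```

Here C[i, j] is the correlation between the i-th and j-th columns of the data frame, computed on the complete rows only.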

DataConvenience.jl provides a dfcor function.

dfcor(df) should work.

Hi! Thank you for the reply. I am looking for something like this:

And you are right, for now I am thinking of Pearson correlation.

You can get it like this:

julia> using DataFrames, NamedArrays, Statistics

julia> df = DataFrame(x1=rand(10), x2=rand(10), x3=rand(10), x4=rand(10))
10×4 DataFrame
│ Row │ x1       │ x2       │ x3        │ x4         │
│     │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.158358 │ 0.942125 │ 0.713538  │ 0.630956   │
│ 2   │ 0.374393 │ 0.813797 │ 0.504182  │ 0.947029   │
│ 3   │ 0.520227 │ 0.542249 │ 0.833646  │ 0.609631   │
│ 4   │ 0.928563 │ 0.402397 │ 0.444998  │ 0.232351   │
│ 5   │ 0.230898 │ 0.824582 │ 0.199968  │ 0.00203982 │
│ 6   │ 0.197203 │ 0.84624  │ 0.408122  │ 0.636816   │
│ 7   │ 0.168241 │ 0.281407 │ 0.665497  │ 0.949534   │
│ 8   │ 0.494666 │ 0.39342  │ 0.236596  │ 0.522137   │
│ 9   │ 0.431282 │ 0.425107 │ 0.0946223 │ 0.584611   │
│ 10  │ 0.639034 │ 0.256714 │ 0.4461    │ 0.298986   │

julia> struct NoPrint end; Base.show(::IO, ::NoPrint) = nothing

julia> NamedArray([i > j ? cor(df[!, i], df[!, j]) : NoPrint() for i in 2:ncol(df), j in 1:ncol(df)-1],
                  (names(df)[2:end], names(df)[1:end-1]))
3×3 Named Array{Any,2}
A ╲ B │        x1         x2         x3
──────┼────────────────────────────────
x2    │ -0.555227
x3    │ -0.102989  0.0988691
x4    │ -0.407655  0.0381506   0.444643

Note again that this assumes you do not need to handle missing values (there are several strategies that could be used for that).
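If the full (not just lower-triangular) matrix is enough, a labeled result closer to what pandas' corr returns can also be built without NamedArrays. The following is only a sketch under a few assumptions: it targets DataFrames.jl 1.x, where DataFrame(matrix, names) accepts a vector of strings, the example values are made up, and the column name variable is an arbitrary choice.

```julia
using DataFrames, Statistics

# Hypothetical data without missing values
df = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0],
               x2 = [2.0, 1.0, 4.0, 3.0],
               x3 = [4.0, 3.0, 2.0, 1.0])

# Pearson correlation matrix; rows and columns follow the column order of df
C = cor(Matrix(df))

# Wrap the matrix in a DataFrame that reuses the original column names
cordf = DataFrame(C, names(df))

# Add a leading column that labels each row with its variable name
insertcols!(cordf, 1, :variable => names(df))
```

Each row of cordf then holds the correlations of one variable against all others, so cordf.x1 is the column of correlations with x1.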
