First impression of DataFrames.jl

Hi,

Trying to step into Julia these days; so far so good. Really excited about multiple dispatch and type annotations. Got a few questions regarding DataFrames.jl.

Currently I am using the prostate cancer dataset from here. It is a 97x11 dataset. In pandas, you can call the corr method on the dataframe itself to generate a correlation matrix of all columns. Is there a similar method that I can call to get that on the Julia side?

Thank you!

Hi! Thank you for using DataFrames.jl. What type and shape of output would you expect? If you want a Matrix, then just do:

using Statistics
cor(Matrix(your_data_frame))

However, calculating correlations can get quite involved, so the right answer depends on the details you need (in particular: do you have missing values, and how do you want to handle them?).
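To make the missing-values question concrete, here is a rough sketch of two common strategies, listwise and pairwise deletion. It assumes a recent DataFrames.jl release; the toy data and the helper name `pairwise_cor` are made up for illustration and are not part of any package.

```julia
using DataFrames, Statistics

# Toy data (hypothetical) with one missing value in column a and one in column c.
df = DataFrame(a = [1.0, 2.0, missing, 4.0, 5.0],
               b = [2.0, 4.0, 6.0, 8.0, 10.0],
               c = [5.0, 3.0, 4.0, missing, 1.0])

# Listwise deletion: drop every row that contains any missing value,
# then compute Pearson correlations on the complete rows only.
complete = dropmissing(df)
C_listwise = cor(Matrix(complete))

# Pairwise deletion: for each pair of columns, keep the rows where
# both values are present. `pairwise_cor` is an illustrative helper.
function pairwise_cor(x, y)
    keep = .!(ismissing.(x) .| ismissing.(y))
    return cor(Float64.(x[keep]), Float64.(y[keep]))
end

C_pairwise = [pairwise_cor(df[!, i], df[!, j]) for i in 1:ncol(df), j in 1:ncol(df)]
```

Listwise deletion uses the same rows for every entry of the matrix, while pairwise deletion uses as many rows as each pair allows, so the two results can differ whenever the missing values are spread over different rows.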

EDIT: what type of correlation do you want to calculate (I assumed the Pearson correlation coefficient)?
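If a rank-based measure is wanted instead, a minimal sketch using `corspearman` from StatsBase.jl could look like the following. StatsBase.jl is an extra package not mentioned in this thread, so treat its use here as an assumption; the data are invented for illustration.

```julia
using DataFrames, Statistics, StatsBase

# Hypothetical data: y is a monotone but non-linear function of x,
# and z decreases linearly as x increases.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0],
               y = [1.0, 8.0, 27.0, 64.0, 125.0],
               z = [5.0, 4.0, 3.0, 2.0, 1.0])

M = Matrix(df)
pearson  = cor(M)           # Pearson: linear association between columns
spearman = corspearman(M)   # Spearman: Pearson correlation of the column ranks
```

Here the Spearman coefficient between `x` and `y` is 1 because the relationship is perfectly monotone, while the Pearson coefficient is below 1 because it is not linear.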


In DataConvenience.jl there is dfcor

dfcor(df) should work.


Hi! Thank you for the reply. I am looking for something like this:

And you are right: for now I am thinking of the Pearson correlation.

You can get it like this:

julia> using DataFrames, NamedArrays, Statistics

julia> df = DataFrame(x1=rand(10), x2=rand(10), x3=rand(10), x4=rand(10))
10×4 DataFrame
│ Row │ x1       │ x2       │ x3        │ x4         │
│     │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.158358 │ 0.942125 │ 0.713538  │ 0.630956   │
│ 2   │ 0.374393 │ 0.813797 │ 0.504182  │ 0.947029   │
│ 3   │ 0.520227 │ 0.542249 │ 0.833646  │ 0.609631   │
│ 4   │ 0.928563 │ 0.402397 │ 0.444998  │ 0.232351   │
│ 5   │ 0.230898 │ 0.824582 │ 0.199968  │ 0.00203982 │
│ 6   │ 0.197203 │ 0.84624  │ 0.408122  │ 0.636816   │
│ 7   │ 0.168241 │ 0.281407 │ 0.665497  │ 0.949534   │
│ 8   │ 0.494666 │ 0.39342  │ 0.236596  │ 0.522137   │
│ 9   │ 0.431282 │ 0.425107 │ 0.0946223 │ 0.584611   │
│ 10  │ 0.639034 │ 0.256714 │ 0.4461    │ 0.298986   │

julia> struct NoPrint end; Base.show(::IO, ::NoPrint) = nothing

julia> NamedArray([i > j ? cor(df[!, i], df[!, j]) : NoPrint() for i in 2:ncol(df), j in 1:ncol(df)-1],
                  (names(df)[2:end], names(df)[1:end-1]))
3×3 Named Array{Any,2}
A ╲ B │        x1         x2         x3
──────┼────────────────────────────────
x2    │ -0.555227                      
x3    │ -0.102989  0.0988691           
x4    │ -0.407655  0.0381506   0.444643

Note again that this assumes you do not need to handle missing values, since several strategies could be used for that.
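If the full labelled matrix is preferred, closer to what the pandas `corr` method returns, one possible sketch is to wrap the result of `cor` back into a DataFrame. This assumes a DataFrames.jl 1.x release, where `names(df)` returns strings and `DataFrame(matrix, names)` is available; the data and the column name `variable` are made up for illustration.

```julia
using DataFrames, Statistics

# Hypothetical data with no missing values.
df = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0, 5.0],
               x2 = [2.0, 4.1, 5.9, 8.2, 9.9],
               x3 = [5.0, 3.0, 4.0, 1.0, 2.0])

# Pearson correlation between all pairs of columns.
C = cor(Matrix(df))

# Wrap the matrix in a DataFrame that reuses the original column names,
# and add a leading column holding the row labels.
corr_df = DataFrame(C, names(df))
insertcols!(corr_df, 1, :variable => names(df))
```

The result has one row and one column per original variable, plus the `variable` column that plays the role of the pandas index.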
