First impression of DataFrames.jl

digitalpig · November 7, 2020, 8:45pm

Hi,

I have been trying out Julia these days, and so far so good. I am really excited about multiple dispatch and type annotations. I have a few questions regarding DataFrames.

Currently I am using the prostate cancer dataset from here. It is a 97x11 dataset. In pandas, you can call the corr method on the dataframe itself to generate a correlation matrix of all columns. Is there a similar method that I can call to get that on the Julia side?

Thank you!

bkamins · November 7, 2020, 9:15pm

Hi! Thank you for using DataFrames.jl. What type and shape of output would you expect? If you want a Matrix, then just do:

using Statistics
cor(Matrix(your_data_frame))

However, calculating correlations can be quite involved, so the right answer depends on the details. In particular: do you have missing values, and how do you want to handle them?

EDIT: what type of correlation do you want to calculate (I assumed Pearson correlation coefficient)?
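
To make the missing-values question concrete, here is a minimal sketch of one possible strategy, listwise deletion (dropping every row that has a missing value in any column) before computing Pearson correlations. The toy data and the names `df`, `complete`, and `C` are made up for illustration, and listwise deletion is only one of several reasonable choices.

```julia
using DataFrames, Statistics

# Hypothetical toy data; column `b` contains one missing value.
df = DataFrame(a = [1.0, 2.0, 3.0, 4.0],
               b = [2.0, missing, 6.0, 8.0],
               c = [4.0, 3.0, 2.0, 1.0])

# Listwise deletion: keep only the rows with no missing values at all.
# By default dropmissing also narrows the column element types to Float64.
complete = dropmissing(df)

# Pearson correlation matrix of the remaining rows (columns are variables).
C = cor(Matrix(complete))
```

Calling `cor(Matrix(df))` directly on the data with the missing value would propagate `missing` into the result, which is why the rows are dropped first.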

xiaodai · November 8, 2020, 12:41am

In DataConvenience.jl there is dfcor

dfcor(df) should work.

digitalpig · November 8, 2020, 12:51am

Hi! Thank you for the reply. I am looking for something like this:

And you are right, for now I am thinking of Pearson correlation.

bkamins · November 8, 2020, 7:22am

You can get it like this:

julia> using DataFrames, NamedArrays, Statistics

julia> df = DataFrame(x1=rand(10), x2=rand(10), x3=rand(10), x4=rand(10))
10×4 DataFrame
│ Row │ x1       │ x2       │ x3        │ x4         │
│     │ Float64  │ Float64  │ Float64   │ Float64    │
├─────┼──────────┼──────────┼───────────┼────────────┤
│ 1   │ 0.158358 │ 0.942125 │ 0.713538  │ 0.630956   │
│ 2   │ 0.374393 │ 0.813797 │ 0.504182  │ 0.947029   │
│ 3   │ 0.520227 │ 0.542249 │ 0.833646  │ 0.609631   │
│ 4   │ 0.928563 │ 0.402397 │ 0.444998  │ 0.232351   │
│ 5   │ 0.230898 │ 0.824582 │ 0.199968  │ 0.00203982 │
│ 6   │ 0.197203 │ 0.84624  │ 0.408122  │ 0.636816   │
│ 7   │ 0.168241 │ 0.281407 │ 0.665497  │ 0.949534   │
│ 8   │ 0.494666 │ 0.39342  │ 0.236596  │ 0.522137   │
│ 9   │ 0.431282 │ 0.425107 │ 0.0946223 │ 0.584611   │
│ 10  │ 0.639034 │ 0.256714 │ 0.4461    │ 0.298986   │

julia> struct NoPrint end; Base.show(::IO, ::NoPrint) = nothing

julia> NamedArray([i > j ? cor(df[!, i], df[!, j]) : NoPrint() for i in 2:ncol(df), j in 1:ncol(df)-1],
                  (names(df)[2:end], names(df)[1:end-1]))
3×3 Named Array{Any,2}
A ╲ B │        x1         x2         x3
──────┼────────────────────────────────
x2    │ -0.555227                      
x3    │ -0.102989  0.0988691           
x4    │ -0.407655  0.0381506   0.444643

Note again that this assumes you do not need to handle missing values (there are several strategies that could be used here).
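
If an extra dependency such as NamedArrays is not wanted, a labelled result can also be sketched with DataFrames.jl alone. This is an assumed alternative rather than the approach shown above: it returns the full symmetric matrix instead of the lower triangle, and the toy data and the names `cordf` and `variable` are hypothetical.

```julia
using DataFrames, Statistics

# Hypothetical toy data with no missing values.
df = DataFrame(x1 = [1.0, 2.0, 3.0, 4.0],
               x2 = [2.0, 1.0, 4.0, 3.0],
               x3 = [4.0, 3.0, 2.0, 1.0])

# Full Pearson correlation matrix of all columns.
C = cor(Matrix(df))

# Wrap the matrix in a DataFrame that reuses the original column names,
# then add a leading column holding the row labels.
cols = names(df)
cordf = DataFrame(C, Symbol.(cols))
insertcols!(cordf, 1, :variable => cols)
```

Each row of `cordf` then corresponds to the variable named in its `variable` column, so `cordf[2, :x1]` is the correlation between `x2` and `x1`.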
