Get Correlation matrix from dataframe with row and column titles

I have a dataframe and I want to get the correlation matrix with row/column titles so I know what I'm looking at. If I do this:

cormat = cor(Matrix(df))

There are no row/column titles. How do I get a correlation matrix / dataframe with row and column titles?


You can use NamedArrays.jl for labeled axes.

julia> na = NamedArray(Matrix(df))
5×3 Named Matrix{Float64}
A ╲ B │         1          2          3
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> setnames!(na, names(df), 2)
(OrderedCollections.OrderedDict{Any, Int64}("1" => 1, "2" => 2, "3" => 3, "4" => 4, "5" => 5), OrderedCollections.OrderedDict{Any, Int64}("x" => 1, "y" => 2, "z" => 3))

julia> na
5×3 Named Matrix{Float64}
A ╲ B │         x          y          z
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> cor(na)
3×3 Named Matrix{Float64}
B ╲ B │         x          y          z
──────┼────────────────────────────────
x     │       1.0   0.406237  -0.783476
y     │  0.406237        1.0  -0.588138
z     │ -0.783476  -0.588138        1.0

Would something like this help?

using DataFrames, Statistics
cols = [:x, :y, :z]
df = DataFrame(rand(5,3), cols)

dfcor = [cols DataFrame(cor(Matrix(df)), cols)]


 Row │ x1      x           y           z
     │ Symbol  Float64     Float64     Float64
─────┼──────────────────────────────────────────
   1 │ x        1.0        -0.0635502  0.252657
   2 │ y       -0.0635502   1.0        0.370023
   3 │ z        0.252657    0.370023   1.0
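
As a variation on the approach above, here is a minimal sketch that gives the label column an explicit name instead of the auto-generated `x1`. The column name `:variable` and the sample data are arbitrary choices for illustration, and the sketch assumes every column of `df` is numeric.

```julia
using DataFrames, Statistics

# Small deterministic example: y is a multiple of x, z decreases as x increases.
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 4.0, 6.0, 8.0],
               z = [4.0, 3.0, 2.0, 1.0])

# Build the correlation table, reusing the original column names.
dfcor = DataFrame(cor(Matrix(df)), names(df))

# Prepend a label column; ":variable" is an arbitrary name chosen here.
insertcols!(dfcor, 1, :variable => names(df))
```

Each row of `dfcor` then starts with the name of the variable it describes, so a row can be looked up with `dfcor[dfcor.variable .== "x", :]`.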

The code below is not super efficient, but I often use it:

NamedArray(cor(Matrix(df)), (names(df), names(df)))
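
If you use this one-liner often, it could be wrapped in a small helper. The sketch below is only an illustration: `labeled_cor` is a hypothetical name (not part of NamedArrays.jl or DataFrames.jl), the dimension names `"row"` and `"col"` are arbitrary, and it assumes all columns of `df` are numeric.

```julia
using DataFrames, NamedArrays, Statistics

# Hypothetical helper: correlation matrix with the column names on both axes.
function labeled_cor(df::AbstractDataFrame)
    labels = names(df)                      # Vector{String} of column names
    return NamedArray(cor(Matrix(df)), (labels, labels), ("row", "col"))
end

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 4.0, 6.0, 8.0],
               z = [4.0, 3.0, 2.0, 1.0])

C = labeled_cor(df)
C["x", "z"]   # look up a single correlation by column name
```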

I personally prefer AxisArrays.jl:

julia> C = AxisArray(rand(3,3), row=[:a,:b,:c], col=[:a,:b,:c])
2-dimensional AxisArray{Float64,2,...} with axes:
    :row, [:a, :b, :c]
    :col, [:a, :b, :c]
And data, a 3×3 Matrix{Float64}:
 0.0344525  0.210511  0.365664
 0.270243   0.319163  0.113199
 0.960284   0.832544  0.105832

julia> C[:a,:b]
0.21051123926125237

julia> C[1,2]
0.21051123926125237
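
Applied to the original question, the same idea might look like the sketch below. It is only a sketch: the axis names `:row` and `:col` and the sample data are arbitrary, and it assumes all columns of `df` are numeric.

```julia
using AxisArrays, DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 4.0, 6.0, 8.0],
               z = [4.0, 3.0, 2.0, 1.0])

# Use the column names (as Symbols) as labels on both axes.
labels = Symbol.(names(df))
C = AxisArray(cor(Matrix(df)), Axis{:row}(labels), Axis{:col}(labels))

C[:x, :z]   # look up by label
C[1, 3]     # or by position
```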

This is something that should honestly be made easier. Perhaps a DataFrameStats.jl convenience package where operations like this are already implemented behind the scenes. I always run into this pain.


The printing could definitely be improved here though. The NamedArrays.jl version is easier to read because it lists the column and row names directly.


I would call it TableStats.jl and try to support any Tables.jl-compatible table, though. It would be a set of basic convenience functions that generalize cov, cor, mean, kurtosis, etc. from Statistics and StatsBase.jl to work with table types.
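
To make the idea concrete, here is a rough sketch of what one such function could look like. No such package is implied to exist: `tablecor` is a hypothetical name, and returning a NamedTuple of labels plus matrix is just one possible design. It relies only on `Tables.columns`, `Tables.columnnames`, and `Tables.matrix`, and assumes all columns are numeric.

```julia
using Tables, Statistics

# Hypothetical generic helper: works on any Tables.jl-compatible input.
function tablecor(tbl)
    cols = Tables.columns(tbl)
    labels = collect(Tables.columnnames(cols))   # column names as Symbols
    C = cor(Tables.matrix(tbl))                  # plain correlation matrix
    return (names = labels, cor = C)
end

# A NamedTuple of vectors is itself a valid Tables.jl column table.
tbl = (x = [1.0, 2.0, 3.0, 4.0],
       y = [2.0, 4.0, 6.0, 8.0],
       z = [4.0, 3.0, 2.0, 1.0])

res = tablecor(tbl)
res.names   # labels for the rows and columns of res.cor
```

The labels and matrix returned here could then be handed to whichever labeled container you prefer (a DataFrame, a NamedArray, or an AxisArray).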


Borrowing some ideas from a discussion on correlated :wink: issues:

julia> using DataFrames, Statistics

julia> cols = [:x, :y, :z]
3-element Vector{Symbol}:
 :x
 :y
 :z

julia> df = DataFrame(rand(5,3), cols)
5×3 DataFrame
 Row │ x         y         z
     │ Float64   Float64   Float64
─────┼──────────────────────────────
   1 ā”‚ 0.234054  0.613994  0.557298
   2 ā”‚ 0.318559  0.748663  0.322644
   3 ā”‚ 0.999855  0.167148  0.801001
   4 ā”‚ 0.998599  0.549592  0.291748
   5 ā”‚ 0.932561  0.72125   0.688496

julia> corr=(;zip(cols,Tables.rowtable(DataFrame(cor(Matrix(df)),cols)))...)
(x = (x = 1.0, y = -0.4998593209925881, z = 0.32929593442917343), y = (x = -0.4998593209925881, y = 1.0, z = -0.5560163100630737), z = (x = 0.32929593442917343, y = -0.5560163100630737, z = 1.0))

julia> corr[:x][:x]
1.0

julia> corr[:z][:x]==corr[:x][:z]
true

That works nicely. Thanks.

Hello, do you know whether there have been any developments on this since the original post? Are there any new packages that do this smoothly with DataFrames?
Thank you!