Get Correlation matrix from dataframe with row and column titles

I have a dataframe and I want to get the correlation matrix with row/column titles so I know what I'm looking at. If I do this:

cormat = cor(Matrix(df))

There are no row/column titles. How do I get a correlation matrix / dataframe with row and column titles?

You can use NamedArrays.jl for labeled axes.

julia> na = NamedArray(Matrix(df))
5×3 Named Matrix{Float64}
A ╲ B │         1          2          3
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> setnames!(na, names(df), 2)
(OrderedCollections.OrderedDict{Any, Int64}("1" => 1, "2" => 2, "3" => 3, "4" => 4, "5" => 5), OrderedCollections.OrderedDict{Any, Int64}("x" => 1, "y" => 2, "z" => 3))

julia> na
5×3 Named Matrix{Float64}
A ╲ B │         x          y          z
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> cor(na)
3×3 Named Matrix{Float64}
B ╲ B │         x          y          z
──────┼────────────────────────────────
x     │       1.0   0.406237  -0.783476
y     │  0.406237        1.0  -0.588138
z     │ -0.783476  -0.588138        1.0

Would something like this help?

using DataFrames, Statistics
cols = [:x, :y, :z]
df = DataFrame(rand(5,3), cols)

dfcor = [cols DataFrame(cor(Matrix(df)), cols)]


 Row │ x1      x           y           z
     │ Symbol  Float64     Float64     Float64
─────┼──────────────────────────────────────────
   1 │ x        1.0        -0.0635502  0.252657
   2 │ y       -0.0635502   1.0        0.370023
   3 │ z        0.252657    0.370023   1.0
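
If the auto-generated `x1` header is a nuisance, a possible variation is to give the label column an explicit name with `insertcols!`. This is only a sketch: the column name `:variable` and the small example data are my own choices, not part of the post above.

```julia
using DataFrames, Statistics

# Small deterministic example data (assumed for illustration).
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 4.0, 6.0, 8.0],
               z = [4.0, 3.0, 2.0, 1.0])

# Correlation matrix as a data frame that reuses the original column names.
dfcor = DataFrame(cor(Matrix(df)), names(df))

# Prepend a label column; `:variable` is an arbitrary name chosen here and
# would clash if `df` already had a column called "variable".
insertcols!(dfcor, 1, :variable => names(df))

# Look up one correlation by row label and column name (one-element vector).
xz = dfcor[dfcor.variable .== "x", :z]
```

Rows are then selected by comparing against the `variable` column, which is more verbose than a labeled array but keeps everything in a data frame.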

The code below is not super efficient, but I often use it:

NamedArray(cor(Matrix(df)), (names(df), names(df)))
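
If you use this a lot, the one-liner can be wrapped in a small helper. A minimal sketch, assuming the function name `labeled_cor` (hypothetical, not part of NamedArrays.jl or DataFrames.jl) and some made-up data:

```julia
using DataFrames, NamedArrays, Statistics

# Hypothetical helper: correlation matrix of a data frame's columns,
# labeled on both axes with the column names (as strings).
function labeled_cor(df::AbstractDataFrame)
    labels = names(df)
    return NamedArray(cor(Matrix(df)), (labels, labels))
end

# Small deterministic example data (assumed for illustration).
df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 4.0, 6.0, 8.0],
               z = [4.0, 3.0, 2.0, 1.0])

C = labeled_cor(df)

# Entries can be looked up by column name instead of by position.
cxy = C["x", "y"]
cxz = C["x", "z"]
```

It still copies the data into a `Matrix` first, so it inherits the same inefficiency; it only saves typing.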

I personally prefer AxisArrays.jl:

julia> C = AxisArray(rand(3,3), row=[:a,:b,:c], col=[:a,:b,:c])
2-dimensional AxisArray{Float64,2,...} with axes:
    :row, [:a, :b, :c]
    :col, [:a, :b, :c]
And data, a 3×3 Matrix{Float64}:
 0.0344525  0.210511  0.365664
 0.270243   0.319163  0.113199
 0.960284   0.832544  0.105832

julia> C[:a,:b]
0.21051123926125237

julia> C[1,2]
0.21051123926125237

This is something that should honestly be made easier. Perhaps a DataFrameStats.jl convenience package could implement operations like this behind the scenes. I run into this pain point all the time.

The printing could definitely be improved here though. The NamedArrays.jl version is easier to read because it lists the column and row names directly.

I would make it TableStats.jl, though, and would try to support any Tables.jl-compatible table: a set of basic convenience functions that generalize cov, cor, mean, kurtosis, etc. from Statistics and StatsBase.jl to work with table types.
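
To make the idea concrete, here is a rough sketch of what one such function might look like. No such package exists as far as this thread is concerned; `table_cor` and the `variable` column name are hypothetical, and only `Tables.columns`, `Tables.columnnames`, and `Tables.matrix` from Tables.jl are assumed.

```julia
using Statistics, Tables

# Hypothetical function: correlation between the columns of any
# Tables.jl-compatible table, returned as a column table (a NamedTuple
# of vectors) with an extra `variable` column holding the row labels.
function table_cor(tbl)
    cols = Tables.columns(tbl)
    vars = collect(Symbol, Tables.columnnames(cols))
    C = cor(Tables.matrix(tbl))
    # One vector per original column, keyed by that column's name.
    corcols = NamedTuple{Tuple(vars)}(Tuple(C[:, j] for j in axes(C, 2)))
    return merge((variable = vars,), corcols)
end

# A NamedTuple of vectors is itself a valid Tables.jl table
# (example data assumed for illustration).
tbl = (x = [1.0, 2.0, 3.0, 4.0],
       y = [2.0, 4.0, 6.0, 8.0],
       z = [4.0, 3.0, 2.0, 1.0])

r = table_cor(tbl)
```

Because the result is again a Tables.jl table, it could be handed to `DataFrame(r)` or any other sink. The sketch would fail on a table that already has a column named `variable`.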

Taking some arguments from a discussion on correlated :wink: issues

julia> using DataFrames, Statistics

julia> cols = [:x, :y, :z]
3-element Vector{Symbol}:
 :x
 :y
 :z

julia> df = DataFrame(rand(5,3), cols)
5×3 DataFrame
 Row │ x         y         z
     │ Float64   Float64   Float64
─────┼──────────────────────────────
   1 │ 0.234054  0.613994  0.557298
   2 │ 0.318559  0.748663  0.322644
   3 │ 0.999855  0.167148  0.801001
   4 │ 0.998599  0.549592  0.291748
   5 │ 0.932561  0.72125   0.688496

julia> corr = (; zip(cols, Tables.rowtable(DataFrame(cor(Matrix(df)), cols)))...)
(x = (x = 1.0, y = -0.4998593209925881, z = 0.32929593442917343), y = (x = -0.4998593209925881, y = 1.0, z = -0.5560163100630737), z = (x = 0.32929593442917343, y = -0.5560163100630737, z = 1.0))

julia> corr[:x][:x]
1.0

julia> corr[:z][:x]==corr[:x][:z]
true

That works nicely. Thanks.