Get Correlation matrix from dataframe with row and column titles

DWSchulze · May 19, 2022, 7:04am

I have a DataFrame and I want its correlation matrix with row and column titles, so I can tell which variables each entry refers to. If I do this:

cormat = cor(Matrix(df))

the result has no row/column titles. How do I get a correlation matrix (or DataFrame) with row and column titles?

skleinbo · May 19, 2022, 7:24am

You can use NamedArrays.jl for labeled axes:

julia> using DataFrames, NamedArrays, Statistics

julia> na = NamedArray(Matrix(df))
5×3 Named Matrix{Float64}
A ╲ B │         1          2          3
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> setnames!(na, names(df), 2)
(OrderedCollections.OrderedDict{Any, Int64}("1" => 1, "2" => 2, "3" => 3, "4" => 4, "5" => 5), OrderedCollections.OrderedDict{Any, Int64}("x" => 1, "y" => 2, "z" => 3))

julia> na
5×3 Named Matrix{Float64}
A ╲ B │         x          y          z
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062

julia> cor(na)
3×3 Named Matrix{Float64}
B ╲ B │         x          y          z
──────┼────────────────────────────────
x     │       1.0   0.406237  -0.783476
y     │  0.406237        1.0  -0.588138
z     │ -0.783476  -0.588138        1.0

rafael.guerra · May 19, 2022, 8:50am

Would something like this help?

using DataFrames, Statistics
cols = [:x, :y, :z]
df = DataFrame(rand(5,3), cols)

dfcor = [cols DataFrame(cor(Matrix(df)), cols)]


 Row │ x1      x           y           z        
     │ Symbol  Float64     Float64     Float64
─────┼──────────────────────────────────────────
   1 │ x        1.0        -0.0635502  0.252657
   2 │ y       -0.0635502   1.0        0.370023
   3 │ z        0.252657    0.370023   1.0
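
The label column in the output above gets the auto-generated name `x1`. A minimal variation, assuming you would rather name that column yourself, is to build the correlation DataFrame first and then add the labels with `insertcols!`. The column name `:variable` is an arbitrary choice for this sketch, not something from the original reply.

```julia
using DataFrames, Statistics

cols = [:x, :y, :z]
df = DataFrame(rand(5, 3), cols)

# Build the correlation DataFrame, then add the labels as a named first column.
# `:variable` is an arbitrary name chosen for this sketch.
dfcor = DataFrame(cor(Matrix(df)), cols)
insertcols!(dfcor, 1, :variable => cols)

# Look up one entry by row label and column name.
r_yx = only(dfcor[dfcor.variable .== :y, :x])
```

Because the labels are an ordinary column, a lookup needs a filter on `variable` rather than a true row index.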

bkamins · May 19, 2022, 9:11am

The code below is not super efficient, but I often use it:

NamedArray(cor(Matrix(df)), (names(df), names(df)))
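
If you use this often, the one-liner can be wrapped in a small function. The sketch below is only an illustration: `namedcor` is a hypothetical name, not part of NamedArrays.jl or DataFrames.jl, and restricting to columns with `Real` element type is an added assumption so that non-numeric columns do not break `cor`.

```julia
using DataFrames, NamedArrays, Statistics

# Hypothetical helper, not part of NamedArrays.jl or DataFrames.jl.
function namedcor(df::AbstractDataFrame)
    # Names of the columns whose element type is a subtype of Real
    # (columns that allow `missing` are skipped by this filter).
    labels = names(df, Real)
    return NamedArray(cor(Matrix(df[!, labels])), (labels, labels))
end

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 1.0, 4.0, 3.0],
               label = ["a", "b", "c", "d"])

C = namedcor(df)   # 2×2 named matrix; the String column is left out
C["x", "y"]        # look up a single correlation by column name
```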

juliohm · May 19, 2022, 12:28pm

I personally prefer AxisArrays.jl:

julia> C = AxisArray(rand(3,3), row=[:a,:b,:c], col=[:a,:b,:c])
2-dimensional AxisArray{Float64,2,...} with axes:
    :row, [:a, :b, :c]
    :col, [:a, :b, :c]
And data, a 3×3 Matrix{Float64}:
 0.0344525  0.210511  0.365664
 0.270243   0.319163  0.113199
 0.960284   0.832544  0.105832

julia> C[:a,:b]
0.21051123926125237

julia> C[1,2]
0.21051123926125237
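
Applied to the original question, the same constructor can wrap the correlation matrix of a DataFrame. This is a sketch under the assumption that the column names are used as Symbol labels on both axes; the example data are made up for illustration.

```julia
using AxisArrays, DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0, 4.0],
               y = [2.0, 1.0, 4.0, 3.0],
               z = [4.0, 3.0, 2.0, 1.0])

labels = propertynames(df)   # column names as Symbols
C = AxisArray(cor(Matrix(df)), row = labels, col = labels)

C[:x, :z]   # lookup by label
C[1, 3]     # the same entry by position
```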

tbeason · May 19, 2022, 12:31pm

Honestly, this should be easier. Perhaps a DataFrameStats.jl convenience package could implement operations like this behind the scenes. I run into this pain all the time.

pdeffebach · May 19, 2022, 12:33pm

The printing could definitely be improved here though. The NamedArrays.jl version is easier to read because it lists the column and row names directly.

juliohm · May 19, 2022, 12:49pm

I would call it TableStats.jl instead and try to support any Tables.jl-compatible table: a set of basic convenience functions that generalize cov, cor, mean, kurtosis, etc. from Statistics and StatsBase.jl to work with table types.
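
To make the idea concrete, here is a rough sketch of what one such function might look like. No TableStats.jl package is implied to exist; `tablecor` is a hypothetical name, and returning a NamedTuple of columns with a `variable` label column is just one possible design. It relies only on `Tables.columns`, `Tables.columnnames`, and `Tables.matrix` from Tables.jl.

```julia
using Statistics
using Tables

# Hypothetical helper, not part of any package: correlation of the columns of
# any Tables.jl-compatible table, returned as a NamedTuple of columns.
function tablecor(table)
    cols = Tables.columns(table)
    colnames = collect(Tables.columnnames(cols))   # Vector{Symbol}
    C = cor(Tables.matrix(cols))
    corcols = NamedTuple{Tuple(colnames)}(Tuple(collect(col) for col in eachcol(C)))
    # The first column holds the row labels. A source column that is itself
    # named `variable` would overwrite it in this sketch.
    return merge((; variable = colnames), corcols)
end

tbl = (x = [1.0, 2.0, 3.0, 4.0],
       y = [2.0, 1.0, 4.0, 3.0],
       z = [4.0, 3.0, 2.0, 1.0])

c = tablecor(tbl)
c.y[1]   # correlation between x (row 1) and y
```

The result is itself a Tables.jl column table, so it can be passed to `DataFrame(c)` or to any other sink that accepts tables.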

rocco_sprmnt21 · May 19, 2022, 5:09pm

Borrowing some ideas from a discussion of related issues:

julia> using DataFrames, Statistics

julia> cols = [:x, :y, :z]
3-element Vector{Symbol}:
 :x
 :y
 :z

julia> df = DataFrame(rand(5,3), cols)
5×3 DataFrame
 Row │ x         y         z        
     │ Float64   Float64   Float64
─────┼──────────────────────────────
   1 │ 0.234054  0.613994  0.557298
   2 │ 0.318559  0.748663  0.322644
   3 │ 0.999855  0.167148  0.801001
   4 │ 0.998599  0.549592  0.291748
   5 │ 0.932561  0.72125   0.688496

julia> corr=(;zip(cols,Tables.rowtable(DataFrame(cor(Matrix(df)),cols)))...)
(x = (x = 1.0, y = -0.4998593209925881, z = 0.32929593442917343), y = (x = -0.4998593209925881, y = 1.0, z = -0.5560163100630737), z = (x = 0.32929593442917343, y = -0.5560163100630737, z = 1.0))

julia> corr[:x][:x]
1.0

julia> corr[:z][:x]==corr[:x][:z]
true

DWSchulze · May 19, 2022, 5:16pm

That works nicely. Thanks.

fdekerme · February 29, 2024, 2:21pm

Hello, do you know if there have been any developments on this since the original post? Are there any new packages that do this smoothly with DataFrames?
Thank you!
