I have a dataframe and I want to get the correlation matrix with row/column titles so I know what I'm looking at. If I do this:
cormat = cor(Matrix(df))
There are no row/column titles. How do I get a correlation matrix / dataframe with row and column titles?
You can use NamedArrays.jl for labeled axes.
julia> na = NamedArray(Matrix(df))
5×3 Named Matrix{Float64}
A ╲ B │         1          2          3
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062
julia> setnames!(na, names(df), 2)
(OrderedCollections.OrderedDict{Any, Int64}("1" => 1, "2" => 2, "3" => 3, "4" => 4, "5" => 5), OrderedCollections.OrderedDict{Any, Int64}("x" => 1, "y" => 2, "z" => 3))
julia> na
5×3 Named Matrix{Float64}
A ╲ B │         x          y          z
──────┼────────────────────────────────
1     │  0.887614   0.695876  0.0307668
2     │  0.133595   0.428006   0.400858
3     │  0.149537   0.868851   0.612551
4     │  0.179476   0.371094   0.976713
5     │  0.565508   0.990754  0.0458062
julia> cor(na)
3×3 Named Matrix{Float64}
B ╲ B │         x          y          z
──────┼────────────────────────────────
x     │       1.0   0.406237  -0.783476
y     │  0.406237        1.0  -0.588138
z     │ -0.783476  -0.588138        1.0
Would something like this help?
using DataFrames, Statistics
cols = [:x, :y, :z]
df = DataFrame(rand(5,3), cols)
dfcor = [cols DataFrame(cor(Matrix(df)), cols)]
 Row │ x1      x           y           z
     │ Symbol  Float64     Float64     Float64
─────┼──────────────────────────────────────────
   1 │ x        1.0        -0.0635502  0.252657
   2 │ y       -0.0635502   1.0        0.370023
   3 │ z        0.252657    0.370023   1.0
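In the result above, the label column ends up with the auto-generated name x1. A minimal sketch, assuming a reasonably recent DataFrames.jl, that builds the same table but names the label column explicitly with insertcols!. The column name `variable` is just an illustrative choice, not something from the original post:

```julia
using DataFrames, Statistics

cols = [:x, :y, :z]
df = DataFrame(rand(5, 3), cols)

# Correlation matrix as a DataFrame with the same column names as df
dfcor = DataFrame(cor(Matrix(df)), cols)

# Prepend a column of row labels; :variable is an arbitrary, illustrative name
insertcols!(dfcor, 1, :variable => cols)
```

Each row can then be filtered or looked up by its label, for example with `dfcor[dfcor.variable .== :x, :]`.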
The code below is not super efficient, but I often use it:
NamedArray(cor(Matrix(df)), (names(df), names(df)))
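If you use this one-liner often, it can be wrapped in a small function. This is only a sketch: `named_cor` is a hypothetical name of my own, not something NamedArrays.jl or DataFrames.jl provides, and it assumes every column of the dataframe is numeric.

```julia
using DataFrames, NamedArrays, Statistics

# Hypothetical helper (not part of NamedArrays.jl or DataFrames.jl):
# correlation matrix of the columns of df, labeled with the column names.
function named_cor(df::AbstractDataFrame)
    labels = names(df)   # Vector{String} of column names
    return NamedArray(cor(Matrix(df)), (labels, labels))
end

df = DataFrame(rand(5, 3), [:x, :y, :z])
C = named_cor(df)
C["x", "y"]   # look up a single correlation by name
```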
I personally prefer AxisArrays.jl:
julia> C = AxisArray(rand(3,3), row=[:a,:b,:c], col=[:a,:b,:c])
2-dimensional AxisArray{Float64,2,...} with axes:
:row, [:a, :b, :c]
:col, [:a, :b, :c]
And data, a 3×3 Matrix{Float64}:
 0.0344525  0.210511  0.365664
 0.270243   0.319163  0.113199
 0.960284   0.832544  0.105832
julia> C[:a,:b]
0.21051123926125237
julia> C[1,2]
0.21051123926125237
This is something that should honestly be made easier. Perhaps a DataFrameStats.jl convenience package where operations like this are already implemented behind the scenes. I always run into this pain.
The printing could definitely be improved here though. The NamedArrays.jl version is easier to read because it lists the column and row names directly.
I would make it TableStats.jl, though, and would try to support any Tables.jl table: a set of basic convenience functions that generalize cov, cor, mean, kurtosis, etc. from Statistics and StatsBase.jl to work with table types.
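To make the idea concrete, here is a rough sketch of what one such function might look like for cor. No such package is implied to exist: `tablecor` and the `variable` label column are hypothetical names of my own. It only relies on Tables.columns, Tables.columnnames, and Tables.matrix, and assumes all columns are numeric and none is already called `variable`.

```julia
using Tables, Statistics

# Hypothetical helper, not an existing package function:
# correlation between the columns of any Tables.jl-compatible table,
# returned as a column table (NamedTuple of vectors) with a label column.
function tablecor(tbl)
    cols = Tables.columns(tbl)
    labels = collect(Symbol, Tables.columnnames(cols))
    C = cor(Tables.matrix(cols))
    # One vector per column of the correlation matrix, keyed by column name
    corcols = NamedTuple{Tuple(labels)}(Tuple(collect(c) for c in eachcol(C)))
    # Put the row labels first; :variable is an arbitrary, illustrative name
    return merge((variable = labels,), corcols)
end

tbl = (x = rand(5), y = rand(5), z = rand(5))   # a NamedTuple of vectors is a valid table
ct = tablecor(tbl)
ct.x   # correlations of every column with x
```

Because the result is itself a Tables.jl column table, it can be passed on to a sink such as `DataFrame(ct)` if DataFrames.jl is loaded.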
Taking some arguments from a discussion on correlated issues
julia> using DataFrames, Statistics
julia> cols = [:x, :y, :z]
3-element Vector{Symbol}:
:x
:y
:z
julia> df = DataFrame(rand(5,3), cols)
5×3 DataFrame
 Row │ x         y         z
     │ Float64   Float64   Float64
─────┼──────────────────────────────
   1 │ 0.234054  0.613994  0.557298
   2 │ 0.318559  0.748663  0.322644
   3 │ 0.999855  0.167148  0.801001
   4 │ 0.998599  0.549592  0.291748
   5 │ 0.932561  0.72125   0.688496
julia> corr=(;zip(cols,Tables.rowtable(DataFrame(cor(Matrix(df)),cols)))...)
(x = (x = 1.0, y = -0.4998593209925881, z = 0.32929593442917343), y = (x = -0.4998593209925881, y = 1.0, z = -0.5560163100630737), z = (x = 0.32929593442917343, y = -0.5560163100630737, z = 1.0))
julia> corr[:x][:x]
1.0
julia> corr[:z][:x]==corr[:x][:z]
true
That works nicely. Thanks.
Hello, do you know whether there have been any developments on this since the original post? Are there any new packages that do this smoothly with DataFrames?
Thank you!