How to get a quick description of a vector/matrix like `pd.DataFrame.describe`

When working in Python with pandas, I find pd.DataFrame.describe a helpful tool for getting quick insight into a DataFrame, particularly when working with large DataFrames in the REPL. It works as follows (example from the linked docs):

>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Is there anything similar in Julia?

Yes:

julia> using DataFrames

julia> df = DataFrame(x1 = rand('a':'z', 50), x2 = rand(50), x3 = rand(Int, 50));

julia> describe(df)
3×7 DataFrame
 Row │ variable  mean         min                   median       max                  nmissing  eltype
     │ Symbol    Union…       Any                   Union…       Any                  Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ x1                     b                                  z                           0  Char
   2 │ x2        0.557368     0.0171211             0.61268      0.980724                    0  Float64
   3 │ x3        -8.54368e17  -9192866872846841147  -1.90444e18  8714825590927121417         0  Int64

You can also pass extra arguments to choose which statistics are calculated:

julia> describe(df, :q75, :nunique, :first, :last)
3×5 DataFrame
 Row │ variable  q75         nunique  first                last
     │ Symbol    Union…      Union…   Any                  Any
─────┼────────────────────────────────────────────────────────────────────────
   1 │ x1                    22       k                    d
   2 │ x2        0.766689             0.795076             0.252337
   3 │ x3        3.48812e18           3819380600561978128  913260025103986027

(allowed arguments are :mean, :std, :min, :q25, :median, :q75, :max, :nunique, :nmissing, :first, :last, :eltype)


When I used Python I mainly used pandas, but for my Julia work I tend to use Vectors and Matrices. It’s good to know that DataFrames.jl has that functionality, and it’s clearly a valid answer. Interestingly, describe does accept vectors, but not matrices:

julia> describe(ones(10))
Summary Stats:
Length:         10
Missing Count:  0
Mean:           1.000000
Minimum:        1.000000
1st Quartile:   1.000000
Median:         1.000000
3rd Quartile:   1.000000
Maximum:        1.000000
Type:           Float64

julia> describe(ones(10,10))
ERROR: MethodError: no method matching quantile!(::Matrix{Float64}, ::Vector{Float64}; sorted=false, alpha=1.0, beta=1.0)...

It seems that StatsBase.jl implements this describe method and tries to compute quantiles over the entire matrix, which it has no method for. One workaround is to flatten the matrix into a vector first:

julia> describe(ones(10,10)[:])
Summary Stats:
Length:         100
Missing Count:  0
Mean:           1.000000
Minimum:        1.000000
1st Quartile:   1.000000
Median:         1.000000
3rd Quartile:   1.000000
Maximum:        1.000000
Type:           Float64

But it would probably be best for describe to have a dims argument so it can work row-wise or column-wise. Unless you know of this functionality elsewhere, I might open a ticket with StatsBase.
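A minimal sketch of what such a dims option might look like, assuming only the Statistics standard library. The name summarize_slices is hypothetical and not something StatsBase provides; it returns one NamedTuple of summary statistics per column (dims = 1) or per row (dims = 2).

```julia
using Statistics

# Hypothetical helper (not part of StatsBase or DataFrames):
# summarize each slice of a matrix along `dims`.
#   dims = 1 -> one summary per column
#   dims = 2 -> one summary per row
function summarize_slices(A::AbstractMatrix{<:Real}; dims::Integer = 1)
    dims in (1, 2) || throw(ArgumentError("dims must be 1 or 2"))
    slices = dims == 1 ? eachcol(A) : eachrow(A)
    # Each slice is a vector view, so the usual vector statistics apply.
    return [(mean   = mean(s),
             min    = minimum(s),
             q25    = quantile(s, 0.25),
             median = median(s),
             q75    = quantile(s, 0.75),
             max    = maximum(s)) for s in slices]
end
```

For example, summarize_slices(ones(10, 10)) would give a vector of ten summaries, one per column, and summarize_slices(ones(10, 10); dims = 2) one per row.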

If you want a summary of each column, you could perhaps broadcast the describe function:

describe.(eachcol(ones(10,10)))

This does give the correct information, but the output is quite ugly, as it’s ten separate vector descriptions. It’d be better if the description kept the matrix’s columns together, as pandas.DataFrame.describe does, with the simple describe(ones(10,10)) syntax.

You could just do

describe(DataFrame(mymatrix, :auto; copycols = false))

or even define that method for describe(x::AbstractMatrix) locally if you want to save on keystrokes.
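A sketch of such a local shortcut, assuming DataFrames.jl is installed. The helper name describe_matrix is hypothetical; using a separate name rather than adding a describe method for AbstractMatrix avoids extending a function you don't own on a type you don't own.

```julia
using DataFrames

# Hypothetical convenience wrapper: view the matrix as a DataFrame with
# auto-generated column names (x1, x2, ...) without copying the data,
# then defer to DataFrames' describe. Any extra statistics symbols
# (e.g. :q25, :q75) are passed straight through.
describe_matrix(x::AbstractMatrix, stats...) =
    describe(DataFrame(x, :auto; copycols = false), stats...)
```

With this in scope, describe_matrix(ones(10, 10)) should give one row per matrix column, and describe_matrix(ones(10, 10), :q25, :q75) restricts the output to those statistics.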

True, but I intend to do this a lot, so I’d rather not keep creating DataFrames, and saving the keystrokes would be preferable. As you suggest, implementing it would be fairly trivial; if I get time I think I’ll make a PR for this to StatsBase. Thanks for your help.