I am interested in making DataFrames as functional as R and Stata’s ecosystems. So far, I am really impressed with the data munging and processing ecosystem. I think a combination of DataFrames, DataFramesMeta, and Lazy results in a more readable and easier data manipulation process than `dplyr`!

One thing that has always bugged me, however, is DataFrames’s `describe` function. Its current behavior simply iterates through all columns and prints the results of `StatsBase`'s `describe` function (with an added method to accommodate missing values). I have always found this hard to read, since every column gets its own multi-line block and you have to scroll very far to see the results you want.

```
df = DataFrame(x = rand(10))
describe(df)
>
Summary Stats:
Mean: 0.607991
Minimum: 0.134861
1st Quartile: 0.443117
Median: 0.633238
3rd Quartile: 0.811435
Maximum: 0.977174
Length: 10
Type: Float64
```

A solution is to make `describe` return a table rather than a list, as it currently does. Because the package is for dataframes anyway, it makes sense for `describe()` to simply return a dataframe.

As @ExpandingMan pointed out in an earlier thread, the problem is that there isn’t much horizontal space to work with in the REPL, especially if you want to list the type names of columns, which can get very long.

I wrote something up that has the behavior I think is most useful. It presents less information than the current `describe` behavior in three ways.

- Doesn’t show the 1st and 3rd quartiles, only `min`, `mean`, `median`, and `max`.
- Doesn’t show the length, since all columns have the same length.
- Doesn’t show the full type. Rather, it shows whether the eltype is `<: Real` and whether it allows missing values.
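
The two type flags in the last bullet can also be computed directly from the eltype rather than through dispatch. This is only a minimal sketch: the helper name `type_flags` is hypothetical (not part of DataFrames), and it assumes a Julia version that provides `nonmissingtype` in Base (1.3 or later).

```julia
# Hypothetical helper: reduce a column's element type to two Bool flags.
function type_flags(col::AbstractVector)
    T = eltype(col)
    # True for Union{X, Missing}; note it is also true for eltype Any.
    allows_missing = Missing <: T
    # Strip Missing before testing; guard the all-missing column, whose
    # non-missing type is the empty Union{} (a subtype of everything).
    is_real = T !== Missing && nonmissingtype(T) <: Real
    return (isReal = is_real, allowMissing = allows_missing)
end

type_flags([1, 2, 3])
type_flags([1.0, missing])
type_flags(["a", "b", missing])
```

A flag pair like this keeps the type column at a fixed, narrow width no matter how long the full type name is.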

```
using DataFrames, StatsBase  # summarystats comes from StatsBase

function my_describe(df::AbstractDataFrame)
    # Each get_stats method returns one 1×7 row:
    # mean, min, median, max, isReal, allowMissing, fracMissing

    # Real-valued column without missing values
    function get_stats(col::AbstractArray{T} where T <: Real)
        stats = summarystats(col)
        return [stats.mean stats.min stats.median stats.max true false 0]
    end
    # Real-valued column that allows missing values
    function get_stats(col::AbstractArray{Union{T, Missing}} where T <: Real)
        stats = summarystats(collect(skipmissing(col)))
        return [stats.mean stats.min stats.median stats.max true true count(ismissing, col)/length(col)]
    end
    # Fallback: non-numeric column without missing values
    function get_stats(col)
        return [nothing nothing nothing nothing false false 0]
    end
    # Non-numeric column that allows missing values
    function get_stats(col::AbstractArray{Union{T, Missing}} where T)
        return [nothing nothing nothing nothing false true count(ismissing, col)/length(col)]
    end

    # Empty result table; the statistic columns are Any so they can hold `nothing`
    sumstats = DataFrame(Variable = Symbol[],
                         mean = Any[],
                         min = Any[],
                         median = Any[],
                         max = Any[],
                         isReal = Bool[],
                         allowMissing = Bool[],
                         fracMissing = Float64[])

    # One output row per input column (pairs(eachcol(df)) yields name => column)
    for (name, col) in pairs(eachcol(df))
        push!(sumstats, [name get_stats(col)])
    end
    return sumstats
end
```

We can look at its behavior as follows:

```
using Test

# Test that the describe output handles all values and missings properly

# Construct the expected output DataFrame
Variable = [:number, :number_missing, :non_number, :non_number_missing]
Mean = [2.5, 2.0, nothing, nothing]
Min = [1.0, 1.0, nothing, nothing]
Median = [2.5, 2.0, nothing, nothing]
Max = [4.0, 3.0, nothing, nothing]
isReal = [true, true, false, false]
allowMissing = [false, true, false, true]
fracMissing = [0, .25, 0, .25]
describe_output = DataFrame(
    Variable = Variable,
    mean = Mean,
    min = Min,
    median = Median,
    max = Max,
    isReal = isReal,
    allowMissing = allowMissing,
    fracMissing = fracMissing,
)

# Construct the input DataFrame
vec_number = [1, 2, 3, 4]
vec_number_missing = [1, 2, 3, missing]
vec_non_number = ["a", "b", "c", "d"]
vec_non_number_missing = ["a", "b", "c", missing]
df = DataFrame(number = vec_number,
               number_missing = vec_number_missing,
               non_number = vec_non_number,
               non_number_missing = vec_non_number_missing)

@test describe_output == my_describe(df)
```

```
│ Row │ number │ number_missing │ non_number │ non_number_missing │
├─────┼────────┼────────────────┼────────────┼────────────────────┤
│ 1 │ 1 │ 1 │ "a" │ "a" │
│ 2 │ 2 │ 2 │ "b" │ "b" │
│ 3 │ 3 │ 3 │ "c" │ "c" │
│ 4 │ 4 │ missing │ "d" │ missing │
```

For the input `df` shown above, `my_describe(df)` gives the result:

```
│ Row │ Variable │ mean │ min │ median │ max │ isReal │ allowMissing │ fracMissing │
├─────┼────────────────────┼─────────┼─────────┼─────────┼─────────┼────────┼──────────────┼─────────────┤
│ 1 │ number │ 2.5 │ 1.0 │ 2.5 │ 4.0 │ true │ false │ 0.0 │
│ 2 │ number_missing │ 2.0 │ 1.0 │ 2.0 │ 3.0 │ true │ true │ 0.25 │
│ 3 │ non_number │ nothing │ nothing │ nothing │ nothing │ false │ false │ 0.0 │
│ 4 │ non_number_missing │ nothing │ nothing │ nothing │ nothing │ false │ true │ 0.25 │
```

I have a few questions about this.

- Does this show the information people want? Is it worth having the information presented deviate so much from `describe(x::AbstractArray)`?
- Is it even worth it to return a dataframe object? It’s probably not the best way to get a column with the median of each variable. Is it worth it to skip the `dataframe` object and just print a pretty table? Then we could add information about the size of the dataframe without worrying about what object we are returning to the user.
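
To make the second question concrete, here is a rough sketch of the print-only alternative. The function name `print_describe` and its layout are hypothetical choices for illustration, not an existing DataFrames API; it assumes a DataFrames release where `pairs(eachcol(df))` yields name and column pairs.

```julia
using DataFrames, Statistics

# Hypothetical printer: writes a size header plus one line per column
# and returns nothing, so no result object is handed back to the user.
function print_describe(io::IO, df::AbstractDataFrame)
    println(io, nrow(df), " rows, ", ncol(df), " columns")
    for (name, col) in pairs(eachcol(df))
        vals = collect(skipmissing(col))   # drop missing values
        nmiss = count(ismissing, col)
        label = rpad(string(name), 20)
        if eltype(vals) <: Real && !isempty(vals)
            println(io, label,
                    " mean = ", round(mean(vals), digits = 3),
                    " median = ", round(median(vals), digits = 3),
                    " missing = ", nmiss)
        else
            println(io, label, " (non-numeric) missing = ", nmiss)
        end
    end
    return nothing
end

# Convenience method that prints to the REPL
print_describe(df::AbstractDataFrame) = print_describe(stdout, df)
```

The trade-off is the one raised above: the header can carry the dataframe's size for free, but the statistics can no longer be indexed or filtered afterwards the way a returned dataframe allows.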