I am interested in making DataFrames as functional as the R and Stata ecosystems. So far, I am really impressed with the data munging and processing ecosystem. I think a combination of DataFrames, DataFramesMeta, and Lazy makes for a more readable and easier data manipulation process than dplyr!
One thing that has always bugged me, however, is DataFrames's describe function. Its current behavior simply iterates through all columns and prints the results of StatsBase's describe function (with an added method to accommodate missing values). I have always found this hard to read, since you have to scroll very far to see the results you want.
df = DataFrame(x = rand(10))
describe(df)
Summary Stats:
Mean: 0.607991
Minimum: 0.134861
1st Quartile: 0.443117
Median: 0.633238
3rd Quartile: 0.811435
Maximum: 0.977174
Length: 10
Type: Float64
A solution is to make describe return a table rather than a list, as it currently does. Because the package is for dataframes anyway, it makes sense for describe() to simply return a dataframe.
As @ExpandingMan pointed out in an earlier thread, the problem is that there isn't much horizontal space to work with in the REPL, especially if you want to list the type names of the columns, which can get very long.
I wrote something up that has the behavior I think is most useful. It presents less information than the current describe behavior in 3 ways.
- Doesn't show the 1st and 3rd quartiles, only min, mean, median, and max.
- Doesn't show the length, since all columns have the same length.
- Doesn't show the full type. Rather, it shows whether the eltype is <: Real and whether it allows missing values.
using DataFrames, Missings

function my_describe(df::AbstractDataFrame)
    # Each get_stats method returns one row:
    # mean, min, median, max, isReal, allowMissing, fracMissing

    # Real-valued column that cannot hold missing values
    function get_stats(col::AbstractArray{T} where T <: Real)
        stats = summarystats(col)
        t = [stats.mean stats.min stats.median stats.max true false 0]
    end

    # Real-valued column that allows missing values: skip them for the stats
    function get_stats(col::AbstractArray{Union{T, Missing}} where T <: Real)
        stats = summarystats(collect(skipmissing(col)))
        t = [stats.mean stats.min stats.median stats.max true true count(ismissing, col)/length(col)]
    end

    # Fallback: non-real column that cannot hold missing values
    function get_stats(col)
        t = [nothing nothing nothing nothing false false 0]
    end

    # Non-real column that allows missing values
    function get_stats(col::AbstractArray{Union{T, Missing}} where T)
        t = [nothing nothing nothing nothing false true count(ismissing, col)/length(col)]
    end

    # Empty result table, filled with one row per column of df
    sumstats = DataFrame(Variable = Vector{Symbol}(0),
                         mean = Array{Any,1}(0),
                         min = Array{Any,1}(0),
                         median = Array{Any,1}(0),
                         max = Array{Any,1}(0),
                         isReal = Array{Bool,1}(0),
                         allowMissing = Array{Bool,1}(0),
                         fracMissing = Array{Float64,1}(0))

    for (name, col) in eachcol(df)
        t = [name get_stats(col)]
        push!(sumstats, t)
    end
    return sumstats
end
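The four get_stats methods select the isReal and allowMissing flags through dispatch. As a rough alternative sketch, the same two flags and the missing fraction could be read off the element type directly. This assumes Julia 1.3 or later (where Base provides nonmissingtype), and the helper names is_real, allows_missing, and frac_missing are hypothetical, not part of DataFrames.

```julia
# Hypothetical helpers; the names are illustrative, not DataFrames API.
# Assumes Julia 1.3 or later, where Base provides `nonmissingtype`.

# true when the element type, with Missing stripped, is a subtype of Real
is_real(col::AbstractArray) = nonmissingtype(eltype(col)) <: Real

# true when the element type admits missing values
allows_missing(col::AbstractArray) = Missing <: eltype(col)

# fraction of entries that are missing
frac_missing(col::AbstractArray) = count(ismissing, col) / length(col)

# Print the three values for a few sample columns
for col in ([1, 2, 3, 4], [1, 2, 3, missing], ["a", "b", "c", missing])
    println(eltype(col), ": ", is_real(col), " ", allows_missing(col), " ", frac_missing(col))
end
```

One caveat with this approach: a column that contains only missing values has nonmissingtype equal to Union{}, which is a subtype of everything, so is_real would report true for it.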
We can check the behavior of my_describe as follows:
# Test that the describe output handles all values and missings properly
# (@test requires Base.Test on Julia 0.6, or the Test standard library on 0.7+)

# Construct the expected describe output
Variable = [:number, :number_missing, :non_number, :non_number_missing]
Mean = [2.5, 2.0, nothing, nothing]
Min = [1.0, 1.0, nothing, nothing]
Median = [2.5, 2.0, nothing, nothing]
Max = [4.0, 3.0, nothing, nothing]
isReal = [true, true, false, false]
allowMissing = [false, true, false, true]
fracMissing = [0, .25, 0, .25]
describe_output = DataFrame(
    Variable = Variable,
    mean = Mean,
    min = Min,
    median = Median,
    max = Max,
    isReal = isReal,
    allowMissing = allowMissing,
    fracMissing = fracMissing,
)

# Construct the input DataFrame
vec_number = [1, 2, 3, 4]
vec_number_missing = [1, 2, 3, missing]
vec_non_number = ["a", "b", "c", "d"]
vec_non_number_missing = ["a", "b", "c", missing]
df = DataFrame(number = vec_number,
               number_missing = vec_number_missing,
               non_number = vec_non_number,
               non_number_missing = vec_non_number_missing)

@test describe_output == my_describe(df)
│ Row │ number │ number_missing │ non_number │ non_number_missing │
├─────┼────────┼────────────────┼────────────┼────────────────────┤
│ 1   │ 1      │ 1              │ "a"        │ "a"                │
│ 2   │ 2      │ 2              │ "b"        │ "b"                │
│ 3   │ 3      │ 3              │ "c"        │ "c"                │
│ 4   │ 4      │ missing        │ "d"        │ missing            │
Calling my_describe on this dataframe gives the result:
│ Row │ Variable           │ mean    │ min     │ median  │ max     │ isReal │ allowMissing │ fracMissing │
├─────┼────────────────────┼─────────┼─────────┼─────────┼─────────┼────────┼──────────────┼─────────────┤
│ 1   │ number             │ 2.5     │ 1.0     │ 2.5     │ 4.0     │ true   │ false        │ 0.0         │
│ 2   │ number_missing     │ 2.0     │ 1.0     │ 2.0     │ 3.0     │ true   │ true         │ 0.25        │
│ 3   │ non_number         │ nothing │ nothing │ nothing │ nothing │ false  │ false        │ 0.0         │
│ 4   │ non_number_missing │ nothing │ nothing │ nothing │ nothing │ false  │ true         │ 0.25        │
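As a sanity check on the number_missing row, the same values can be reproduced by hand with the Statistics standard library. This is only a sketch for a single column, not a replacement for the function above.

```julia
using Statistics  # standard library: mean, median

col = [1, 2, 3, missing]

# Drop the missing entry before computing statistics
present = collect(skipmissing(col))    # [1, 2, 3]

mean(present)                          # 2.0
median(present)                        # 2.0
extrema(present)                       # (1, 3)

# The missing fraction is computed on the full column
count(ismissing, col) / length(col)    # 0.25
```

Note that extrema returns the integers 1 and 3 here, whereas the table shows 1.0 and 3.0.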
I have a few questions about this.
- Does this show the information people want? Is it worth having the information presented deviate so much from describe(x::AbstractArray)?
- Is it even worth it to return a dataframe object? It's probably not the best way to get a column with the median of each variable. Is it worth it to skip the dataframe object and just print a pretty table? Then we could add information about the size of the dataframe without worrying about what object we are returning to the user.
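To make the second option concrete, here is a minimal sketch of a print-only summary that reports the size first and returns nothing. The function name print_summary and its layout are assumptions for illustration. It takes a non-empty collection of name => vector pairs so that it runs without DataFrames, and it shows only the missing fraction to stay short.

```julia
# Hypothetical print-only summary; `print_summary` is an illustrative name,
# not an existing DataFrames function. `cols` is assumed to be a non-empty
# collection of name => vector pairs, all vectors of equal length.
function print_summary(io::IO, cols)
    names = [string(first(p)) for p in cols]
    nrows = length(last(first(cols)))

    # Size information goes in a header line instead of a returned object
    println(io, nrows, " rows, ", length(names), " columns")

    # Pad the first column to the longest variable name
    width = max(length("Variable"), maximum(length, names))
    println(io, rpad("Variable", width), "  fracMissing")
    for (name, col) in cols
        frac = count(ismissing, col) / length(col)
        println(io, rpad(string(name), width), "  ", frac)
    end
    return nothing  # the table is only printed
end

print_summary(cols) = print_summary(stdout, cols)

print_summary([:number => [1, 2, 3, 4],
               :non_number_missing => ["a", "b", "c", missing]])
```

The trade-off is the one raised in the question: the header can carry anything, but the user can no longer index into the result.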