A proposal for `describe` of a DataFrame

dataframes

#1

I am interested in making DataFrames as functional as the R and Stata ecosystems. So far, I am really impressed with the data munging and processing ecosystem. I think the combination of DataFrames, DataFramesMeta, and Lazy results in a more readable and easier data manipulation process than dplyr!

One thing that has always bugged me, however, is DataFrames’s describe function. Its current behavior simply iterates through all columns and prints the results of StatsBase’s describe function (with an added method to accommodate missing values). I have always found this hard to read, since you have to scroll very far to see the results you want.

df = DataFrame(x = rand(10))
describe(df)
>
Summary Stats:
Mean:           0.607991
Minimum:        0.134861
1st Quartile:   0.443117
Median:         0.633238
3rd Quartile:   0.811435
Maximum:        0.977174
Length:         10
Type:           Float64

A solution is to make describe return a table rather than a list, as it currently does. Because the package is for data frames anyway, it makes sense to just have describe() return a DataFrame.

As @ExpandingMan pointed out in an earlier thread, the problem is that there isn’t much horizontal space to work with in the REPL, especially if you want to list the names of the types of the columns, which can get very long.

I wrote something up that has the behavior I think is most useful. It presents less information than the current describe behavior in three ways:

  1. It doesn’t show the 1st and 3rd quartiles, only min, mean, median, and max.
  2. It doesn’t show the length, since all columns have the same length.
  3. It doesn’t show the full type. Rather, it shows whether the eltype is <: Real and whether it allows missing values.
using DataFrames, Missings


function my_describe(df::AbstractDataFrame)
    # Real-valued column without missings
    function get_stats(col::AbstractArray{T} where T <: Real)
        stats = summarystats(col)
        [stats.mean stats.min stats.median stats.max true false 0]
    end

    # Real-valued column that allows missings
    function get_stats(col::AbstractArray{Union{T, Missing}} where T <: Real)
        stats = summarystats(collect(skipmissing(col)))
        [stats.mean stats.min stats.median stats.max true true count(ismissing, col)/length(col)]
    end

    # Non-numeric column without missings
    function get_stats(col)
        [nothing nothing nothing nothing false false 0]
    end

    # Non-numeric column that allows missings
    function get_stats(col::AbstractArray{Union{T, Missing}} where T)
        [nothing nothing nothing nothing false true count(ismissing, col)/length(col)]
    end

    sumstats = DataFrame(Variable = Vector{Symbol}(0),
        mean = Array{Any,1}(0),
        min = Array{Any,1}(0),
        median = Array{Any,1}(0),
        max = Array{Any,1}(0),
        isReal = Array{Bool,1}(0),
        allowMissing = Array{Bool,1}(0),
        fracMissing = Array{Float64,1}(0))
    for (name, col) in eachcol(df)
        push!(sumstats, [name get_stats(col)])
    end
    return sumstats
end

We can look at its behavior as follows:

# Test that the describe output handles real values and missings properly
# Construct the expected output DataFrame
Variable = [:number, :number_missing, :non_number, :non_number_missing]
Mean = [2.5, 2.0, nothing, nothing]
Min = [1.0, 1.0, nothing, nothing]
Median = [2.5, 2.0, nothing, nothing]
Max = [4.0, 3.0, nothing, nothing]
isReal = [true, true, false, false]
allowMissing = [false, true, false, true]
fracMissing = [0, .25, 0, .25]
describe_output = DataFrame(
    Variable = Variable,
    mean = Mean,
    min = Min,
    median = Median,
    max = Max,
    isReal = isReal,
    allowMissing = allowMissing,
    fracMissing = fracMissing,
    )

# Construct the input DataFrame
vec_number = [1, 2, 3, 4]
vec_number_missing = [1, 2, 3, missing]
vec_non_number = ["a", "b", "c", "d"]
vec_non_number_missing = ["a", "b", "c", missing]

df = DataFrame(number = vec_number,
    number_missing = vec_number_missing,
    non_number = vec_non_number,
    non_number_missing = vec_non_number_missing)

@test describe_output == my_describe(df)

The test DataFrame df looks like this:
│ Row │ number │ number_missing │ non_number │ non_number_missing │
├─────┼────────┼────────────────┼────────────┼────────────────────┤
│ 1   │ 1      │ 1              │ "a"        │ "a"                │
│ 2   │ 2      │ 2              │ "b"        │ "b"                │
│ 3   │ 3      │ 3              │ "c"        │ "c"                │
│ 4   │ 4      │ missing        │ "d"        │ missing            │

And my_describe(df) gives the result:

│ Row │ Variable           │ mean    │ min     │ median  │ max     │ isReal │ allowMissing │ fracMissing │
├─────┼────────────────────┼─────────┼─────────┼─────────┼─────────┼────────┼──────────────┼─────────────┤
│ 1   │ number             │ 2.5     │ 1.0     │ 2.5     │ 4.0     │ true   │ false        │ 0.0         │
│ 2   │ number_missing     │ 2.0     │ 1.0     │ 2.0     │ 3.0     │ true   │ true         │ 0.25        │
│ 3   │ non_number         │ nothing │ nothing │ nothing │ nothing │ false  │ false        │ 0.0         │
│ 4   │ non_number_missing │ nothing │ nothing │ nothing │ nothing │ false  │ true         │ 0.25        │

I have a few questions about this.

  1. Does this show the information people want? Is it worth having the information presented deviate so much from describe(x::AbstractArray)?
  2. Is it even worth it to return a dataframe object? It’s probably not the best way to get a column with the median of each variable. Is it worth it to skip the dataframe object and just print a pretty table? Then we could add information about the size of the dataframe without worrying about what object we are returning to the user.

#2

Nice idea. You could get more compact printing by substituting something else for nothing (maybe a Unicode symbol, but I don’t have a good suggestion).

Also, the columns allowMissing and fracMissing could be combined into one with some notation, e.g. =0 for non-allowed missing and a simple percentage otherwise (e.g. 25%). Also with a more compact header. Printing the type would be better IMO instead of isReal, which does not save much space. Finally, you could define a type for the summary, and then just a show method for it.
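That last idea could be sketched as follows. All the names here (ColSummary, the fields) are made up for illustration, and the "=0"/percentage notation is stored as a plain string:

```julia
# Hypothetical sketch of a summary type with its own show method,
# using the proposed "=0" / percentage notation for missings.
struct ColSummary
    name::Symbol
    mean::Union{Float64, Nothing}
    missings::String   # "=0" when missings are disallowed, else e.g. "25%"
end

function Base.show(io::IO, s::ColSummary)
    m = s.mean === nothing ? "-" : string(round(s.mean, digits = 3))
    print(io, rpad(string(s.name), 16), rpad(m, 10), s.missings)
end
```

With a show method defined, a vector of ColSummary prints one aligned line per variable without committing describe to returning a DataFrame.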

However, the reason I like the current describe is precisely because it does not force a tabular format. With a dataframe of integers, floats, and symbols, it is hard to make the display tabular since they don’t share statistics. E.g. for categorical data I would print the 5 most frequent symbols, etc.


#3

I think in general this is a good idea. I’m not fully on board with your suggested columns but I do like the format.

I think instead of isReal and allowMissing, you just need a single column that actually prints the type. Union{Missing, Int64} tells me everything I need to know there. I think it is even more informative actually. Consider a column of Bool. You’d have false and false printed. So what’s the type!?

I’d also prefer the count of missing like it is now vs the fraction missing.

For length, why not print it like it is printed when show is called on the dataframe itself?

I don’t care about exactly how the results are returned but I do like the horizontal printing.


#4

One issue with combining allowMissing and fracMissing is that a column might allow missing values but not have any at the moment. So trying to set a value to missing could still throw an error even if the column currently shows 0 missings.

I think Stata’s ‘.’ for missing values is nice. I would do it if we switched from actual dataframe to an aligned table.


#5

The issue with printing the whole type is that they are really long. Union{Missing, Int64} is on the shorter end, but Array{Union{Array{Int64,1}, Missings.Missing},1} would be pretty unwieldy.

My logic for <:Real is that most of the bugs the user would get would be due to parsing: reading data in as a string when it should be a number. So this allows you to catch that specific error.
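To make that concrete, here is the kind of parsing bug the isReal flag is meant to surface (the data values are made up):

```julia
# A numeric column read correctly vs. one accidentally read in as strings
price_ok  = [3291.0, 3299.0, 3667.0]
price_bad = ["3291", "3299", "3667"]   # parse gone wrong

eltype(price_ok) <: Real    # true
eltype(price_bad) <: Real   # false: a hint the column was read as text
```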


#6

Interesting ideas. I also think we’d better print the actual type rather than only isReal and allowMissing, as it’s useful to distinguish categorical columns from String columns, and integers from floating points. Array{Union{Array{Int64,1}, Missings.Missing},1} is a pretty atypical use case anyway. There are also plans to find a shorter representation for Union{T,Missing}, e.g. T?. Another solution would be to omit type information entirely, and have people use showcols to obtain that information. That could make sense if it allows including more stats in the freed horizontal space, notably the quantiles.

Missing values should also be skipped when computing stats, as the presence of missing values is already indicated in a dedicated column.
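For concreteness, skipping missings looks like this (in current Julia, mean lives in the Statistics stdlib):

```julia
using Statistics

col = [1.0, 2.0, missing, 4.0]
mean(col)               # missing, since any missing propagates
mean(skipmissing(col))  # ≈ 2.333, computed over the non-missing values
```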

I’m not sure what to do with non-numeric columns. We currently only print the number of unique values. It could be included in one of the columns (e.g. mean/uniques). We should have a look at how other software handles this.


#7

Sounds very interesting, I wonder if the choice of the summary functions can be left to the user. I.e. something like:

describe(df, summaryfunctions = (mean, minimum, maximum, median, std))
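One way this might be wired up, sketched with plain name => vector pairs standing in for a DataFrame's columns (the function name describe_with is hypothetical):

```julia
using Statistics

# Hypothetical sketch of user-chosen statistics: apply each function to each
# column (skipping missings) and collect one Dict per variable.
function describe_with(cols, fns...)
    rows = Dict{Symbol, Any}[]
    for (name, v) in cols
        row = Dict{Symbol, Any}(:Variable => name)
        for f in fns
            # the column keeps the function's own name, e.g. :mean for mean
            row[Symbol(f)] = f(skipmissing(v))
        end
        push!(rows, row)
    end
    rows
end

describe_with([:a => [1, 2, 3, missing]], mean, minimum, maximum)
```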


#8

Stata

Stata has the following output from summarize. Non-numeric variables are said to have 0 observations.

I agree with @ point about showing the number of missings rather than the fraction of missings, because I forgot how useful that information is in Stata.

Note that Stata doesn’t need to tell me which columns allow missing values, since all columns allow missing values. It also doesn’t need to show me the types of variables in the dataset because Stata has a very limited number of types. It also doesn’t tell me which variables are factor variables because Stata’s “factor” settings aren’t known to the .dta as a whole; you set them for individual functions (i.variable). Additionally, Stata has the option detail on summarize, allowing an output more similar to other languages. A workflow like this, going into more detail if you want, could easily be ported to Julia because DataFrames has a describe function for columns.

summarize >
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1

summarize price, det >

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3291           3291
 5%         3748           3299
10%         3895           3667       Obs                  74
25%         4195           3748       Sum of Wgt.          74

50%       5006.5                      Mean           6165.257
                        Largest       Std. Dev.      2949.496
75%         6342          13466
90%        11385          13594       Variance        8699526
95%        13466          14500       Skewness       1.653434
99%        15906          15906       Kurtosis       4.819188

R

R’s summary behaves exactly like Julia’s, except it prints the statistics for each column side by side horizontally.

 carrier              flight       tailnum             origin         
Length:336776      Min.   :   1   Length:336776      Length:336776     
Class :character   1st Qu.: 553   Class :character   Class :character  
Mode  :character   Median :1496   Mode  :character   Mode  :character  
                   Mean   :1972                                        
                   3rd Qu.:3465                                        
                   Max.   :8500
                   NA's   :1     

I think this way of printing is uglier than Julia’s, given the way text wraps in the REPL. I could have sworn there was something that printed a tabular format in R, but now I can’t find it.

One thing to note is that, if I recall correctly, R’s data frames don’t have first-class support for user-defined types, so class names are guaranteed to be small.

Pandas’ is similar to both the R and Julia implementations.

I really think that a tabular representation of summary statistics is useful, and offers an opportunity to port a useful feature from Stata into Julia. However, it’s impossible to fit all the information we would like into a reasonable width. Here is a proposal.

  1. Replace all nothing values with - to save space.
  2. Replace the two missings columns with a single NMissing column that shows - if the column doesn’t allow missings and 0 if it allows missings but doesn’t have any at the moment.
  3. Don’t return a DataFrame object. Return an object that has the necessary information as fields for each variable. Then we use that information to print a subset of it, but the user can always call, say, summary.variable for more information. A benefit of just printing the information and not storing it is that dataframes with tons and tons of columns won’t have to keep this big object around whenever you call describe.

So here is a new summary table:

│ Variable           │ mean    │ min     │ median  │ max     │ isReal │ NMissing │
├────────────────────┼─────────┼─────────┼─────────┼─────────┼────────┼──────────┤
│ number             │ 2.5     │ 1.0     │ 2.5     │ 4.0     │ true   │ -        │
│ number_missing     │ 2.0     │ 1.0     │ 2.0     │ 3.0     │ true   │ 1        │
│ non_number         │ -       │ -       │ -       │ -       │ false  │ -        │
│ non_number_missing │ -       │ -       │ -       │ -       │ false  │ 1        │
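The NMissing rule in point 2 above could be sketched as follows (the helper name is made up):

```julia
# Hypothetical helper for the proposed NMissing column:
# "-" when the column cannot hold missings; otherwise the count (possibly 0).
nmissing_display(col) = Missing <: eltype(col) ? count(ismissing, col) : "-"
```

A column that allows missings but has none at the moment then shows 0, which keeps the allow-missing information visible without a second column.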

#9

OK, so basically Stata doesn’t handle non-numeric columns.

Maybe this package? https://ropensci.org/blog/2017/07/11/skimr/
It’s interesting, but I’m not sure I like the idea of separating numeric from string variables, which breaks the ordering of the columns and prevents returning a single object holding all the stats. Also the small histograms look nice but in practice I wonder whether it’s really a good usage of the limited screen space.

Or even better, do not print anything at all, just like what happens at the REPL.

Good idea. Or maybe use nothing, which would be printed as empty. That would give a better type for the column than mixing numbers and characters (due to -).

What’s the advantage of using a custom object rather than a data frame? It already implements all the printing logic, with support for pretty printing/browsing in GUIs (not sure whether it works in Juno currently but at some point it did).

I’m not convinced it’s useful to have an isReal column: it’s quite obvious from the stats whether a column is real or not. The exact type would be more useful information, if we can afford the space.


#10

Good idea. Or maybe use nothing, which would be printed as empty. That would give a better type for the column than mixing numbers and characters (due to -).

I couldn’t figure out how to get DataFrames to print nothings as empty. Is that something on master?

What’s the advantage of using a custom object rather than a data frame? It already implements all the printing logic, with support for pretty printing/browsing in GUIs (not sure whether it works in Juno currently but at some point it did).

I agree with this. Returning a dataframe means you get to piggyback off of a lot of print functionality. I hate in R how you suddenly get an object with tons of lists you have to awkwardly navigate. Broom is great, but is a symptom of a disorganized ecosystem.

I’m not convinced it’s useful to have an isReal column: it’s quite obvious from the stats whether a column is real or not. The exact type would be more useful information, if we can afford the space.

Perhaps a column showing the type of number? Int64 vs. Bool vs. factor variables.


#11

I still think printing the actual column type would be better, but a compromise would be to print its eltype (I think this is what eltype refers to anyway). So instead of Union{Missing,Float64} we see Float64. Then if there is any confusion about actual specific column types, the user can either use showcols or some other method to inspect that information.
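The stripping itself is a one-liner. At the time of this thread it was Missings.T; in later Julia versions (1.3+) the same operation is exposed in Base as nonmissingtype. A sketch (the helper name is made up):

```julia
# Strip the Missing part from a column's element type
# (Missings.T in this thread's era; nonmissingtype in Julia >= 1.3).
shown_type(col) = nonmissingtype(eltype(col))
```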

I’m on the fence about skipping missing values by default in the summary stats computation, but I guess I’m leaning that way since that is what other software does. So I’d back @nalimilan’s suggestion to do that.


#12

Ah, no, that’s a limitation. We could decide to always print nothing as an empty cell in data frames.

Yes, that’s what I meant.


#13

I still think printing the actual column type would be better, but a compromise would be to print its eltype (I think this is what eltype refers to anyway). So instead of Union{Missing,Float64} we see Float64.

The problem is that even non-missing types can be really long, especially if the type is from a package, like DataFrames.DataFrame.

I’m on the fence about skipping missing values by default

I am very pro skipping missing values by default, especially since the user will see whether or not a column allows missing values and how many of them there are.

A PR to DataFrames to show nothing as blank would be easy, since the actual function used isn’t Base.show, but rather an internally defined function.


#14

Can you give some examples of commonly used types that are super long? The elements of DataFrame columns that I use are typically single scalar values or strings, just like in Stata, SAS, R, …etc. So the longest thing I would expect here would be like Categorical or some derivative of that.


#15

I agree that I don’t use much beyond String, Float64 etc. I was under the impression it was somewhat common to use custom types in DataFrames, or even have a DataFrame which has arrays. I could be mistaken, though.

If someone is in the habit of using long-named types in their dataframes, they should chime in.


#16

Fortunately, on Julia 0.7 the name of the package is omitted for exported types.


#17

Reviving this thread because I got around to writing a function with more complete functionality:

  1. You can now specify which statistics you want to see (from a limited selection)
  2. It now shows the number of missing values, instead of the fraction, with nothing if the column doesn’t allow missings
  3. Now prints the full type of the variable, instead of just whether it’s a number or not. It also prints this at the end, so if you have a super long type name you can still read all the other columns next to each other.

Some thoughts:

  1. What about long variable names? Stata, which I am trying to emulate here, shortens the names of variables, but I for one find this pretty annoying.
  2. Are there plans for DataFrames to have alternative displays of nothing and missing? Could make the tables (and DataFrames in general) more readable.

PS: Please let me know if I should make this an issue in DataFrames or go ahead and submit a full PR.

Here is the code I’ve been working with:

using DataFrames, Missings


function my_describe(df::AbstractDataFrame; colstats = [:mean, :min, :median, :max, :NMissing, :type])
    # Check that people don't specify the wrong fields.
    allowed_fields = [:mean, :sd, :min, :q25, :median, :q75, :max, :type, :NUnique, :NMissing]
    for i in colstats
        if !(i in allowed_fields)
            error("Not an allowed field. Allowed fields are:\n:mean,\n:sd,\n:min,\n:q25,\n:median,\n:q75,\n:max,\n:type,\n:NMissing,\n:NUnique")
        end
    end
    # Define 4 methods for getting summary statistics.
    # Use a dict because we don't know which measures the user wants.

    # Real-valued column without missings
    function get_stats(col::AbstractArray{T} where T <: Real)
        stats = summarystats(col)
        Dict(
            :mean => stats.mean,
            :sd => sqrt(var(col)),
            :min => stats.min,
            :q25 => stats.q25,
            :median => stats.median,
            :q75 => stats.q75,
            :max => stats.max,
            :type => eltype(col),
            :NMissing => nothing,
            :NUnique => nothing
        )
    end

    # Real-valued column that allows missings
    function get_stats(col::AbstractArray{Union{T, Missing}} where T <: Real)
        nomissing = collect(skipmissing(col))
        stats = summarystats(nomissing)
        Dict(
            :mean => stats.mean,
            :sd => sqrt(var(nomissing)),
            :min => stats.min,
            :q25 => stats.q25,
            :median => stats.median,
            :q75 => stats.q75,
            :max => stats.max,
            :type => Missings.T(eltype(col)),
            :NMissing => count(ismissing, col),
            :NUnique => nothing
        )
    end

    # Non-numeric column without missings
    function get_stats(col)
        Dict(
            :mean => nothing,
            :sd => nothing,
            :min => nothing,
            :q25 => nothing,
            :median => nothing,
            :q75 => nothing,
            :max => nothing,
            :type => eltype(col),
            :NMissing => nothing,
            :NUnique => length(unique(col))
        )
    end

    # Non-numeric column that allows missings
    function get_stats(col::AbstractArray{Union{T, Missing}} where T)
        Dict(
            :mean => nothing,
            :sd => nothing,
            :min => nothing,
            :q25 => nothing,
            :median => nothing,
            :q75 => nothing,
            :max => nothing,
            :type => Missings.T(eltype(col)),
            :NMissing => count(ismissing, col),
            :NUnique => length(unique(col))
        )
    end

    # Takes in a column and returns a row vector of the requested statistics
    function spread(col)
        d = get_stats(col)
        row = Array{Any}(1, 0)
        for f in colstats
            row = [row d[f]]
        end
        row
    end
    # Stack the summary stats into a matrix, one row per column of df
    data = Array{Any}(0, length(colstats))
    for (name, col) in eachcol(df)
        data = [data; spread(col)]
    end
    # Add a column for variable names
    data = DataFrame([names(df) data])
    # Add the names for the Variable column and all the summary stats
    names!(data, [:Variable; colstats])
end

# Construct the input DataFrame
vec_number = [1, 2, 3, 4]
vec_number_missing = [1, 2, 3, missing]
vec_non_number = ["a", "b", "c", "d"]
vec_non_number_missing = ["a", "b", "c", missing]

struct MySuperLongNameForAStructIsThisToolong end
long_struct = [MySuperLongNameForAStructIsThisToolong() for i in 1:4]

df = DataFrame(number = vec_number,
    number_missing = vec_number_missing,
    non_number = vec_non_number,
    non_number_missing = vec_non_number_missing,
    long_struct = long_struct)

my_describe(df)


#18

Yes, please make a PR so that we can discuss this more precisely. I agree abbreviating variable names isn’t a good idea; anyway, if we did that it would have to happen when printing the data frame, not when creating it. Regarding the display of missing and nothing, we already print missing in grey to make it easy to spot, and I guess we could print nothing as empty since that’s how it’s printed at the REPL.