Request for un'stdlibfication of Statistics

The multiplicity of Stat_.jl packages has notably been a recurring issue for new and more experienced Julians alike. There is an unpersuasive historical story behind this, imo no longer of any relevance except for the stickiness it has exhibited.

It would be more becoming for Julia to review and reshuffle the statistical package’s supportive membrane to magnify intra-functional simplicity and inter-operational effervescence (figuratively and factually). Re-looking at and agreeing on a few helpful and unharmful moves may garner the active focus that does it.

Ping me for the third participant :technologist:.

Such an elegant exposition, @JeffreySarnoff. I hope my English skills become that good someday :slight_smile:

Moving Statistics out of the standard libraries seems like a good idea to me overall. However, what about mean? Can we move mean back into Base? :joy:

Regarding the dims argument, I have a PR to support pairwise(f, eachcol(tbl)) on Tables in FreqTables.jl. AFAICT this kind of pattern would replace most uses of dims while being more explicit and working both for arrays and tables (arrays are useful, I don’t think we can require using tables).

But for this to work we need changes in Tables.jl so that eachcol returns an AbstractColumns object so that we can dispatch on it. Otherwise there’s no way to know whether an object is a table or not (which is a problem in many places).

Another difficulty is that there’s no standard package for arrays with names. FreqTables uses NamedArrays, but there are other options (AxisArrays, AxisKeys…) without any clear winner, so it’s hard to decide that StatsBase or another central package should depend on one of them. Yet as soon as you take a Tables.jl object as an input, you’ll want to preserve column names by returning a named array… I think the solution for now is to develop a specialized package (e.g. TableStatistics.jl) until it’s clear that we want to merge these features into Statistics/StatsBase.

This problem exists because you are trying to support arrays in parallel to tables.

Of course there is!

coltable = (a=[1,2,3], b=[4,5,6])

rowtable = [(a=1,b=4), (a=2,b=5), (a=3,b=6)]

Both are built-in Julia types that implement the Tables.jl API.

module Foo
  mean(x) = sum(x) / length(x)
end

This will do in most use cases, unless you need extra options like weights, in which case you can certainly add a statistics-maintained package as a dependency for a full-featured mean.
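As a rough illustration of where the one-liner stops being enough, here is a hand-rolled weighted mean next to it. The name `weighted_mean` is made up for this sketch and is not the API of any statistics package.

```julia
module Foo

# The one-liner: fine for any collection that supports `sum` and `length`.
mean(x) = sum(x) / length(x)

# Hypothetical weighted variant, written by hand for illustration only.
function weighted_mean(x, w)
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    return sum(xi * wi for (xi, wi) in zip(x, w)) / sum(w)
end

end # module

Foo.mean([1, 2, 3, 4])                   # 2.5
Foo.weighted_mean([1, 2, 3], [1, 1, 2])  # 2.25
```

Once options like this start piling up (weights, missing values, dims), depending on a maintained package is probably less work than growing your own.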

Given your concern for new users and package discoverability, I would think you would prefer that new users be able to do this in a fresh Julia session:

julia> x = rand(4); mean(x)
0.6715977345843244

It’s bad enough that that doesn’t actually work in a vanilla Julia session (no installed packages) without first doing using Statistics.
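For reference, a minimal sketch of what currently has to be typed, assuming a Julia 1.x install where Statistics ships as a standard library:

```julia
using Statistics  # standard library, but not loaded by default

x = [1.0, 2.0, 3.0, 4.0]
m = mean(x)   # 2.5
s = std(x)    # sample standard deviation (divides by n - 1)
```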

Not really, as one can dispatch on AbstractArray for these. The problem is rather that not everybody is happy with the idea of having essential stats packages depend on Tables.jl and on a package supporting named arrays.

And what kind of object would pairwise(cor, eachcol(coltable)) (or cor(eachcol(coltable))) return? Returning a table wouldn’t be appropriate as the role of rows and columns in a correlation matrix is symmetric. It really needs to be a matrix with row and column names.
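To make the symmetry point concrete, here is a small sketch using only Statistics and a named tuple of vectors. The `varnames` and `idx` bookkeeping is hand-written for illustration; it stands in for whatever named-array type would eventually carry the labels.

```julia
using Statistics

coltable = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 7.0], c = [3.0, 1.0, 2.0])

varnames = collect(keys(coltable))   # [:a, :b, :c]
X = hcat(values(coltable)...)        # 3×3 matrix, one column per variable
C = cor(X)                           # pairwise correlations between columns

# Rows and columns of C play the same role, so names are needed on both axes.
idx = Dict(name => i for (i, name) in enumerate(varnames))
C[idx[:a], idx[:b]]                  # same value as cor(coltable.a, coltable.b)
```

The plain `Matrix` returned by `cor` has lost the names, which is why the lookup table has to be kept on the side here.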

I would rather point beginners to a cohesive ecosystem for stats with tabular data. The function mean by itself doesn’t justify the current situation with Statistics.

Tables.jl is a super lightweight package with a very flexible API that is used for all kinds of things. I never encountered a single limitation with it doing all kinds of complicated geometric/statistical processing in Meshes.jl/GeoStats.jl for example. I really think that we, as a community, should stick to its API in favor of a more user-friendly experience for beginners. If someone really thinks that there is a real need in the ecosystem for another concept that is not tabular data, then they can work separately on a package for that. We shouldn’t compromise the experience of 99% of users because 1% feels unhappy with tabular representations in statistics.

Yes, correlation/covariance matrices are not tables, they are matrices (i.e. arrays). I personally think that a good design should return an AxisArray (or any new custom type developed by us) so that we never lose track of the variable names during our statistical workflows.

A tabular statistics package could have a correlation table function returning an |x|y|cor| table.
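A rough sketch of what such a long-format result could look like, using only Statistics and a vector of named tuples as the output table. The function name `cortable` is hypothetical.

```julia
using Statistics

coltable = (a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 7.0], c = [3.0, 1.0, 2.0])

# Hypothetical helper: one row per unordered pair of columns.
function cortable(tbl::NamedTuple)
    ks = keys(tbl)
    return [(x = ks[i], y = ks[j], cor = cor(tbl[i], tbl[j]))
            for i in 1:length(ks) for j in (i + 1):length(ks)]
end

rows = cortable(coltable)   # 3 rows: (:a, :b), (:a, :c), (:b, :c)
```

The output is itself a row table (a vector of named tuples), so it could be fed to anything that accepts one.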

Yes, it could be anything we want, but I doubt that use cases will favor this choice. We compute covariance matrices to perform matrix decompositions, sample Gaussian processes, etc. We never use them as tables. That is why it is important to think of use cases and stick to them throughout the ecosystem.

I don’t think I understand the objection here. Consider the following:

  • You write a package TableStatistics.jl under the Tables aegis that imports Statistics.mean
  • You add new methods for mean(t::Table) etc. and export them
  • Your users do using TableStatistics and now they can do statistics on Tables

That seems simple and elegant, no?
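The steps above could be sketched roughly as follows. `ColumnTable` is a hypothetical concrete type invented for the example (Tables.jl itself does not define a `Table` type), and the module is a toy, not an existing package.

```julia
module TableStatistics  # hypothetical package name from the list above

import Statistics: mean   # extend the existing function rather than define a new one
export mean, ColumnTable

# Hypothetical concrete table type wrapping a named tuple of columns.
struct ColumnTable{T<:NamedTuple}
    columns::T
end

# New method: column-wise means, returned under the same column names.
mean(t::ColumnTable) = map(mean, t.columns)

end # module

using .TableStatistics

t = ColumnTable((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))
mean(t)            # (a = 2.0, b = 5.0)
mean([1, 2, 3])    # the array method from Statistics still works: 2.0
```

This only works because there is a concrete type to dispatch on.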

Also, FWIW if we need more maintainers for the non-Tables plain Statistics ecosystem, I will happily volunteer.

Tables.jl is a trait-based API without a parent type. We can’t rely on multiple dispatch to disambiguate from methods designed for arrays. We need a complete redesign with tables in mind.
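A dependency-free sketch of the dispatch problem. The local `istable` below is a simplified, hypothetical stand-in for a trait query in the style of `Tables.istable`, and `describe_input` is made up for the example.

```julia
# Hypothetical trait: true only for the two Base table layouts.
istable(::Any) = false
istable(::NamedTuple{names, T}) where {names, T<:Tuple{Vararg{AbstractVector}}} = true  # column table
istable(::AbstractVector{<:NamedTuple}) = true                                           # row table

# A row table is also an AbstractVector, so the array method receives it;
# the only way to tell the two apart is to query the trait inside the method.
describe_input(x::AbstractArray) = istable(x) ? "table (rows)" : "array"
describe_input(x) = istable(x) ? "table (columns)" : "something else"

describe_input([1, 2, 3])                    # "array"
describe_input([(a = 1, b = 4), (a = 2, b = 5)])  # "table (rows)"
describe_input((a = [1, 2], b = [4, 5]))     # "table (columns)"
```

With no common supertype, every array method would have to check the trait first.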

Ah I see.

That brings up the whole traits/interface discussion, which we probably do eventually want language-level support for? In this particular case, it does make me wonder if it mightn’t have been simpler to have a package which just defines abstract type Table end and have everyone who wants to buy in extend and subtype that? But that’s a major digression.
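For what it's worth, the subtype-based alternative would look something like this sketch. `MyColumns`, `ncolumns`, and `summarize` are hypothetical names for illustration.

```julia
abstract type Table end   # the shared supertype a central package would define

# A package opting in by subtyping.
struct MyColumns{T<:NamedTuple} <: Table
    columns::T
end

ncolumns(t::MyColumns) = length(t.columns)

# Generic code can now dispatch on the supertype directly.
summarize(t::Table) = "table with $(ncolumns(t)) columns"
summarize(x::AbstractArray) = "array of size $(size(x))"

summarize(MyColumns((a = [1, 2], b = [3, 4])))   # "table with 2 columns"
summarize([1 2; 3 4])                            # "array of size (2, 2)"
```

The catch is that only types declared as `<: Table` participate, so existing types cannot be opted in after the fact.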

I don’t fully understand all the implications of this plan. Does it mean that one wouldn’t be able to do basic statistics (mean, variance, histograms, etc) on arrays? I hope this is not the case.

Pretty much the only functions from Statistics that I use are mean and std, and I never call them with tables, so I would hope that at least those functions would maintain methods for arrays.

I think this would be a bit off-putting for people (like me) who just want to compute the mean of a collection of values (i.e., an array).

That doesn’t work for tables from Base, e.g. a vector of named tuples or a named tuple of vectors.

It means that anyone doing serious stats, i.e. much more than just computing mean, std and histograms, could leverage a clean high-level API. A basic API for arrays could easily be crafted in parallel, or a simple wrapper type like MLJ.table could be used to give an array dummy column names without performance issues.
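A minimal sketch of the wrapper idea, assuming a named tuple of column views is an acceptable table. The helper name `astable` and the `x1, x2, ...` naming scheme are made up here; this is not how MLJ.table is implemented.

```julia
# Hypothetical helper: expose each column of a matrix under a generated name.
function astable(X::AbstractMatrix; prefix = "x")
    names = ntuple(j -> Symbol(prefix, j), size(X, 2))
    cols = ntuple(j -> view(X, :, j), size(X, 2))   # views, so no data is copied
    return NamedTuple{names}(cols)
end

X = [1.0 4.0; 2.0 5.0; 3.0 6.0]
tbl = astable(X)
keys(tbl)      # (:x1, :x2)
sum(tbl.x2)    # 15.0
```

Because the columns are views, the wrapper costs no extra copy of the data, though building the named tuple from runtime names is not type-stable.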