Request for un'stdlibfication of Statistics

Here we use some of the statistical moments without a tables interface and would like to be able to continue doing that, whatever happens with the location of the package.

1 Like

I agree that Statistics+StatsBase have a weird separation and that we should probably work towards a joint solution.

I also agree with those that feel that perhaps some functions are so elementary that they should live in some form in Base, like mean. Why do I even need to load a package to use that? We have a LinearAlgebra library, but I don’t need to load it to use * to multiply matrices or to take the transpose of things.

I’d also like to hopefully clarify some of the discussion in this thread for those that might not be familiar yet. Taking Statistics.jl out of the Standard Library can be done without any changes to the package or user experience. Most of the other discussion is regarding proposed changes to the package after it leaves the Standard Library.

On that discussion, I will say that I agree that table-based statistics has a lot of room for improvement, but it is worth stating what I hope is obvious: mean(x::AbstractVector) must not be infringed! That is, I would hope that we do not ruin the user experience for people doing statistics on non-tables.

31 Likes

I’m chiming in as a casual Julia user, who only found this discussion because of an email summary of recent activity.

I’m a physicist who uses Julia regularly for simulation and data analysis. I use various stats functions frequently, but I have never used a table or any other fancy statistical types. In fact, I’m not sure of any situation where I would want to use tabular representation of data. My data typically has a geometric interpretation, like the distribution of some quantity over 1, 2, or 3 dimensional space, or a signal sampled in time. I almost always use arrays, and if I need a different interface (which happens rarely) then I add my own types to accomplish it. I would find it very irritating if I had to use DataFrames or something similar. I have looked into this representation several times, but I have always found that it added nothing of value for me, that it is an unnatural representation for my work, that it clutters my code, and that it makes it difficult for me to explain my code to other physicists (who by and large don’t know what a table is, in the statisticians’ sense).

So the impression I would like to make is this: Please keep in mind that it’s not just statisticians who use statistics functions. Many of us just want a simple interface to basic stats functions for arrays. I don’t care how this is accomplished, as it concerns splitting packages and such. But I would like to be able to type “Julia standard deviation” into google or into the Julia documentation and find (in under 30 seconds) a function that accepts an array and returns the standard deviation.

46 Likes

I am a polymer physicist. I don’t like the interface of DataFrames.jl either. I find it unnatural and hard to communicate to my colleagues. I’d rather do simple statistics with AbstractArray. mean, std, cor, etc. are not only for statisticians and data scientists. Please keep them in Base.

16 Likes

Keeping it in Base and adding DataFrames support are two different issues. The only difference that separating out Statistics would make for users is that they’d have to type add Statistics before calculating std or cor.

4 Likes

There are no plans to make it impossible to use mean with an array. The only question is whether we’d be willing to also add support for using it with DataFrames and Tables.

1 Like

How should mean and other simple statistical functions specialize on tables, and should they really?

It’s hard to guess what mean(tbl) should do: mean of all values in the entire table? In each row? In each column?
For some table types, mean(tbl) already works coincidentally and returns the per-row means, for example:

julia> using Statistics

julia> tbl = (a=[1,2,3], b=[4,5,6]);
julia> mean(tbl)
3-element Vector{Float64}:
 2.5
 3.5
 4.5

A more intuitive approach that works right now and doesn’t require adding anything is map(mean, ...):

julia> using Tables

julia> map(mean, Tables.columns(tbl))
(a = 2.0, b = 5.0)

julia> map(mean, Tables.rows(tbl))
3-element Vector{Float64}:
 2.5
 3.5
 4.5

Here, it’s clear whether we compute over columns or rows.
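For a column table that is just a NamedTuple of vectors, the same idea can be sketched with nothing but Statistics. This assumes the plain `tbl` from above and is not a general Tables.jl solution; other table types would still need Tables.columns or Tables.rows.

```julia
using Statistics

tbl = (a = [1, 2, 3], b = [4, 5, 6])

# Per-column means: map over a NamedTuple keeps the column names.
col_means = map(mean, tbl)                    # (a = 2.0, b = 5.0)

# Per-row means: zip the columns so each element is one row as a tuple.
row_means = map(mean, zip(values(tbl)...))    # [2.5, 3.5, 4.5]
```

Either way, the direction of the reduction is spelled out at the call site instead of being guessed by a mean(tbl) method.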

What’s missing in these examples? Why is special treatment of tables needed in basic stats at all?

6 Likes

I agree; I’m not saying it should be added as a dependency, just that people seem to be misunderstanding what the proposal is.

I don’t think we would want to add a mean(tbl) method. (In fact, we can’t, because there’s already a generic mean(itr::Any) method.) One of the reasons it would be nice to merge Statistics.jl with StatsBase.jl is because currently we have the following method in StatsBase.jl for calculating a weighted mean:

mean(A::AbstractArray, w::AbstractWeights)

Note that in order to pass weights, they have to be a subtype of AbstractWeights, so you have to call this function like this:

w = rand(n)
mean(x, weights(w))

It would be preferable to have a generic method like mean(x, w), or to add a keyword argument, like mean(itr; w=nothing). But those methods can’t be added to StatsBase.jl, because the first one would be type piracy and the second one would actually overwrite the mean(itr) method that Statistics already defines.
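To make the wish concrete, here is a minimal sketch of a weighted mean that takes plain weights. The name `weighted_mean` is hypothetical and is not part of Statistics or StatsBase; it is defined under its own name precisely so that it does not extend `mean` on types this code does not own.

```julia
using Statistics

# Hypothetical helper: weighted mean for any two equal-length collections,
# with no AbstractWeights wrapper required.
function weighted_mean(x, w)
    length(x) == length(w) ||
        throw(DimensionMismatch("x and w must have the same length"))
    return sum(xi * wi for (xi, wi) in zip(x, w)) / sum(w)
end

x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 2.0]

wm = weighted_mean(x, w)                  # (1 + 2 + 6) / 4 = 2.25
uniform = weighted_mean(x, ones(3))       # equal weights reduce to mean(x)
```

Writing the same thing as a `mean(x::AbstractArray, w::AbstractVector)` method inside a package that owns neither `mean` nor the array types is what would count as type piracy.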

3 Likes

None of it is/was in Base. You can keep doing all of this, with the proposal (except cor) with Base:

I’m just clarifying what’s being done: adding (only) these three to Base (and, well, mean!, but not cor; you have to draw the line somewhere), and even adding these was controversial.

Nothing is being taken away, all the same functionality and syntax from Statistics is there:

Well, for cor (no longer std). And note, if you’re making a package using Statistics, it needs to be added to your Project.toml.

1 Like

I don’t understand how this is not a break in backward compatibility. If it gets out of the stdlib, then it must be installed as a package. (And I’m one of those who think that it is … to add a dependency of how many lines? Just to do a mean or std.)

Yes, I read the argument that, with the improvements in Pkg, this only requires hitting the y key. But what about existing programs that run automatically?

2 Likes

You still have to list stdlibs as dependencies in package code.

1 Like

That’s a completely different story. Yes, one must do that. But there is no package that needs to be installed. Whereas if it flies away, all packages that depend on Statistics and don’t have it installed will hang until some mechanical force pushes the y key.

If you’re running code without human intervention, then it really should be associated with a local Project.toml and not the global environment. In which case, you should always make sure that the project is instantiated before running it, which will handle all the dependencies, even the ones in the stdlib.
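As a rough sketch of that workflow, assuming the script ships next to its own Project.toml: the helper name `bootstrap` is hypothetical, while Pkg.activate and Pkg.instantiate are the standard Pkg calls.

```julia
using Pkg

# Hypothetical helper: activate the project in `dir` and install everything
# its Project.toml / Manifest.toml lists, without an interactive prompt.
function bootstrap(dir::AbstractString)
    Pkg.activate(dir)
    Pkg.instantiate()
    return nothing
end

# Typical use at the top of an unattended script:
#   bootstrap(@__DIR__)
#   using Statistics
```

The same effect is available from the command line by starting Julia with the --project flag and calling Pkg.instantiate() once before the job runs.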

If a user is running non-interactive code without a local environment … well, the movement of Statistics is the least of their problems.

3 Likes

All existing code that uses, say, Julia 1.8.0 or older will just keep working. When this actually lands in Julia 1.9.0, you can presumably add using Statistics to your startup.jl. That should work for all scripts that do not disable the startup file. (Loading it is the default even for scripts, something I disagree with, but that’s another story: I would want it off by default for scripts and benchmarking, so that you can’t forget to disable it, and on for interactive use.)

I believe this should also work for packages, but I’m not sure; I haven’t tested it. Another option is a non-default sysimage that adds Statistics back, and then you’re basically back to square one. It’s an option, and it’s getting very easy to do.

Either you misread or didn’t read my comment. You actually don’t need [to install] Statistics for that (only for cor and more advanced functions), which mitigates some of the change. E.g. mean works, and so does Base.mean (though there is little need or use for the qualified form). But yes, I think Statistics.mean would no longer work if you qualify the call; I guess that’s rare, since qualifying isn’t needed.

It can justifiably be viewed as breaking, but there is actually no promise of NOT breaking in this way from 1.x.y to 1.x+1.y. The promise is about syntax and the API exported from Base (and not marked experimental). It’s a very fine distinction.

If this is a worry, or if 1.8.0 is no longer supported by then, the LTS is an option… until the package ecosystem catches up (I think that will happen quickly; before 1.9.0).

You’re right, I didn’t (:wink:), but I had read the discussion issue, and the decision has not yet been taken, although the clear tendency is to remove even mean from Base … unless users resist strongly.

I support this message. I was so perplexed when I noticed that mean is not present by default.

3 Likes

I might add the following to this discussion: tables are inherently 2-dimensional and hence ill-suited to multi-dimensional data, from both a conceptual and an efficiency standpoint. Much of StatsBase.jl would likely benefit from a multi-dimensional generalization that supports dims::NTuple{N,<:Integer} where {N} kwargs. I’ll pencil in some time for contributing multidimensional methods… I use them all the time in my work. With @brenhinkeller’s approval, I can cook up equivalent versions for VectorizedStatistics.
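For reference, this is the kind of call being asked for. mean in Statistics already accepts a tuple for dims, so a minimal sketch on a made-up 3-dimensional array looks like this (the array contents and variable names are illustrative only):

```julia
using Statistics

# Hypothetical 3-D field: a 2 × 3 spatial grid sampled at 4 time steps.
A = collect(reshape(1.0:24.0, 2, 3, 4))

# Reduce over both spatial dimensions at once; the result has size (1, 1, 4).
spatial_mean = mean(A; dims = (1, 2))

# Drop the singleton dimensions to get one value per time step.
per_step = dropdims(spatial_mean; dims = (1, 2))   # [3.5, 9.5, 15.5, 21.5]
```

The proposal above is about making this style of tuple-valued dims available consistently across the StatsBase.jl functions as well.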

I don’t think that’s correct. Tables are more like sets, they contain many n-tuples. They either are not dimensional at all or are N dimensional where N is the number of columns.

Still, I understand that they are not efficient at storing regular grids of 2- or 3-dimensional field data such as pressure or temperature.
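A small sketch of both views of the same data may help here. The field values and the names `field` and `rows` are made up for illustration: the array stores only the values and leaves grid positions implicit in the indices, while the table form stores every coordinate explicitly as part of each tuple.

```julia
using Statistics

# Dense array form: a 2 × 3 temperature field, position implied by the index.
field = [10.0 11.0 12.0;
         20.0 21.0 22.0]

# Table form: a collection of (i, j, temp) tuples, one per grid point.
rows = vec([(i = i, j = j, temp = field[i, j])
            for i in axes(field, 1), j in axes(field, 2)])

# Mean along the second grid dimension, computed from the array...
row_means_array = vec(mean(field; dims = 2))

# ...and from the table, by filtering on the stored coordinate.
row_means_table = [mean(r.temp for r in rows if r.i == i)
                   for i in axes(field, 1)]
```

Both give the same numbers, but the table version carries two extra integers per value and has to search by coordinate, which is the inefficiency for regular grids mentioned above.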