Request for un'stdlibfication of Statistics

It seems fine to me for StatsBase to just pick one and use it. Suppose NamedArrays is chosen. If the StatsBase developers later decide that they want to switch to AxisArrays, no big deal. Just increment the major version. Julia has a good package manager, so there’s no reason to be afraid of incrementing the major version number. :slight_smile:

2 Likes

I don’t know if I’m comfortable with the apparent implication here that anyone who wants to do “serious” stats has to do so via the Tables and/or DataFrames ecosystem?

19 Likes

It should be feasible for most statistical methods to support both tables and vectors/matrices. For example, we can have the following set of methods for cor:

cor(x::AbstractVector, y::AbstractVector)
cor(X::AbstractMatrix; dims=1)
cor(tbl::Any; x, y)  # Where `x` and `y` each accept a column name

Currently we don’t have the last of those methods because the Tables.jl interface is not in Base. I don’t think anyone is proposing that the statistics ecosystem require the usage of DataFrames.jl, just the lightweight Tables.jl interface package. So, in other words, you can choose to use any table that meets the Tables.jl interface, including TypedTables.jl or vectors of named tuples.
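As a rough illustration of the third method, here is a minimal sketch that assumes Tables.jl is installed. The name `tablecor` is hypothetical (chosen to avoid adding methods to `cor` itself), and the sample data is made up; only `Tables.columns` and `Tables.getcolumn` come from the Tables.jl interface.

```julia
using Statistics
using Tables

# Hypothetical helper: correlation between two named columns of any
# object that satisfies the Tables.jl interface.
function tablecor(tbl; x::Symbol, y::Symbol)
    cols = Tables.columns(tbl)  # column-accessible view of the table
    return cor(Tables.getcolumn(cols, x), Tables.getcolumn(cols, y))
end

# A vector of named tuples is already a valid Tables.jl row table.
rows = [(height = 1.0, weight = 2.0),
        (height = 2.0, weight = 4.1),
        (height = 3.0, weight = 5.9)]

# So is a named tuple of vectors (a column table).
coltable = (height = [1.0, 2.0, 3.0], weight = [2.0, 4.1, 5.9])

r_rows = tablecor(rows; x = :height, y = :weight)
r_cols = tablecor(coltable; x = :height, y = :weight)
```

Both calls should give the same value as `cor` applied to the two plain vectors, and no DataFrames.jl is involved.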

3 Likes

Maybe a good first step is to write function wrappers table(cor)(x) that only uses the Tables.jl API and see how far that gets?
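One possible shape for such a wrapper, as a sketch only: `table` here is a hypothetical helper, not an existing API, and it assumes every column is numeric so that `Tables.matrix` can build a matrix.

```julia
using Statistics
using Tables

# Hypothetical wrapper: lift a matrix-based function `f` so that it
# accepts any Tables.jl table and returns the column names alongside
# the result.
function table(f)
    return function (tbl; kwargs...)
        cols = Tables.columns(tbl)
        names = collect(Tables.columnnames(cols))
        values = f(Tables.matrix(tbl); kwargs...)  # rows = observations
        return (names = names, values = values)
    end
end

tbl = (a = [1.0, 2.0, 3.0, 4.0], b = [2.0, 4.0, 6.0, 8.5])
result = table(cor)(tbl)
```

Here `result.values` is the usual correlation matrix and `result.names` records which column each row and column of it refers to.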

Take a look at GLM.jl and similar packages. You express a statistical problem in a context with variable names which mean something. You do analysis of these variables and make plots with proper labels. You explain your effects, you test hypotheses, etc. I meant that if you try to do anything that is more than a couple of function calls on arrays without variable names, things will get messy with the current ecosystem we provide. R has nailed it with tables and scientific types for a while, and I’m advocating for a similar experience in Julia. If you want to stay low level with arrays, that is fine, but that is not the most productive path for statistical work.

2 Likes

Honest question, though slightly off-topic: How has R nailed “scientific types”?

My take on this is that some basic statistics (mean, std, cor,…) should probably be shipped with the Julia distribution (that is, be a stdlib). The reason is that doing std(x) probably is almost as common as doing cos(x). Providing this in a stdlib lowers the threshold. The rest can be split off.

…and yes, please don’t hook all stats functions to DataFrames or the like. It should all work with a traditional Julia vector/array.

31 Likes

I think that providing statistics API to work with tables is worthwhile, especially for doing interactive analyses, but I would prefer not to have any existing arrays functionality break for more low-level applications. For example, it seems that while arrays are more low-level and don’t carry metadata, they can be more versatile than tables, e.g. when doing statistics along multiple dimensions. I think that would be less intuitive to do in a table.
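To make the multiple-dimensions point concrete, here is a small sketch using only the Statistics stdlib; the array and its interpretation (subjects × conditions × trials) are made up for illustration.

```julia
using Statistics

# Made-up data: 2 subjects × 3 conditions × 4 trials.
A = reshape(collect(1.0:24.0), 2, 3, 4)

# Average over subjects (dim 1) and trials (dim 3) in one call,
# leaving one value per condition.
m = mean(A; dims = (1, 3))

# Drop the reduced singleton dimensions to get a plain vector.
per_condition = dropdims(m; dims = (1, 3))
```

Expressing the same reduction on a table would typically need a long-format layout plus a group-by step.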

12 Likes

For example, I am teaching a course where students have to learn about types too early to convert Matrix(table) and DataFrame(array) all over the place.

I also teach finance/econometrics courses on the MSc/PhD level, and what you mention is one of the reasons why my lectures stick to vectors/arrays. It also helps to connect to traditional theory which is often expressed in those terms.

9 Likes

It looks like Statistics.jl was already moved out of the Julia repo and into JuliaStats (on the main branch) [no later than] two days ago. Or am I confused?

https://github.com/JuliaLang/julia/pull/45597

1 Like

Looks like it! But AFAIU it hasn’t been de-stdlibified yet (stdlibs don’t have to live in the JuliaLang repo IIUC)

Yes, Statistics has been in a separate repo for a long time. It has just been moved from JuliaLang to JuliaStats and removed from the sysimage on Julia master. This doesn’t have many user-visible effects; the significant step will be removing it from the stdlib and registering it as a normal package.

1 Like

Ah, that explains it. That must be exactly what the PR about Statistics.version is for.

Anyway, one big advantage: I just opened a bug issue for Statistics. I don’t submit PRs to Julia often enough to go through the trouble. I just make a report. But now that Statistics is at least in its own repo, I’m much more inclined to also do the fix.

`Statistics.middle(Base.Slice(4:6))` fails · Issue #113 · JuliaStats/Statistics.jl · GitHub

3 Likes

What Statistics.jl functions should have a tabular interface? Those listed at Statistics · The Julia Language seem fundamentally vector-based. Take the simplest one, mean, for example: how does it make sense to pass a table to mean?

The dims argument would likely become obsolete after the new Slices type usage spreads. And it’s not really related to tables anyway.
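For comparison, a small sketch of the two styles using only the Statistics stdlib. Whether `dims` actually becomes obsolete is the speculation above, not something this example shows; it only shows that the slice-based form gives the same numbers.

```julia
using Statistics

A = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]

# dims-based reduction: returns a 1×2 matrix of column means.
by_dims = mean(A; dims = 1)

# Slice-based reduction: apply mean to each column slice,
# which yields a plain vector.
by_slices = map(mean, eachcol(A))
```

The slice-based form works for any function of a vector, not only those that accept a `dims` keyword.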

1 Like

The main thing I’ve seen come up repeatedly is cor(df), for which there are ways of doing it of course but none as convenient as df.corr() in pandas.

In my mind, tables as data structures have two nice properties for statistical work. You have variable names that you can use in formulas and to print more readable output. And you know that all your columns / rows are of the same length. So you save a bit of boilerplate to ensure that in every function.

However, as noted above, a lot of statistical functions don’t necessarily need same-length input. For example, I can compute a t-test between differently sized groups. So I don’t see what the table requirement gains here.
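As a sketch of that point, here is Welch's unequal-variance t statistic written with only the Statistics stdlib. The function name `welch_t` and the data are made up; a package such as HypothesisTests.jl would be the natural choice for real work, including p-values.

```julia
using Statistics

# Hypothetical helper: Welch's t statistic and the Welch–Satterthwaite
# degrees of freedom for two samples of possibly different lengths.
function welch_t(a::AbstractVector, b::AbstractVector)
    na, nb = length(a), length(b)
    va, vb = var(a) / na, var(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb)^2 / (va^2 / (na - 1) + vb^2 / (nb - 1))
    return (t = t, df = df)
end

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]  # 5 observations
group_b = [2.0, 4.0, 6.0]            # 3 observations

res = welch_t(group_a, group_b)
```

The two groups enter as separate vectors, so no common length (and hence no table) is required.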

Another thing to note is that R can play a little “trick” that we can’t, which is, it can use the variable names of run-of-the-mill vectors to print more readable output, using non-standard evaluation like this:

heights = c(1, 2, 3, 4, 5)
some_statistical_function(heights)
# could print "The output of some_statistical_function for heights is XXX"

We can only do that with macros; however, we certainly don’t want to macro-ify every statistical function out there. So I guess we just have to live with that.
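For what it's worth, a single generic macro can capture the expression text without touching the statistical functions themselves. This is a minimal sketch; `@labeled` is a hypothetical name, not something that exists in Statistics or StatsBase.

```julia
using Statistics

# Hypothetical macro: return the source text of an expression together
# with its value, so output can be labelled with the variable name.
macro labeled(ex)
    label = string(ex)  # captured at macro-expansion time
    return :(($label, $(esc(ex))))
end

heights = [1, 2, 3, 4, 5]
label, value = @labeled mean(heights)

println("The output of ", label, " is ", value)
```

The cost is that the caller has to opt in at each call site, which is the trade-off compared with R's non-standard evaluation.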

3 Likes

Speaking of Statistics, StatsBase.jl, … and all the packages living in JuliaStats: could someone in the organization take the lead and archive old packages that have not been touched in years, and for which there are no plans of revival?

I’ve done that recently in JuliaML and it was quite helpful to set a direction moving forward. There were tons of packages polluting the organization and making it difficult for beginners to navigate the current state of the ecosystem.

After archival, it would be great if we could take a joint look at JuliaStats, JuliaML, JuliaAI and all related orgs to find intersections and directions moving forward.

5 Likes

I would like to suggest we do not un’stdlibify Statistics until Julia v1.10,
because Julia v1.9 will be our first time un’stdlibifying anything.
It is good that this is being done first to a standard library that very few people use (DelimitedFiles.jl), and that even fewer people should use (CSV.jl is better for most purposes).

But Statistics.jl is used by almost everyone.
So if it turns out that the way un’stdlibification works goes wrong in some way
(idk how, but unknown unknowns), I would rather not have that happen to Statistics.jl.
Let’s let it get a proper working-through during 1.9, and then we can revisit this question?

25 Likes

I personally feel that delaying what has been delayed for years already is not a good path. This situation with Statistics and StatsBase.jl is there exactly because of this eternal delay.

2 Likes

I agree with what you say about Statistics.

However, I am fond of DelimitedFiles and wish it to stay afloat. It does a good job in many cases and is lightweight (in stark contrast to CSV.jl).

17 Likes