Request for un'stdlibfication of Statistics

juliohm · June 8, 2022, 3:35pm

I feel that many of us doing statistics and data science in Julia agree that the statistics ecosystem could be improved by a lot if we had more freedom to evolve Statistics, StatsBase.jl … to support Tables.jl tables instead of bare arrays, and ScientificTypes.jl instead of machine types. I personally never saw Statistics as a general purpose stdlib.

Any chance that we could move the Statistics stdlib out of Julia v2.x to let the people in the field manage the ecosystem moving forward?

The current situation, with packages that only support arrays with samples as columns vs. rows, packages that only support dataframes, and plotting libraries that only work with arrays, is really cumbersome for end-users with a statistics background. For example, I am teaching a course where students have to learn about types too early, just to convert with Matrix(table) and DataFrame(array) all over the place.

johnmyleswhite · June 8, 2022, 3:41pm

Is there an owner for these things being moved out of the stdlib?

carstenbauer · June 8, 2022, 3:41pm

Fwiw, see https://github.com/JuliaLang/julia/pull/45540#issuecomment-1147918438. The plan seems to be to do this even pre 2.0 in a non-breaking way.

juliohm · June 8, 2022, 3:43pm

Good question @johnmyleswhite , I don’t know.

juliohm · June 8, 2022, 3:48pm

I love to see this un’stdlibfication of specialized packages. These will be developed much more quickly by the community that is working with the functionality every day.

brenhinkeller · June 8, 2022, 3:52pm

This is being worked on already I do believe!
https://github.com/JuliaLang/julia/pull/45594
https://github.com/JuliaLang/julia/pull/44247

I am actually totally on board with this for Statistics, despite some baseline level of apprehension with this sort of thing, because reuniting StatsBase with Statistics is just worth it.

JeffreySarnoff · June 8, 2022, 3:52pm

The intent comports with the overall desire to slim down the stdlib count.

johnmyleswhite · June 8, 2022, 3:57pm

I am getting pretty nervous about the direction I see people pushing here, even though I would prefer to see Statistics moved out of Base some day.

There’s a contradiction between three things I’ve heard in the past:

1. There is no clear owner for things moving out of Base.
2. JuliaStats has too few developers.
3. The community will move faster once these things come out of Base.

Why is (3) true if (1) and (2) are true? They can’t all be right.

juliohm · June 8, 2022, 3:58pm

It is true because there exist people outside of Base and JuliaStats doing statistics in Julia.

This issue that JuliaStats has few people is a separate issue. I believe that many of us could be there but are not. I am in JuliaML, for example, and that work relates to the efforts we are discussing here. There I am working on TableTransforms.jl, TableDistances.jl and other efforts that provide a consistent API for Tables.jl tables and ScientificTypes.jl.

brenhinkeller · June 8, 2022, 4:05pm

That is a good point

juliohm · June 8, 2022, 4:07pm

I don’t know if it is. It ignores the existence of people outside of JuliaStats. The stats community in Julia is much larger than that.

brenhinkeller · June 8, 2022, 4:09pm

I’d also be very wary of tying what used to be a stdlib to the whole DataFrames.jl and Tables.jl ecosystem. It’s a great ecosystem, don’t get me wrong, but it’s an opinionated way of doing data science and can be a heavy dependency.

Also perhaps more to the point, if that’s the development goal, there’s actually no need to move anything out of base/stdlibs, because you can just import and extend the methods from Statistics, no need to worry about versioning, and it’s not type piracy because the DF/Tables ecosystem owns the DF/Tables types.
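
A minimal sketch of what that could look like, assuming a hypothetical ToyTable type that stands in for a type owned by a tables package (it is not part of Tables.jl or DataFrames.jl):

```julia
using Statistics

# Hypothetical table type owned by this example; a stand-in for a real
# table type from the DF/Tables ecosystem.
struct ToyTable{T<:NamedTuple}
    columns::T
end

# Adding a method to Statistics.mean for a type we own is not type piracy.
# Here the result is one mean per named column.
Statistics.mean(t::ToyTable) = map(mean, t.columns)

t = ToyTable((a = [1.0, 2.0, 3.0], b = [4.0, 6.0, 8.0]))
column_means = mean(t)   # NamedTuple with fields a and b
```

The method lives entirely in the package that owns ToyTable, so nothing in Statistics itself has to change for this to work.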

juliohm · June 8, 2022, 4:25pm

And that is fine. I think that communities are built around opinion and experience doing things. DataFrames.jl and Tables.jl are lovely, especially if you are coming from a statistics background, from R, etc.

Disagree. The need is real and we are stuck without any advances because we can’t radically change Statistics at this point, merge it with StatsBase.jl etc.

nalimilan · June 8, 2022, 4:28pm

To me the main goal of moving Statistics out of the stdlib is to fix this absurd divide between Statistics and StatsBase. It will also allow using new features and fixes in older Julia versions, like any other package. But I don’t expect it to make development magically faster than the current pace in StatsBase.

Anyway I’m not sure how this move impacts the ability to support Tables.jl and ScientificTypes.jl in Statistics. We could have done so in StatsBase if we wanted. Could you be more specific @juliohm?

brenhinkeller · June 8, 2022, 4:28pm

I totally agree it should be re-merged with StatsBase, though that’s more of an aesthetic opinion than anything else. I’d be wary about other “radical” changes.

juliohm · June 8, 2022, 4:31pm

Statistics cannot depend on 3rd-party packages such as Tables.jl and ScientificTypes.jl. I don’t know if I got the question right.

Regarding StatsBase.jl, StatsAPI.jl, etc… I think these could all be restructured with lots of breaking changes to support tables. These breaking changes will lead to a consistent API that users need to adhere to. Right now no one is willing to adhere to a new API when there exists a stdlib that is more “official”. We really need to radically clean up all these efforts without concerns of backward compatibility, support for super old Julia versions, etc.

nalimilan · June 8, 2022, 4:33pm

Right, but StatsBase can, and we haven’t done it (I’m not aware of any request to do so).

Can you give some examples?

brenhinkeller · June 8, 2022, 4:33pm

Why would it need to? Why can’t you, as discussed above, import Statistics.jl functions in an external package that’s managed by the DF/Tables ecosystem and extend those functions for Tables.jl and ScientificTypes.jl types there?

juliohm · June 8, 2022, 4:41pm

Many functions in Statistics and StatsBase.jl assume arrays and provide keyword arguments for sample dimension like dims=1, many plotting libraries assume arrays as inputs and provide keyword arguments such as labels=..., and there are many other examples of keywords that would be rendered useless with tables. The point I am trying to make is that the experience could be much nicer if the whole ecosystem for statistics was designed from scratch for tabular data.

Yes, we could maintain methods for arrays, but who is going to maintain those? Do we really need them when in practice most statistical tasks deal with tables (arrays + variable names)? Arrays are just not the right representation for statistical work, they are too low-level. That is opinion, of course, but we need opinion sometimes to converge and provide a consistent experience for end-users. Otherwise we will always provide packages that feel disconnected. Again, I am mainly concerned with end-users, beginners, first-time users of Julia for statistics, etc. Let’s not forget that we can cope with this API variability, but beginners will struggle.
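
To illustrate the contrast, here is a small sketch using only the Statistics stdlib. The row-per-sample layout and the column names height and weight are assumptions made up for this example:

```julia
using Statistics

# Assumed layout: rows are samples, columns are variables.
X = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

# With a bare array, the caller has to say which dimension holds the samples.
per_variable = mean(X, dims=1)   # 1×2 matrix, one mean per column
per_sample   = mean(X, dims=2)   # 3×1 matrix, one mean per row

# With named columns, the variables are explicit and no dims keyword is needed.
cols = (height = X[:, 1], weight = X[:, 2])
named_means = map(mean, cols)    # NamedTuple keyed by variable name
```

The array version only gives the right answer if the caller knows the layout convention, whereas the named version carries the variable names along with the result.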

juliohm · June 8, 2022, 4:45pm

I can, but that is too noisy and confusing, just to get a name like Statistics.mean into scope for extension. I’d rather work in a cohesive ecosystem developed by statisticians than try to amend the current situation with the stdlib. Statistics is not the job of Julia core devs; they have other priorities. This stdlib just increases sysimage size without major benefits.
