Julia stats, data, ML: expanding usability

I did not talk about channels. I was talking about the problem of writing any device that creates a sequence of vectors by whatever protocol (an iterator over vectors, writing into a matrix, returning a matrix) so that it can be used where it is needed, efficiently and comfortably. There is not even an agreed solution to that.

Not to mention records, which aren’t vectors…
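To make the mismatch concrete, here is a minimal sketch of the three protocols for the vector case (all names are hypothetical; the point is that nothing ties them together):

using Random: rand!

# (a) lazily iterate over freshly allocated vectors
sample_iter(n, d) = (rand(d) for _ in 1:n)

# (b) write into a caller-provided matrix, one sample per column
function sample!(out::AbstractMatrix)
    for j in axes(out, 2)
        rand!(view(out, :, j))
    end
    return out
end

# (c) allocate and return the whole matrix at once
sample_matrix(n, d) = rand(d, n)

A consumer written against one of these cannot use the other two without glue code.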

I take your point. Perhaps I am commenting in the wrong thread here, as I’m not a data scientist. For me as an ecologist, R and Julia are at opposite ends of the spectrum: R provides all standard ecological analyses and then some, in well-tested and trusted packages or in base, but I avoid using it where I can. In Julia, by contrast, I actually enjoy building my own solutions.

Anyway, my comments were in reference to GLM.jl not doing anything new and exciting. I guess my perspective is limited in this regard as I’m not a data scientist. For me, getting a linear model by typing lm(y ~ x + z) is fine; I’m not sure what extra excitement should be provided.
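(For concreteness, that workflow in GLM.jl, on simulated data with made-up variable names:)

using DataFrames, GLM

df = DataFrame(x = randn(100), z = randn(100))
df.y = 1 .+ 2 .* df.x .- df.z .+ 0.1 .* randn(100)
m = lm(@formula(y ~ x + z), df)
coef(m)  # intercept and slopes, roughly [1, 2, -1]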

In any case, I think it’s a good point that we shouldn’t focus only on “what’s missing here that other languages have” but also on “what’s missing elsewhere that can set Julia apart”.

7 Likes

I personally think that “standard analyses” are usually misguided in science, but they can be useful in certain environments, like regulatory compliance or process monitoring. If you are measuring pollutants in water or air, daily or hourly, and you want to show that you are meeting some requirement, then obviously there is some standard thing you need to run repeatedly. Monitoring in general has that flavor. It’s similar with, say, sampling parts on a production line, detecting illegal content on a public website, or filtering spam in your email.

But when it comes to science, where you are trying to understand a process not under your control, you have to build models of the process, usually dynamic models: ODEs, agent-based models, discrete-time models, spatial point processes, or whatnot. Then you do inference for that model. This point of view argues for a toolbox of high-quality tools with inherent speed, which is where Julia excels.

This is a good point but I think requires a bit of nuance.

Since in Julia standardization comes from method overloading (which is true in R to some extent, of course), broom is a solution to a problem that Julia theoretically shouldn’t have. In theory, all OLS-related modeling packages should conform to the StatsModels.jl API. GLM.jl, FixedEffectModels.jl, and Econometrics.jl all do, which is good: coef, stderror, etc. all work the same across packages. What we are missing is actually putting the results into a table of some kind, so you have to learn a bunch of methods instead of just querying a table you already know how to work with.
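A minimal sketch of that missing broom-style step, built only on the generic accessors (the name tidy is hypothetical, borrowed from broom):

using DataFrames, GLM

# works for any model implementing coefnames/coef/stderror
tidy(m) = DataFrame(term = coefnames(m),
                    estimate = coef(m),
                    std_error = stderror(m))

m = lm(@formula(y ~ x), DataFrame(x = randn(50), y = randn(50)))
tidy(m)  # a plain DataFrame you can query like any other table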

In practice, there may be gaps: perhaps some packages are not fully compatible with the StatsModels.jl API. If that is the case, they should be fixed. I don’t run enough regressions in Julia at the moment to have good knowledge of these gaps, though.

But integration with CovarianceMatrices.jl is incomplete: you can’t modify a model after the fact to give it the standard errors you want while still preserving compatibility with the full StatsModels.jl API. There is discussion of this here, which has stalled.

This missing link is pretty important. It means that you can’t print regression tables with RegressionTables.jl with custom standard errors.

If someone has a motivated RA and the knowledge to oversee this work, working through this integration would be really beneficial.

I couldn’t agree more! However, people have to keep reviewers and co-authors happy. Also, Julia should be accessible to those just learning stats for the first time. And there are many other reasons why someone may want to have access to standard analyses for their field.

2 Likes

This is the crux of the issue, is it not? There have been some great posts from a variety of perspectives on how to move the ecosystem forward, but only limited maintainer resources to implement them. Trying to do everything is not feasible and more likely to result in disappointment across the board, so some kind of prioritization is required.

Lest anyone think the struggle is limited to stats/data science, let me say we have similar troubles with conflicting priorities on the deep learning/differentiable programming side of things. For example, do we:

  1. Try to support more flexible AD, enabling a wider range of workflows (some of which are novel and not well supported in other languages)?
  2. Try to improve the performance of existing libraries to attract more folks from “mainstream” ML/DL? Even this can be further subdivided into horizontal vs. vertical scaling, and latency (e.g. time to first gradient) vs. throughput (e.g. GPU kernel performance).

The big DL frameworks have an easier time of this because their goals are clear: whatever the big corporate users want is probably going into the framework. This works out well because said users are willing to finance development work (sometimes to the tune of millions) to achieve their ends. Conversely, this is also why frameworks from different organizations share almost no common functionality and don’t interoperate with each other! Replicating the good parts of this model in Julia land is hard because of the chicken-and-egg phenomenon others have discussed.

All that said, I think the explosive growth of SciML has shown that it is possible to pull off “we want X and we will give you the people/money you need to do it” without creating your own island. I’m not sure whether it’s realistic to expect every domain ecosystem to follow the same path, but I feel this is a tangible success story to draw on given many of the proposals thus far have been (necessarily) abstract.

5 Likes

I think I never fully understood this problem. Why can’t one create a new, say, regression model that contains the adjusted estimates?

Maybe the issue is that the current abstraction for regression models does not separate the model from the estimator, and the model from the estimates, and the example with CovarianceMatrices is an instance of that. Designing an abstraction that deals with all of this isn’t easy, however.

1 Like

I don’t think it’s a problem. Making a new regression model with the adjusted estimates is the correct path forward, I think. It just hasn’t been done, but doing so would increase interoperability a lot.
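For illustration, a rough sketch of what such a wrapper could look like, built on the generic StatsAPI stubs (the AdjustedModel type is hypothetical and many methods are omitted):

using LinearAlgebra: diag
import StatsAPI

# a fitted model plus a replacement covariance matrix; everything
# except vcov/stderror is delegated to the parent fit
struct AdjustedModel{M<:StatsAPI.RegressionModel} <: StatsAPI.RegressionModel
    parent::M
    vcov::Matrix{Float64}
end

StatsAPI.coef(m::AdjustedModel) = StatsAPI.coef(m.parent)
StatsAPI.coefnames(m::AdjustedModel) = StatsAPI.coefnames(m.parent)
StatsAPI.dof_residual(m::AdjustedModel) = StatsAPI.dof_residual(m.parent)
StatsAPI.vcov(m::AdjustedModel) = m.vcov  # the adjusted estimates
StatsAPI.stderror(m::AdjustedModel) = sqrt.(diag(m.vcov))

The hard part, as discussed above, is making every downstream consumer (RegressionTables.jl etc.) accept such a wrapper.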

1 Like

Reading through the slide deck and the discussion here, I was struck by how many of these issues I’d run into myself. In my case, I’ve been at least tinkering with Julia for the better part of a decade, since the very early days, and so puzzling out these kinds of interface issues didn’t seem like such a big deal. But reflecting on it, I can see how these “minor” difficulties could actually be a huge block for people, especially inexperienced users, trying to pick up Julia or make the switch from another language.

Thinking through this, I sketched out a diagram arranging people on two axes. One axis is how well they can accomplish their analysis goals in R, Python, or some other tool; the other is how willing they are to be an early adopter and to work through the bugs, inconsistent interfaces, and poor documentation that may entail. The size of each circle in the diagram indicates how many people fall into that category:

  • People in group A are spending their time developing and improving R/Python. They may try out Julia out of curiosity, but most won’t be that motivated to contribute.
  • People in group B are the most likely to pick up, and contribute to, Julia. Most of the people in this forum fall in group B.
  • People in group C have no strong reason to pick up Julia. (If they are students, say, and their professor teaches a stats course using Julia, they may continue to use it…although they could also be discouraged by the confusing interfaces, poor documentation, etc. mentioned above, and switch to more popular tools like R.)
  • People in group D should be using Julia. However, they are also the most likely to be discouraged by interface and documentation issues. (I also suspect a significant number of people in group D think they’re actually in group C–these folks were the target audience for this talk I gave a few months ago.)

I don’t know the true numbers in each group, but my intuition says that group D is where we’ll recruit new developers from, as they learn Julia, gain proficiency, and move from D to B. If this is the case, fixing some of these “minor” interface issues could have an outsize payoff for the Julia stats community and ecosystem down the line. In discussing these questions (e.g., whether to do a quick fix to make all stats functions accept Tables, or to wait for the Next Great Interface For Statistics to emerge that takes full advantage of Julia’s capabilities), there’s not actually a conflict: doing the former may well help get us to the latter faster.

8 Likes

From my perspective (economist typically working with structural econometric models) the point where Julia can shine (compared to R, Python, Stata, Matlab, Fortran i.e. the tools that economists tend to use) is to bring data cleaning and descriptive analysis, estimation of linear models, estimation of “structural” models, and simulation of these models all into one environment. The other languages tend to do well on one of these tasks, but none is good at all of them. In my view, Julia isn’t great on the data cleaning side yet, but there are smart people working on it, so we’re going to get there.

Since part of this thread is about “what can Julia add”, some thoughts: I find myself using tabular data types less and less with Julia. When we run regressions, the data is structured into observations, not rows or columns; these observations are drawn from a population about which we want to learn. Knowing the population characteristics, I would be able to sample from it, or simulate outcomes that depend on them. A more complete abstraction would bring all these elements together, and thereby avoid useless code such as “fill estimates into objects used for simulation” and “construct tabular data containing all observations from here and there”, etc. But again, designing such an abstraction isn’t easy, and even if it existed, one may not want to force it on a new potential user who just wants to run a linear regression.

5 Likes

That’s along the lines of Julia stats, data, ML: expanding usability - #38 by dlakelan

Plugging it again, but something like this? Data Access Pattern — MLDataUtils.jl v0.1 documentation

1 Like

I’ve been lightly following this thread and followed the link to MLDataUtils.jl. I thought: “hey that’s a pretty good interface design! Why didn’t we have that a few years ago?!”
Fast-forward to a few minutes later when I get to the bottom and see my name :sweat_smile: lol

7 Likes

I didn’t have time to read all the comments in detail, but I am jumping in to add that we could certainly exploit ScientificTypes.jl more in the DS ecosystem. This would enable better defaults everywhere, e.g. colorbars for categorical variables in plotting packages, and better treatment of “fancy” columns in a dataset that are not necessarily made of <:Number entries.
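For example, a small sketch of the kind of default this would enable (the branching logic in the last comment is hypothetical):

using ScientificTypes, DataFrames

df = DataFrame(length = [2.1, 3.4, 2.8], species = ["a", "b", "a"])
schema(df)  # scientific types per column: Continuous, Textual

dfc = coerce(df, :species => Multiclass)  # declare it categorical
elscitype(dfc.species) <: Multiclass      # true; a plotting package could
                                          # branch on this to pick a palette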

Would it be a good idea to file these specific issues (quoted below) in the relevant repos so that they can serve as starting points? We can then advertise these on our JSOC page and so on.

At the very least, it feels like we want Tables.jl to be used consistently across the statistics ecosystem.
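As a small illustration of how cheap a first step can be, here is a hypothetical shim that makes a column-wise statistic accept any Tables.jl table:

using Tables, Statistics

# works for DataFrames, CSV.File, named tuples of vectors, ...
# anything implementing the Tables.jl interface
colmeans(table) = map(mean, Tables.columntable(table))

colmeans((a = [1, 2, 3], b = [4.0, 5.0, 6.0]))  # (a = 2.0, b = 5.0)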

5 Likes

@nalimilan has started the process of trying to clean things up. We need to start from the low-level API and then gradually build on it. By low level I mean:

  1. deciding on the design/split of fundamental functionalities between Statistics.jl and StatsBase.jl
  2. deciding on a uniform way to handle weights across the whole ecosystem
  3. deciding on a uniform way to handle missing values across the whole ecosystem (a small illustration follows this list)
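On point 3., even base Statistics already forces a choice, and each package currently makes that choice on its own:

using Statistics

x = [1.0, missing, 3.0]
mean(x)               # missing: missings propagate by default
mean(skipmissing(x))  # 2.0: skipping must be requested explicitly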

There are working proposals for 2. and 3. (for sure on Slack; I think they are not yet written down in any permanent place). For 1. the question is quite hard: moving things to Statistics.jl makes them much more rigid, as they have to follow the Julia release process, including not being able to make breaking changes until Julia 2.0 even if they were needed; on the other hand, Statistics.jl is present in the Julia distribution, so we cannot just ignore it and how things currently work there.

Some relevant discussions are: https://github.com/JuliaLang/Statistics.jl/issues/87, https://github.com/JuliaLang/Statistics.jl/issues/88, https://github.com/JuliaStats/StatsBase.jl/pull/723

The Tables.jl issue must be addressed at some point, for sure. However, Statistics.jl currently has an array-based design and, as commented earlier, we have to embrace it somehow, since it has to stay this way until Julia 2.0.

2 Likes

Fully agree with all these points. May I suggest trying to put these drafts together into a next-generation package, Stats.jl, that is independent of Julia’s release cycle?

This Stats.jl package could be the de facto standard for doing stats with Tables.jl + TableOperations.jl + missing + weights + DataAPI.jl + … and people could be redirected to this dependency in the future instead of the status quo. Notice that this proposal has a slightly different goal than StatsKit.jl: the proposal here is a package that could serve as a replacement for the current situation with Statistics + StatsBase.jl + Missings.jl + …

Personally, I feel that Statistics could be removed from the stdlib in Julia 2.0. Statistics is a whole field that is constantly evolving; its release process is not compatible with the release process of a programming language.

1 Like

This is exactly why this decision is hard. AFAICT Julia 2.0 is very, very far away, so until then we have Statistics.jl (even if it were decided it should be removed at some point, which I fear is unlikely: there are probably millions of lines of Julia code relying on Statistics.jl being shipped). Right now it defines e.g. mean, var, std, cor, cov, and quantile, and you have a tension:

  • If you define e.g. cor in a completely new way, there is confusion between the cor from the base package and the cor from the extra package; I think we should not introduce such a discrepancy, as it would confuse users.
  • If you extend cor in this extra package, you have to provide an API consistent with cor in base (and I tend to think this is the way we have to do things).

A quick example of a dispatch problem we hit (of course it can be cleaned up):

julia> using Statistics

julia> using StatsBase

julia> x = rand(10, 2)
10×2 Matrix{Float64}:
 0.688046   0.394595
 0.70207    0.0503035
 0.197407   0.602314
 0.109709   0.780435
 0.589199   0.244327
 0.0305052  0.997567
 0.823048   0.943819
 0.414695   0.377289
 0.244565   0.0081141
 0.783279   0.220797

julia> y = rand(10)
10-element Vector{Float64}:
 0.24271476162279826
 0.43123408031832944
 0.9129314010492031
 0.9791403410757626
 0.23835327154290686
 0.7223690508308906
 0.8964953524119175
 0.806587823922122
 0.12234394459347597
 0.35602206859920527

julia> cor(x, y)  # y treated as data: correlation of each column of x with y
2×1 Matrix{Float64}:
 -0.3631563305717558
  0.7731319341269683

julia> cor(x, Weights(y))  # y treated as weights: weighted correlation matrix of the columns of x
2×2 Matrix{Float64}:
  1.0       -0.326363
 -0.326363   1.0

julia> cor(x[:, 1], y)  # two vectors: plain correlation
-0.36315633057175584

julia> cor(x[:, 1], Weights(y))  # Weights(y) treated as plain data again: same result as above
-0.36315633057175584

Even if we decided to dump Statistics.jl, we would still have plain Julia Base, which defines e.g. sum, and consistent handling of it would have to be resolved somehow (and I bet no one will remove sum from Julia Base).

Wouldn’t it suffice to define new cov etc. and not export them, expecting users of this new package to always call Stats.cov? This would be a completely independent function, not an extension of the Base methods. That way things are completely separate, and development can happen independently of Julia Base. I would be happy to run s/cov/Stats.cov/g on all my codebases and adjust accordingly if Stats.jl works nicely with Tables.jl.
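A sketch of what that would look like (module name and method body purely illustrative):

module Stats

using Statistics: mean

# a *new* function, deliberately not extending Statistics.cov
function cov(x::AbstractVector, y::AbstractVector)
    mx, my = mean(x), mean(y)
    sum((xi - mx) * (yi - my) for (xi, yi) in zip(x, y)) / (length(x) - 1)
end

end

Stats.cov(rand(10), rand(10))  # always qualified; no clash with Statistics.cov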

Yes, it would suffice, but it would be a little disappointing to have both Statistics.cov and Stats.cov. In an ideal world we would have one generic cov function that the whole ecosystem can use. Not to mention Stats.cov is a lot less pretty than cov. :slight_smile:

Notice that this is not the final solution. It is an intermediate state where things are developed separately; then Statistics is removed in Julia 2.0 (if that happens), and finally Stats.jl exports whatever it needs.

I understand. However, some functions in Statistics are being used in other places in Base, so I would just make those functions internal and let the statisticians develop Stats.jl on their own. :man_shrugging:

Again, this is all very disruptive, but necessary in my opinion. The fact that Statistics has mean and cov but doesn’t have mode, skewness, and other common statistics is kind of super ad hoc and weird.