Julia stats, data, ML: expanding usability

I did not talk about channels, just about the problem of writing any device that creates a sequence of vectors by whatever protocol (an iterator over vectors, writing into a matrix, returning a matrix) so that they can be used where they are needed efficiently and comfortably. There is not even an agreed solution to that.

Not to mention records, which aren’t vectors…

I take your point. Perhaps I am commenting in the wrong thread here, as I’m not a data scientist. As an ecologist, R and Julia are at opposite ends of the spectrum for me. R provides all standard ecological analyses and then some, in well-tested and trusted packages or in base, but I avoid using it where I can. In Julia, by contrast, I actually enjoy building my own solutions.

Anyway, my comments were in reference to GLM.jl not doing anything new and exciting. I guess my perspective is limited in this regard, as I’m not a data scientist. For me, getting a linear model by typing `lm(y ~ x + z)` is fine. I’m not sure what extra excitement should be provided.

In any case, I think it’s a good point that we shouldn’t just focus on “what’s missing here that other languages have” but also on “what’s missing elsewhere that can set Julia apart”.

6 Likes

I personally think that “standard analyses” are usually misguided in science, but can be useful in certain environments, like regulatory compliance or process monitoring. If you are measuring pollutants in water or air, daily or hourly or something, and you want to show that you are meeting some requirements then obviously there is some standard thing you need to repeatedly run. Monitoring in general has that flavor. It’s similar with say sampling parts on a production line or detecting illegal content on a public website, or spam in your email.

But when it comes to science, where you are trying to understand a process not under your control, you have to build models of the process, usually dynamic models such as ODEs, agent-based models, discrete-time models, or spatial point processes, and then do inference for that model. This point of view argues for a toolbox of high-quality tools with inherent speed. This is where Julia excels.

This is a good point but I think requires a bit of nuance.

Since in Julia standardization comes from method overloading (which is true in R to some extent, of course), broom is a solution to a problem that Julia theoretically shouldn’t have. In theory, all OLS-related modeling packages should conform to the StatsModels.jl API. GLM.jl, FixedEffectModels.jl, and Econometrics.jl all do that, which is good: `coef`, `stderror`, etc. all work the same across packages. We are still missing a way to actually put results into a table of some kind, though, so you have to learn a bunch of methods instead of just querying a table you already know how to work with.
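To make that concrete, here is a minimal sketch of the shared accessor API using GLM.jl; the same generics should work on any package conforming to the StatsModels.jl API:

```julia
using DataFrames, GLM

df = DataFrame(x = randn(100), z = randn(100))
df.y = 1 .+ 2 .* df.x .- 0.5 .* df.z .+ 0.1 .* randn(100)

m = lm(@formula(y ~ x + z), df)

# These generics are defined once in StatsAPI/StatsBase, so they work the
# same way on GLM.jl, FixedEffectModels.jl, Econometrics.jl, etc.:
coef(m)       # point estimates
stderror(m)   # standard errors
confint(m)    # confidence intervals
coeftable(m)  # a CoefTable summary, the closest thing to a results table
```

If I recall correctly, `CoefTable` supports the Tables.jl interface in recent StatsBase versions, so something like `DataFrame(coeftable(m))` should give a plain queryable table, but that is exactly the kind of glue that is easy to miss.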

In practice there may be gaps. Perhaps some packages are not fully compatible with the StatsModels.jl API. If that is the case, they should be fixed. I don’t do enough regressions in Julia at the moment to have a good knowledge of these gaps, though.

But integration with CovarianceMatrices.jl is incomplete. You can’t modify a model after the fact to give it the standard errors you want while still preserving compatibility with the full StatsModels.jl API. There is discussion on this here, which has stalled.

This missing link is pretty important: it means you can’t print regression tables with custom standard errors via RegressionTables.jl.
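To illustrate the gap, a rough sketch, assuming the `vcov(estimator, model)` call form from recent CovarianceMatrices.jl versions; `setvcov` below is hypothetical, not a real function:

```julia
using GLM, CovarianceMatrices

# (reusing df from the sketch above)
m = lm(@formula(y ~ x + z), df)

# Robust standard errors can be computed alongside the model...
se = stderror(HC1(), m)   # heteroskedasticity-robust SEs

# ...but they live outside the fitted model. There is no supported way to do
# something like
#     m2 = setvcov(m, HC1())   # hypothetical!
# and have coeftable(m2), confint(m2), and downstream packages such as
# RegressionTables.jl all pick up the new covariance estimate.
```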

If someone has a motivated RA and the knowledge to oversee this stuff, working through this integration would be really beneficial.

I couldn’t agree more! However, people have to keep reviewers and co-authors happy. Also, Julia should be accessible to those just learning stats for the first time. And there are many other reasons why someone may want to have access to standard analyses for their field.

2 Likes

This is the crux of the issue, is it not? There have been some great posts from a variety of perspectives on how to move the ecosystem forward, but only limited maintainer resources to implement them. Trying to do everything is not feasible and more likely to result in disappointment across the board, so some kind of prioritization is required.

Lest anyone think the struggle is limited to stats/data science, let me say we have similar troubles with conflicting priorities on the deep learning/diffprog side of things. For example, do we:

  1. Try to support more flexible AD to support a wider range of workflows (some of which are novel and not well supported in other languages)?
  2. Try to improve the performance of existing libraries to attract more folks from “mainstream” ML/DL? Even this can be further subdivided into horizontal vs vertical scaling and latency (e.g. time to first gradient) vs throughput (e.g. GPU kernel perf).

The big DL frameworks have an easier time of this because their goals are clear: whatever the big corporate users want is probably going into the framework. This works out well because said users are willing to finance development work (sometimes to the tune of millions) to achieve their ends. Conversely, this is also why frameworks from most organizations share almost no common functionality and don’t interoperate with each other! Replicating the good parts of this model in Julia land is hard because of the chicken-and-egg phenomenon others have discussed.

All that said, I think the explosive growth of SciML has shown that it is possible to pull off “we want X and we will give you the people/money you need to do it” without creating your own island. I’m not sure whether it’s realistic to expect every domain ecosystem to follow the same path, but I feel this is a tangible success story to draw on given many of the proposals thus far have been (necessarily) abstract.

4 Likes

I think I never fully understood this problem. Why can’t one create a new, say, `RegressionModel` that contains the adjusted estimates?

Maybe the issue is that the current abstraction for regression models does not separate the model from the estimator, and the model from the estimates, and the example with CovarianceMatrices is an instance of that. Designing an abstraction that deals with all of this isn’t easy, however.
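For concreteness, here is a minimal sketch of what separating those three concerns could look like; every name in it is hypothetical, not an existing API:

```julia
# Separate the statistical model, the estimation procedure, and the results.

abstract type Model end          # the statistical model (e.g. y = Xβ + ε)
abstract type Estimator end      # how to estimate it (OLS, robust vcov, ...)

struct LinearModel <: Model
    formula   # e.g. @formula(y ~ x + z)
end

struct OLS <: Estimator end

struct Estimates{M<:Model,E<:Estimator}
    model::M
    estimator::E
    coef::Vector{Float64}
    vcov::Matrix{Float64}
end

# Swapping the covariance estimator then means constructing a new Estimates
# with the same model and coefficients but a different vcov, rather than
# mutating a monolithic fitted-model object.
```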

1 Like

I don’t think it’s a problem. Making a new regression model with the adjusted estimates is the correct path forward, I think. It just hasn’t been done, but doing so would increase interoperability a lot.

1 Like

Reading through the slide deck and the discussion here, I was struck by how many of these issues I’d run into myself. In my case, I’ve been at least tinkering with Julia for the better part of a decade, since the very early days, and so puzzling out these kinds of interface issues didn’t seem like such a big deal. But reflecting on it, I can see how these “minor” difficulties could actually be a huge block for people, especially inexperienced users, trying to pick up Julia or make the switch from another language.

Thinking through this, I sketched out this diagram, arranging people on two axes. One is how well they can accomplish their analysis goals in R, Python, or some other tool. The other is how willing they are to be an early adopter, and to work through the bugs, inconsistent interfaces, and poor documentation that may entail. The size of each circle in this diagram indicates how many people fall in that category:

  • People in group A are spending their time developing and improving R/Python. They may try out Julia out of curiosity, but most won’t be that motivated to contribute.
  • People in group B are the most likely to pick up, and contribute to, Julia. Most of the people in this forum fall in group B.
  • People in group C have no strong reason to pick up Julia. (If they are students, say, and their professor teaches a stats course using Julia, they may continue to use it…although they could also be discouraged by the confusing interfaces, poor documentation, etc. mentioned above, and switch to more popular tools like R.)
  • People in group D should be using Julia. However, they are also the most likely to be discouraged by interface and documentation issues. (I also suspect a significant number of people in group D think they’re actually in group C–these folks were the target audience for this talk I gave a few months ago.)

I don’t know the true numbers in each group, but my intuition says that group D is where we’ll recruit new developers from, as they learn Julia, gain proficiency, and move from D to B. If this is the case, fixing some of these “minor” interface issues could have an outsize payoff for the Julia stats community and ecosystem down the line. In discussing these questions (e.g., whether to do a quick fix to make all stats functions accept Tables, or to wait for the Next Great Interface For Statistics to emerge that takes full advantage of Julia’s capabilities), there’s not actually a conflict, and doing the former may actually help get us to the latter faster.

7 Likes

From my perspective (economist typically working with structural econometric models) the point where Julia can shine (compared to R, Python, Stata, Matlab, Fortran i.e. the tools that economists tend to use) is to bring data cleaning and descriptive analysis, estimation of linear models, estimation of “structural” models, and simulation of these models all into one environment. The other languages tend to do well on one of these tasks, but none is good at all of them. In my view, Julia isn’t great on the data cleaning side yet, but there are smart people working on it, so we’re going to get there.

Since part of this thread is about “what can Julia add”, some thoughts: I find myself using tabular datatypes less and less with Julia. When we run regressions, the data is structured into observations, not rows or columns; these observations are drawn from a population about which we want to learn. Knowing the population characteristics, I would be able to sample from it, or simulate outcomes that depend on them. A more complete abstraction would bring all these elements together, and thereby avoid useless code such as “fill estimates into objects used for simulation” and “construct tabular data containing all observations from here and there” etc. But again, designing such an abstraction isn’t easy, and even if it existed, one may not want to force it upon a new potential user who just wants to run a linear regression.

4 Likes

That’s along the lines of Julia stats, data, ML: expanding usability - #38 by dlakelan

Plugging it again, but something like this? Data Access Pattern — MLDataUtils.jl v0.1 documentation
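For those who haven’t clicked through, a minimal sketch of that pattern, assuming the `nobs`/`getobs` generics from MLDataUtils/LearnBase:

```julia
using MLDataUtils

X = rand(4, 100)   # features as a matrix, observations in the last dimension
y = rand(100)      # targets

nobs((X, y))       # 100; a tuple is treated as parallel data containers
getobs((X, y), 3)  # the third observation from both X and y

# Anything implementing nobs/getobs plugs into the same utilities:
(xtrain, ytrain), (xtest, ytest) = splitobs((X, y), at = 0.7)
```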

1 Like

I’ve been lightly following this thread and followed the link to MLDataUtils.jl. I thought: “hey that’s a pretty good interface design! Why didn’t we have that a few years ago?!”
Fast-forward to a few minutes later when I get to the bottom and see my name :sweat_smile: lol

5 Likes

I didn’t have time to read all the comments in detail, but I am jumping in to add that we could certainly exploit ScientificTypes.jl more in the DS ecosystem. This would enable better defaults everywhere, e.g. colorbars for categorical variables in plotting packages, and better treatment of “fancy” columns in a dataset that are not necessarily made of `<:Number` entries.
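A quick sketch of what I mean, using the `schema`/`coerce` API from ScientificTypes.jl (the plotting behavior in the last comment is hypothetical):

```julia
using ScientificTypes, DataFrames

df = DataFrame(age   = [25, 40, 31],
               group = ["a", "b", "a"])

schema(df)   # shows machine types (Int64, String) vs. scientific types

# Declare that :group is categorical rather than free-form text:
df = coerce(df, :group => Multiclass)

# A plotting package could then dispatch on scitype: Multiclass columns get
# a discrete legend, Continuous ones a gradient colorbar, and so on.
```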