Julia stats, data, ML: expanding usability

Yes, and there’s a strong frequentist slant to the OP and this whole thread, but I thought that was intentional.

We can meet or marginally exceed R by standardizing tabular/matrix-type interfaces, or we can rethink the whole thing and do something that unifies interfaces for PPL, complex stats, and ML.

I’d prefer the latter, but it could be disorienting for conventional R users.

The good thing is that the ML/Flux folks (@dhairyagandhi96 @darsnack @ToucheSir Lorenz, @oxinabox et al.) have already done a lot of work on flexible and fast abstract data interfaces, so any new effort just needs to build on that. This includes the earlier efforts by @tbreloff on the aforementioned JuliaML.

1 Like

That’s great! Assuming your PR is accepted (there’s a long list sitting there for months on end), it would have taken about three years and a “tangent” reply to this Topic to get things started.

In the slides shared by @mit.edelman to talk about “expanding usability” and “getting a conversation started”, we can read:

In order for Julia Statistics to supersede R and Python for data analysis, it also needs to have packages for the most popular tasks.

I hope we all agree that no meaningful superseding is going to take place if the “most popular tasks” have been returning p-values higher than one for about three years… and maybe more. I don’t think a conversation about core packages being very lightly maintained, if at all, is “tangent” to expanding usability; quite the opposite.

However, though I appreciate you might not want to talk about it in this Topic, perhaps it is best not to discourage others from engaging in a conversation, in this Topic, on how to improve the quality of the Julia ecosystem to facilitate adoption.

One possibility is to reach out to academic institutions all over the world for them to adopt and maintain core packages and analytical tools. This would not only help keep these core packages properly updated, but would also encourage students and faculty at those institutions to use the language.

1 Like

I would like to add a few thoughts. My background is that I am a long-time finance/econometrics teacher.

It is indeed important that the different packages of the Julia stats ecosystem can speak to each other. This should not be too hard to accomplish with some conversion utilities (matrix → table, table → df, df → matrix, etc.), which I believe already exist. In this scenario the only standardization needed is to ask all packages to accommodate these utilities.
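For instance, a minimal sketch of such round trips (assuming Tables.jl and DataFrames.jl; the generated column names are just the Tables.jl defaults):

using DataFrames, Tables

mat  = rand(3, 2)          # plain matrix
tbl  = Tables.table(mat)   # matrix → table (columns :Column1, :Column2)
df   = DataFrame(tbl)      # table → DataFrame
back = Matrix(df)          # DataFrame → matrix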

If a natural process of standardization follows from there, then that would be a bonus.

2 Likes

I think @tkf is working on this problem in his JuliaFolds projects at Julia Lab. It would be interesting to have a JuliaFolds-style API for MCMC samplers since they produce a stream of results incrementally.

Can you clarify what that means, maybe with a short snippet? For example, we can use vanilla Base to do

julia> ch = foldl(push!, [[1, 2, 3], [4, 5, 6]]; init = Channel{Vector{Int}}(Inf))  # push each chunk into an unbounded Channel
       close(ch)                                 # signal that no more items will arrive
       append!(Int[], Iterators.flatten(ch))     # drain the Channel and flatten the chunks
6-element Vector{Int64}:
 1
 2
 3
 4
 5
 6

This is also my thinking. In my own work I use all kinds of things: classical frequentist stats, PPL, ML, and also relational data (graphs). That is why I originally commented that Tables.jl is not enough.

However, such things get very complex very fast if you try to be general and cover all cases. So my thinking was that we could try to agree on Tables.jl as a minimal abstraction that is good enough for most simple things. In this way people learning the ecosystem would have a somewhat easier life, and by the time they start to do more advanced stuff they would have enough general Julia experience to digest more complex APIs.

In particular, the good thing is that in Tables.jl a feature can be anything. This means (and in practice there are already such implementations) that you can have some specific AbstractArray type that is computationally efficient (e.g. has a proper memory layout for a given problem) while being exposed to the Tables.jl API in a standardized way, with rows as observations and columns as features.
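As a minimal sketch of that idea (the names x1, x2, x3 are made up here), a named tuple of non-copying views into a performance-oriented matrix is already a valid Tables.jl table:

using Tables, DataFrames

X = rand(3, 100)   # 3 features × 100 observations, column-major for fast per-observation access

# Expose rows-as-observations to the Tables.jl world without copying the data:
tbl = (x1 = view(X, 1, :), x2 = view(X, 2, :), x3 = view(X, 3, :))
Tables.istable(tbl)                    # true: named tuples of vectors satisfy the interface
df = DataFrame(tbl; copycols = false)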

1 Like

Of course I am biased, but I think the approach we use in the JuliaGaussianProcesses packages such as KernelFunctions and AbstractGPs for dealing with observations as rows or columns is very nice to work with, from both a developer and a user perspective: you state the layout of your matrix explicitly by wrapping it as a ColVecs or RowVecs, so there is no confusion about the layout, no conventions are needed (which IMO are always somewhat arbitrary), and one does not have to carry an obsdim keyword argument around to every function call. The documentation of KernelFunctions explains the motivation and advantages of this approach: Design · KernelFunctions.jl
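A small usage sketch of that pattern (kernel choice and sizes here are arbitrary):

using KernelFunctions

X = rand(5, 100)                 # 5 features × 100 observations
k = SqExponentialKernel()

obs = ColVecs(X)                 # declare explicitly: each column is one observation
# obs = RowVecs(permutedims(X))  # the same data declared with rows as observations
K = kernelmatrix(k, obs)         # 100 × 100 kernel matrix, no obsdim keyword needed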

7 Likes

Yes, this is a good way of sidestepping the fight between observations-as-rows and observations-as-columns. Actually, we recently adopted the same approach in StatsBase and in Distances with pairwise: it takes an iterator of vectors rather than a matrix, so that one writes pairwise(f, eachcol(mat)) or pairwise(f, eachrow(mat)). This is particularly useful given that one most often wants to compute e.g. correlations between variables but distances between observations: no default works universally, even if we agreed on a convention for how observations are stored.
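For example, following the API described above (the matrix here has observations as rows and variables as columns):

using StatsBase, Distances

mat = rand(10, 3)                                   # 10 observations × 3 variables

R = StatsBase.pairwise(cor, eachcol(mat))           # 3 × 3 correlations between variables
D = Distances.pairwise(Euclidean(), eachrow(mat))   # 10 × 10 distances between observations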

For now, for performance reasons, pairwise in Distances also has to accept a matrix with a dims argument indicating how observations are stored. Once eachcol/eachrow return a special object (PR), the pairwise(f, eachcol(mat)) and pairwise(f, eachrow(mat)) methods will be as efficient as the one taking a matrix, so we will be able to deprecate the matrix methods if we want. Maybe your RowVecs and ColVecs types could be made equal to EachRow and EachCol?

Of course, the drawback of this approach is that it’s relatively verbose, notably for simple operations like cor(mat) == pairwise(cor, eachcol(mat)). So we could still assume that observations are stored as rows by default, but at least if we support an explicit syntax in all packages, people can use it for consistency in their code.

4 Likes

I’m writing a paper to show that Kaplan-Meier estimators are a terrible idea with a certain kind of dataset. To do this I’m writing a KM estimator package. I’m an opinionated Bayesian who doesn’t do survival modeling all the time, so I have no interest in maintaining such a package because I’ll never use it. Nevertheless, someone is probably interested in having KM estimators and plots. Handing this thing off to an organization that cares would be good. This is an example of how this issue comes up.

2 Likes

I didn’t know about the plans for eachcol and eachrow; this sounds great! I assume that we could use EachCol and EachRow instead of ColVecs and RowVecs once they become available. Probably the only disadvantage is that development and bug fixes in Base take more time, and that it would only be available in Julia >= 1.7 (EachCol etc. could be added to Compat, I assume, but eachcol etc. would still return an iterator in older Julia versions).

I would argue, though, that it would be better not to adopt any convention, since both choices seem arbitrary. I don’t see a clear advantage of one approach over the other (well, maybe columns, since Julia uses column-major order). I assume people with a background in R or classical statistics might be more used to rows as observations, whereas people with a background in ML might be more used to observations being stacked along the last dimension (e.g. with Flux and in MATLAB).

I also think that it would be clearer if one had to write cor(eachcol(mat)) or cor(eachrow(mat)) instead of cor(mat) where some convention is assumed. To me the example doesn’t seem to be a strong argument for a default convention :slightly_smiling_face: Also, often the same data is used multiple times with the same layout, so eachrow etc. could be called just once initially rather than in every function call, which would make it even less verbose (and in general it does not seem more verbose than dims = ...).

2 Likes

I think I have the exact opposite conclusion! Everyone should use rows as observations, whether or not it’s consistent with ML idioms in other languages. We don’t have to worry about performance, i.e. Julia using column-major order, because we can always use some other <:AbstractMatrix type with optimal storage for when you need observations as contiguous blocks in memory.
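A minimal sketch of that idea: keep the storage contiguous per observation, and present a rows-as-observations AbstractMatrix through a lazy, non-copying transpose:

X_storage = rand(5, 1_000)   # features × observations; each observation is contiguous in memory
X = transpose(X_storage)     # a 1_000 × 5 AbstractMatrix whose rows are observations
obs = view(X, 42, :)         # observation 42: reads a contiguous column of X_storage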

5 Likes

I’m pretty much on the other side here: I think having a stats ecosystem with a separate array system would introduce a lot of additional friction from conversions that are basically just transposes.

I think this issue is a good example of a place where early design decisions in Julia reflected incomplete awareness of statistical computing. In the same way that Julia struggled with missing values for a long time, I think Julia’s column-major design was simply in conflict with trying to please a community that fundamentally wants to iterate over rows instead of columns.

I think the situation really is as simple as:

  1. Julia already locked itself into column-major arrays as a basic principle for the language.
  2. There’s a community who want to iterate over rows more than over columns.
  3. Those people are going to pay either (a) a performance penalty forever because of (1) or (b) an abstraction penalty forever because of wrappers to make columns look like rows.

In the end, I think it’s really not that big a deal. The important thing is having someone drive the whole community in a specific direction. This is, clearly, the major gap in the Julia Stats space these days: there’s no full-time person who totally owns the space, is empowered to make executive decisions, and is held accountable for driving things forward in a coherent way.

1 Like

I (author of DataFrameMacros) was thinking about this just yesterday. I took part in a big statistics summer school, and the teachers chose DataFrameMacros for data wrangling. I was surprised, because DataFramesMeta is usually the one chosen, as it’s been around for much longer, and the two are pretty similar now anyway. But it made me think about how much long-term commitment it would / should require to upload a package to the General Registry. I felt a bit like once people are using my package, I can’t let them down by not developing it anymore. But I have a daughter now and not so much time anymore, making it unlikely that I’ll always be around to support and maintain it. It’s a bit different with Makie, because I’m not the only person there, but even so our “bus factor” is only 2 or 3. I’m hoping that increasing adoption will also increase contributor numbers.

(And the README docs of DataFrameMacros were supposed to be a feature; I had just previously read that many people dislike having to go to separate sites for a bit of documentation. Maybe I should rethink that :wink: )

4 Likes

In fact, there’s already this Compat PR, about to celebrate its second birthday, which changes what eachcol returns.

Even without it, you could adopt eachrow etc. as the API, with a little bit of fiddling to digest the input into whatever form you want internally:

julia> first(eachcol(rand(2,2))) isa SubArray{<:Any,1,<:Any,<:Tuple{Base.Slice, Int}}
true

(Without comment on whether this is overall a good path for “are observations rows?” issues.)

I don’t think (2) is true for all people who are active in and/or use packages from the JuliaStats organization. At least it’s not true for me: I am completely indifferent to the row/column debate, and rather think that one should make use of Julia’s abstraction capabilities and support both in a user- and developer-friendly way, e.g. by using eachcol/eachrow as suggested above.

We can’t/don’t want to: the API is based on collections of inputs as AbstractVectors, since this allows working with data points that are neither scalars nor arrays without having to introduce additional methods. E.g., input data for GPs with multiple outputs is of type AbstractVector{<:Tuple{S,Int}}. However, eachcol/eachrow return a Base.Generator.
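For illustration, a sketch using the MOInput wrapper from KernelFunctions (here the base inputs happen to be plain vectors):

using KernelFunctions

x = [rand(2) for _ in 1:5]   # 5 base inputs
mox = MOInput(x, 3)          # pairs every input with each of 3 output indices
eltype(mox)                  # Tuple{Vector{Float64}, Int}
length(mox)                  # 15: every (input, output) combination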

If I may, I would like to add this.

Lots of stats/econometrics work has a long-established convention for how data is organized (e.g. T×K, where T is the number of time periods and K the number of regressors). Trying to fight that is likely to be destructive for attempts to make Julia a go-to tool for people with this background/schooling. I believe there are better fights to pick.

Now, internally, packages can go either way and most of us will not bother. As for performance, switching the axes of an array is likely to be a cheap operation compared to what follows next (linear algebra, non-linear optimization, etc.).
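To illustrate the cost comparison (sizes here are arbitrary):

X  = rand(1_000, 10)   # T × K data matrix, observations as rows
Xt = transpose(X)      # O(1): a lazy, non-copying K × T view
Xc = permutedims(X)    # one pass over the data: a materialized K × T copy
# Either is negligible next to the linear algebra or optimization that follows.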

3 Likes

I’d suggest that people don’t want to iterate over columns OR rows; they want to iterate over atomic observations. That’s the level the API should be built at.

4 Likes

I am new to Julia and I may be wrong, but as a data scientist I would like to offer a few constructive criticisms of the Julia data science ecosystem.

I don’t think the problem with the Julia data ecosystem is about columns or rows…; I reckon it is a lack of vision. Julia is a fresh approach to programming, but most of the data science packages are the same old thing as every other data analysis package. Basically, the Julia data science ecosystem doesn’t have much to offer people coming from other worlds. For example, GLM.jl, one of the fundamental tools in the Julia ecosystem, is just the same as equivalent packages in other programming languages, yet it feels laggy because it recompiles every time. Some may say that doesn’t matter for large data, but in practice it is only as good as any other package for large data GIVEN I HAVE AN INFINITE AMOUNT OF MEMORY. Now, if I come from another world, why should I leave my comfortable place to come to a world where basically nothing is fresh, and where whenever I ask the community a question everyone tells me that “I’m holding my phone wrong”?

Julia needs a fresh vision for its data science ecosystem. For example, data scientists could consult with the developers of the main data science packages (GLM, CSV, JuliaDB, DataFrames, TypedTables, Queryverse…) to boost the developers’ knowledge about data, or the old developers could even retire and let fresh minds boost the ecosystem.

I have recently been using GLM, JuliaDB, Plots, CSV and DataFrames a lot, but to be honest I miss my previous life. GLM has nothing new to offer and its documentation is minimal…; JuliaDB is not being developed anymore; Plots is killing me on the first plot; CSV thinks only about speed, and ironically the first time (and in reality the only time that I want it) it is slow, and it introduces a lot of new data types which just create more confusion…; I don’t understand the design of DataFrames, and when I look at the h2o benchmarks it is not even fast…

2 Likes

I think it makes sense to build yourself a data analysis sysimage using PackageCompiler, where CSV, GLM, Plots, Distributions, and everything else is precompiled.
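A minimal sketch (the package list and output path are just examples):

using PackageCompiler

create_sysimage(
    [:CSV, :GLM, :Plots, :Distributions, :DataFrames];
    sysimage_path = "data_analysis.so",
)
# Then start Julia with: julia --sysimage data_analysis.so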

DataFrames.jl is literally in the top 3 of all the h2o benchmarks, and faster than many other widely used tools. Could you elaborate a bit more? I’ll post the link here for convenience.

https://h2oai.github.io/db-benchmark/

3 Likes