Julia stats, data, ML: expanding usability

I’m pretty much on the other side here: I think having a stats ecosystem with a separate array system would introduce a lot of extra friction from conversions that are basically just transposes.

I think this issue is a good example of a place where early design decisions made in Julia reflected a lack of complete awareness of statistical computing. In the same way that Julia struggled with missing values for a long time, I think Julia’s column-major design was just in conflict with trying to please a community that fundamentally wants to iterate over rows instead of columns.

I think the situation really is as simple as:

  1. Julia already locked itself into column-major arrays as a basic principle for the language.
  2. There’s a community who want to iterate over rows more than over columns.
  3. Those people are going to pay either (a) a performance penalty forever because of (1) or (b) an abstraction penalty forever because of wrappers to make columns look like rows.

In the end, I think it’s really not that big a deal. The important thing is having someone drive the whole community in a specific direction. This is, clearly, the major gap in the Julia Stats space these days – there’s no full-time person who totally owns the space, is empowered to make executive decisions and is held accountable for driving things forward in a coherent way.

1 Like

I (author of DataFrameMacros) was thinking about this just yesterday. I took part in a big statistics summer school, and the teachers chose to use DataFrameMacros for data wrangling. I was surprised, because DataFramesMeta is usually the one chosen, since it’s been around for much longer and the two are pretty similar now anyway. But it made me think about how much long-term commitment it would / should require to upload a package to the General Registry. I felt a bit like once people are using my package, I can’t let them down by not developing it anymore. But I have a daughter now and not so much time anymore, making it unlikely that I’ll always be around to support and maintain it. It’s a bit different with Makie, because I’m not the only person there, but still our “bus factor” is only 2 or 3. I’m hoping that increasing adoption will also increase contributor numbers.

(And the README docs of DataFrameMacros were supposed to be a feature; I had previously read that many people dislike having to go to separate sites for a bit of documentation. Maybe I should rethink that :wink: )

4 Likes

In fact there’s already this Compat PR, about to celebrate its second birthday, which changes what eachcol returns.

Even without it, you could adopt eachrow etc. as the API, with a little bit of fiddling to digest the rows into whatever form you want internally:

julia> first(eachcol(rand(2,2))) isa SubArray{<:Any,1,<:Any,<:Tuple{Base.Slice, Int}}
true
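
For instance, a minimal sketch of that fiddling (the function name fit_rows is purely hypothetical): accept any matrix at the API boundary and digest the row views into whatever internal representation the algorithm wants:

julia> function fit_rows(X::AbstractMatrix)
           # materialise each row view as a plain vector, one per observation
           [collect(row) for row in eachrow(X)]
       end;

julia> length(fit_rows(rand(3, 2)))   # 3 observations with 2 features each
3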

(Without comment on whether this is overall a good path for “are observations rows?” issues.)

I don’t think (2) is true for all people who are active in and/or use packages from the JuliaStats organization. At least it’s not true for me: I am completely indifferent to the row/column debate and rather think that one should make use of Julia’s abstraction possibilities and support both in a user- and developer-friendly way, e.g. by using eachcol/eachrow as suggested above.

We can’t/don’t want to: the API is based on collections of inputs as AbstractVectors, since this allows working with data points that are neither scalars nor arrays without having to introduce additional methods. E.g., input data for GPs with multiple outputs is of type AbstractVector{<:Tuple{S,Int}}. However, eachcol/eachrow return a Base.Generator.
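
To illustrate (the values and types below are made up for the example – they are not the actual GP package API): an observations-as-AbstractVector design lets each data point be, say, a tuple, while plain array data can still be digested by materialising eachcol:

julia> xs = [([0.3, 1.2], 1), ([0.5, 0.1], 2)];   # a Vector of (input, output-index) tuples – two data points

julia> X = rand(4, 10); obs = collect(eachcol(X));   # array data: materialise the column views

julia> length(obs)   # 10 observations, each a view of one column
10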

If I may, I would like to add this.

Lots of stats/econometrics have a long-established convention for how data is organized (e.g. TxK, where T is the number of time periods and K the number of regressors). Trying to fight that is likely to be destructive for attempts to make Julia a go-to for people with this background/schooling. I believe there are better fights to pick.

Now, internally, packages can go either way and most of us will not bother. As for performance, switching the axes of an array is likely to be a cheap operation compared to what follows next (linear algebra, non-linear optimization, etc.).
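
A small sketch of that point: switching the axes is either lazy (no copy at all) or a single pass over the data, which is typically negligible next to the factorisations or optimisations that follow:

julia> X = rand(1_000, 5);    # T×K, as in the convention above

julia> Xt = X';               # lazy: an Adjoint wrapper, no data is copied

julia> Xc = permutedims(X);   # eager: one pass over the data to build a K×T copy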

3 Likes

I’d suggest that people don’t want to iterate on columns, OR rows, they want to iterate on atomic observations. That’s the level the API should be built at.

4 Likes

I am new to Julia and I may be wrong, but as a data scientist I would like to offer a few constructive criticisms of the Julia data science ecosystem.

I don’t think the problem with the Julia data ecosystem is about columns or rows…; I reckon it is about a lack of vision. Julia is a fresh approach to programming, but most of the data science packages are just the same old same old as every other data analysis package. Basically, the Julia data science ecosystem doesn’t have much to offer people coming from other worlds. For example, GLM.jl, one of the fundamental tools in the ecosystem, is just the same as its counterparts in other programming languages, but it feels laggy because it recompiles every time. Some may say that doesn’t matter for large data, but in practice it is only as good as any other package for large data GIVEN I HAVE AN INFINITE AMOUNT OF MEMORY. Now, if I come from another world, why should I leave my comfortable place to come to a world where basically nothing is fresh, and where whenever I ask a question the community tells me that “I’m holding my phone wrongly”?

Julia needs a fresh vision for its data science ecosystem. For example, data scientists could consult with the developers of the main data science packages (GLM, CSV, JuliaDB, DataFrames, TypedTables, Queryverse…); it could boost the developers’ knowledge about data. Or the old developers could even retire and let fresh minds boost the ecosystem.

I have recently been using GLM, JuliaDB, Plots, CSV and DataFrames a lot, but to be honest I miss my previous life. GLM has nothing new to offer and its documentation is minimal…; JuliaDB is not being developed anymore; Plots is killing me with the time to first plot; CSV only cares about speed, and ironically the first time I use it (and in reality the only time I want that speed) it is slow, and it introduces a lot of new data types which just create more confusion…; I don’t understand the design of DataFrames, and when I look at the h2o benchmarks it is not even fast…

2 Likes

I think it makes sense to build yourself a data analysis image using PackageCompiler, where CSV and GLM and Plots and Distributions and everything is precompiled.
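
For example, a minimal sketch with PackageCompiler.jl (the package list and the output path are just placeholders – adapt them to your own stack):

using PackageCompiler

# bake the data-analysis stack into a custom system image
create_sysimage([:CSV, :DataFrames, :GLM, :Plots, :Distributions];
                sysimage_path = "data_analysis.so")

Starting Julia with julia --sysimage data_analysis.so then loads those packages from the image; passing a precompile_execution_file to create_sysimage additionally cuts first-call latency.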

DataFrames.jl is literally in the top 3 of all the h2o benchmarks, and faster than many other widely used tools. Could you elaborate a bit more? I’ll post the link here for convenience.

https://h2oai.github.io/db-benchmark/

3 Likes

I would like to request that we not digress into a discussion on compile times here. There are several other threads and discussions around that.

2 Likes

I should say it is not literally in the top 3; there is more to those results. BTW, that is my point: if you are into Python, you have the fastest solution, and you even have a solution that can handle problems of almost any size. If you are into R – which is famous for being slow – you still have a solution which is better than or as good as your Julia solution. What does DataFrames offer instead?

I am not trashing julia I am just looking at it from another angle.

Also, this benchmark is a bit out of date. DataFrames 1.2 had some pretty nice speedups.

Regarding the design of DataFrames.jl, I encourage you to open a separate thread; it would be great to discuss it. Recently we had a similar discussion here, and such discussions help to improve the package and the ecosystem in general. There I propose we can also discuss the differences between DataFrames.jl and other ecosystems, but to give you just one of the design principles: a DataFrame object is a light wrapper that stores any column you pass to it (as long as it is an AbstractVector). This flexibility has its benefits and costs, but this was the choice and design intention of the original package authors:

  • To give you an example of a benefit: you do not have the situation you get in Polars, where if you want to take a column from a data frame and use it with NumPy you have to perform a conversion because its native storage format is different (see the sketch after this list). Another example of a benefit: we have full support for views, as opposed to other ecosystems (which matters in practice when you have large data and do not have infinite memory; this is especially relevant for wide tables).
  • To give an example of a cost: in data.table one can sort a data frame by a key column, and then data.table sets a mark that the data frame is sorted. This information is later used to speed up some operations. We cannot do such a thing in DataFrames.jl because of the flexibility we provide.
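
To make the first benefit concrete, here is a minimal sketch (copycols=false just makes the no-copy behaviour explicit; by default the constructor copies each column once):

using DataFrames

v  = [0.1, 0.2, 0.3]                        # any AbstractVector can serve as a column
df = DataFrame(:x => v; copycols = false)   # the vector itself is stored – no conversion

df.x === v                # true: taking the column back gives you the original vector
sdf = view(df, 1:2, :)    # a SubDataFrame – a view, no data is copied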

Regarding the H2O benchmarks – unfortunately they have been stalled since mid June (the old maintainer, who was doing a great job, was moved to other tasks AFAICT). I would assume that both Polars and DataFrames.jl would look different now (these are two of the leading packages that are actively developed and have regular releases). Having said that, to repeat a comment I already made some time ago, we should not expect DataFrames.jl to be faster than e.g. Polars. Under the hood both go through the LLVM infrastructure, so if we used the same algorithms the performance would ultimately be similar.

13 Likes

With Julia, the single most important thing to me is clear semantics. I know what the heck Julia code means.

After that, the composability… If I want to shove something into something else I can. Differentiate through an agent based model? Sure… Put colors into my DataFrames? Sure.

Finally, speed. It’s all compiled to machine code with special methods for each type. If I want some functionality, I write it in Julia, not in C.

If you are largely a consumer of other people’s code you are less likely to care about Julia vs Python or R. But as soon as you want to develop some functionality, you just can’t do it in Python or R; it has to be done in C or C++.

8 Likes

Those of you who find macros clearer than nonstandard evaluation, can you explain why? Is it just because they are delineated by the @ sign?

I totally agree. As a biology PhD student with no experience in writing fast code, I recently achieved very big speed ups, converting some R (actually mostly C under the hood) functions that were too slow for large-ish datasets to Julia. Couldn’t/wouldn’t have even attempted this in R or Python.

Also, I’m not sure what need there is for a fresh approach to GLMs. I tend to use Bayesian methods myself. My formal Bayesian training was under one of the Stan core devs, but I always prefer to use Julia PPLs. However, I do sometimes need to use frequentist stats, and in such cases I can’t see anything wrong with trying to make the system largely similar to R.

6 Likes

With macros, the transformation depends only on the macro and the syntax, so if you see a macro you can know what expression it turns into. With non-standard evaluation, the way a function works can depend both on syntax and on runtime values. So you’re never quite sure what will happen; at least I always have that lingering feeling when using R.

5 Likes

Just to add (it was commented above but it is very relevant so I think it is worth stressing), with @macroexpand you can just check it.
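
For example, using DataFramesMeta’s @subset purely as an illustration (recent versions export it; older ones used @where), the expansion is fixed by the macro and the written syntax alone:

using DataFramesMeta

# `df` does not even need to exist for this to work –
# @macroexpand only looks at the syntax, never at runtime values.
@macroexpand @subset(df, :x .> 1)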

1 Like

Much of what’s worth saying is already in the Wikipedia article on fexprs (after observing the similarities between NSE and fexprs) and in https://dl.acm.org/doi/10.1145/3359619.3359744.

2 Likes

Perfect! Thanks @johnmyleswhite. I love the fact that Kent Pitman discouraged fexprs almost my lifetime ago. I have nothing but huge respect for him, and IMHO he is clearly correct. It’s not just the @, although having an indicator of macros is super helpful. It’s that nonstandard evaluation simply cannot be analyzed by reading the code (since it depends on the runtime values of arguments).

1 Like