Julia stats, data, ML: expanding usability

I think it makes sense to build yourself a data analysis sysimage using PackageCompiler.jl, where CSV.jl, GLM.jl, Plots.jl, Distributions.jl, and everything else you use is precompiled.
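A minimal sketch of what that looks like (the package list and output path are illustrative; adjust them to your own stack, and note the build can take a while):

```julia
# Build a custom sysimage with the data stack precompiled.
using PackageCompiler

create_sysimage(
    [:CSV, :DataFrames, :GLM, :Distributions, :Plots];
    sysimage_path = "data_analysis.so",
)
# Afterwards, start Julia with:
#   julia --sysimage data_analysis.so
# and `using CSV, DataFrames, ...` loads with much less latency.
```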

DataFrames.jl is literally in the top 3 of all the H2O benchmarks, and faster than many other widely used tools. Could you elaborate a bit more? I’ll post the link here for convenience.

https://h2oai.github.io/db-benchmark/

3 Likes

I would like to request that we not digress into a discussion on compile times here. There are several other threads and discussions around that.

2 Likes

I should say it is not literally in the top 3; there is more to those results. BTW, that is my point: if you are into Python, you have the fastest solution, and you even have a solution that can handle almost any problem size. If you are into R (which is famous for being slow), you still have a solution that is better than or as good as your Julia solution. So what does DataFrames.jl offer instead?

I am not trashing Julia; I am just looking at it from another angle.

Also, this benchmark is a bit out of date; DataFrames.jl 1.2 had some pretty nice speedups.

Regarding the design of DataFrames.jl, I encourage you to open a separate thread; it would be great to discuss it. Recently we had a similar discussion here, and such discussions help to improve the package and the ecosystem in general. There, I propose, we can also discuss the differences between DataFrames.jl and other ecosystems. But to give you one of the design principles: a DataFrame object is a light wrapper that stores any column you pass to it (as long as it is an AbstractVector). This flexibility has its benefits and costs, but it was the choice and design intention of the original package authors:

  • To give you an example of a benefit: you do not have the situation you have in Polars, where if you want to take a column from a data frame and use it with NumPy you have to perform a conversion, because their native storage format is different. Another benefit: we have full support for views, as opposed to other ecosystems (which matters in practice when you have large data and not infinite memory; this is especially relevant for wide tables).
  • To give an example of a cost: in data.table one can sort a data frame by a key column, and data.table then sets a mark that the data frame is sorted. This information is later used to speed up some operations. We cannot do such a thing in DataFrames.jl because of the flexibility we provide.
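The "light wrapper" point can be seen directly at the REPL (a small sketch; `copycols=false` asks the constructor not to copy the column):

```julia
using DataFrames

v = [1, 2, 3]
df = DataFrame(x = v, copycols = false)  # store v itself: no copy, no conversion
@assert df.x === v                       # the column *is* the original vector

sv = @view df[1:2, :]                    # views are fully supported
@assert sv isa SubDataFrame
```

Because the column is the very same vector you passed in, any array type from any package works as a column without a dedicated storage format.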

Regarding the H2O benchmarks: unfortunately, they have been stalled since mid-June (the old maintainer, who was doing a great job, was moved to other tasks AFAICT). I would assume that both Polars and DataFrames.jl would look different now (these are two of the leading packages that are actively developed and have regular releases). Having said that, to repeat a comment I already made some time ago, we should not expect DataFrames.jl to be faster than e.g. Polars. Under the hood, both go through the LLVM infrastructure, so if we used the same algorithms the performance would ultimately be similar.

13 Likes

With Julia, the single most important thing to me is clear semantics. I know what the heck Julia code means.

After that, the composability… If I want to shove something into something else I can. Differentiate through an agent based model? Sure… Put colors into my DataFrames? Sure.

Finally, speed. It’s all compiled to machine code with special methods for each type. If I want some functionality, I write it in Julia, not in C.

If you are largely a consumer of other people’s code, you are less likely to care about Julia vs Python or R. But as soon as you want to develop some functionality… you just can’t do it in Python or R; it has to be done in C or C++.

8 Likes

Those of you who find macros clearer than nonstandard evaluation, can you explain why? Is it just because they are delineated by the @ sign?

I totally agree. As a biology PhD student with no experience in writing fast code, I recently achieved very big speedups by converting some R functions (actually mostly C under the hood) that were too slow for large-ish datasets to Julia. I couldn’t/wouldn’t have even attempted this in R or Python.

Also, I’m not sure what need there is for a fresh approach to GLMs. I tend to use Bayesian methods myself; my formal Bayesian training was under one of the Stan core devs, but I always prefer to use Julia PPLs. However, I do sometimes need frequentist stats, and in such cases I can’t see anything wrong with trying to make the system largely similar to R.

6 Likes

With macros, the transformation depends only on the macro and the syntax, so if you see a macro you can know what expression it turns into. With non-standard evaluation, the way a function works can depend on both syntax and runtime values, so you’re never quite sure what happens; at least I always have that lingering feeling when using R.

5 Likes

Just to add (it was commented above, but it is very relevant, so I think it is worth stressing): with @macroexpand you can just check it.
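For instance, at the REPL (the exact lowered form it prints varies across Julia versions):

```julia
# See exactly what expression a macro call turns into, without running it.
# @macroexpand does not evaluate its argument, so `x` need not be defined.
ex = @macroexpand @assert x > 0
@assert ex isa Expr   # the expansion is an ordinary expression you can inspect
```

Since the expansion depends only on the macro and the syntax of the call, this inspection is always possible — unlike non-standard evaluation, where behavior can hinge on runtime values.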

1 Like

Much of what’s worth saying is already in Fexpr - Wikipedia (after observing the similarities between NSE and fexpr’s) and https://dl.acm.org/doi/10.1145/3359619.3359744.

2 Likes

Perfect! Thanks @johnmyleswhite . I love the fact that Kent Pitman discouraged fexprs almost my lifetime ago. I have nothing but huge respect for him, and IMHO he is clearly correct. It’s not just the @, although having an indicator of macros is super helpful. It’s that nonstandard evaluation simply cannot be analyzed by reading the code, since it depends on the runtime values of arguments.

1 Like

This has been a fascinating thread and slide deck to read. I’m primarily an R user (applied econometrics and data science) who is also pretty jazzed about a lot of the features that Julia has to offer. So, I’d like to offer some thoughts coming from that background.

  • It’s already been mentioned above, but the documentation across many Julia packages remains really quite poor. There are some important exceptions to this (e.g. DataFrames.jl is excellent), but it includes some key packages in the DS/econometrics stack and is extremely off-putting for new users. I would focus on fixing documentation before addressing any of the more abstract issues (e.g. row vs column orientation). As an aside, documentation in R was also quite poor and esoteric until about five years ago. Stata users would always point that out to me as a reason for not switching, despite other obvious advantages. I personally think some of the tidyverse benefits are oversold — compared to say, data.table — but the tidyverse and RStudio team definitely deserve plaudits for moving the needle forward here for the R ecosystem as a whole.

  • Missing values. I understand the technical barriers and conceptual breakthroughs that were needed to handle missing values in a general-purpose framework, and I see a lot of Julia devs quite pushy and pleased with themselves about this. But from a user perspective, missing values in Julia were a real PITA when I first started experimenting with its DS ecosystem. Code that worked fine in any of the other major DS languages would fail in Julia because of an obscure missing-values issue that needed to be handled explicitly. Maybe this has been sorted out since, but it ties in to my previous point about documentation. Missing values are the norm in any real-world dataset, and yet to find the necessary fix I had to consult the main Julia manual instead of (a) just having the package handle it for me, or (b) having an explicit example in the package README/docs.

  • R has been able to overcome a fairly fragmented ecosystem and multiple OO paradigms — indeed, arguably actively exploit them — through a few key packages that provide standardization methods across model classes. To highlight two that make a big difference in my everyday workflow: 1) broom provides “tidiers” for extracting consistent model summaries and goodness-of-fit information in data.frame format. 2) sandwich provides variance-covariance matrix methods that make it easy to adjust standard errors for almost any model class (a big deal in econometrics). Packages like these lead to outsize downstream benefits, since e.g. it makes it easy(ier) to create packages for exporting regression tables and coefficient plots regardless of model object (which is what the also excellent modelsummary package does). I had hoped something similar could be done fairly easily in Julia because of multiple dispatch and would love to see it, regardless.

  • Earlier it was remarked that GLM.jl doesn’t offer anything beyond what can be done in equivalent routines in other languages. But for me this is a feature not a bug! Wherever possible, I want exactly the same interface and results as I’ve come to experience in, say, R. I agree, however, that precompiling canned routines (which I thought was done by default in Julia 1.6?) is important to avoid sluggish TTFP/first-time performance. Speaking of which…

  • Personally, my immediate motivation for using Julia in a project is for some bespoke computation (e.g. a structural estimation). If I’m being brutally honest, there’s no gain to be had from switching out my applied econometrics stuff for which the canned routines in R (via C and Fortran) are already at maximum performance and coverage. And… that’s fine. The interoperability between these languages is good enough that it’s no problem for me to switch between them for any one particular task. Taking a step back, I often find myself opening up Julia just to play around. It’s just an incredibly fun and performant language to work in. (Congratulations and thanks to everyone involved!) I’m particularly excited about the ease of GPU integration going forward. That stuff is much easier in Julia than R or Python and I think could be a real source of comparative advantage in the years to come.

19 Likes

Unfortunately I have to agree. This post tries to summarize the most important patterns that can be used to handle missing values in practice.

Still, it is very difficult to handle them consistently in all cases. Interested readers can comment in this PR, where we are discussing with @nalimilan what the preferred result of a sum∘skipmissing row-wise reduction applied to several columns of data should be.
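To make the question concrete, this is the kind of row-wise reduction under discussion (a sketch; what should happen for an all-missing row is exactly the open question, so this example avoids that case):

```julia
using DataFrames

df = DataFrame(a = [1, missing], b = [3, 4])

# Row-wise sum, skipping missings within each row:
res = combine(df, AsTable(:) => ByRow(sum ∘ skipmissing) => :rowsum)
@assert res.rowsum == [4, 4]
```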

2 Likes

Exactly. Hence this: Data Access Pattern — MLDataUtils.jl v0.1 documentation

1 Like

What kind of operations are you thinking about in particular? We’re aware that missing values are annoying when working with data frames. In addition to what @bkamins mentioned, this could be improved by adding a keyword argument to propagate them automatically (see this issue), and DataFramesMeta could make this more convenient (@pdeffebach recently started adding support for @passmissing for this, see this PR).

Support for missing values in other packages is more difficult to fix, as the solutions vary. Many stats packages were written before missing existed in Base, so they don’t support it at all.

I agree we could use more missing utilities.

It’s a tough balance because while I put a lot of effort into making working with missings easier in DataFramesMeta, that can result in some lock-in which doesn’t benefit the whole community.

Overall, we could also do a better job advertising. There’s nothing wrong with just peppering your code with passmissing as needed. It’s certainly more annoying than R and Stata’s near-universal propagation, but it’s not so bad.
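For reference, `passmissing` (from Missings.jl) wraps a function so that it returns `missing` whenever any argument is `missing`, instead of throwing a MethodError:

```julia
using Missings

# uppercase("abc") works, but uppercase(missing) would error;
# the wrapped version propagates missing instead.
safe_upper = passmissing(uppercase)

@assert safe_upper("abc") == "ABC"
@assert ismissing(safe_upper(missing))
```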

Recent work on improving missing values support can be found in this stale PR. As you can see, it’s gotten held up by discussion of what the behavior should be. Once you get into the details, it’s pretty hard to determine the correct behavior. We could definitely use some expertise on what the most intuitive thing to do is.

A tension exists between covering the fundamentals and innovating.

To have any credibility at all, the language ecosystem has to cover the fundamentals. That’s a huge hill to climb – essentially implementing every statistical and machine learning method ever coded. I read a comment on another forum that said, “You can’t take Julia seriously for statistics, they don’t even have Generalized Additive Models [GAMs]!” The go-to package for GAMs is in R, and it’s a big program. Support for ANOVA isn’t that great (and you can’t get more fundamental than that) – but coding a comprehensive ANOVA program is a huge amount of work.

But your point is those are boring and available elsewhere, so they are not a reason to convert to Julia. Very valid point!

From a marketing standpoint, what is Julia’s “hook”, “killer feature”, or “what they have that others are without”?

  • JuMP for optimization appears to be extremely popular
  • Using Julia for automatic differentiation

Beyond that, it’s language features

  • solves the two language problem
  • JIT compilation for speed
  • developed in the era of multiprocessing, so supports it by design

Also important are multiple dispatch and the syntax, but I think those are harder to market.

To relate this back to the original post, how could the statistics and machine learning ecosystem be improved to make it a marketing advantage for Julia? “You need to use Julia for statistics, because everything just works together without the annoyances of other languages!”

One proposal to do that was to standardize the organization of data – at least for 2 dimensions – with columns being variables and rows being observations. If every stats package followed this convention, it would be easier on users, as they wouldn’t have to prepare the data differently for different packages.

I would actually go beyond that, though: support both. If the user has variables in rows and observations in columns, just transpose it for them (although watch out for that 4 GB matrix being fed in). Multiple dispatch can really help in making functions work no matter what the user throws at them. I found it infuriating to use Python – supposedly a loosely typed language – as I would often get function errors that said, “you need to send us data as type ‘X’”. Say I sent Ints and it wanted Floats: why didn’t it just convert them for me? With multiple dispatch, this is easily done.
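A hypothetical sketch of that “just convert it for me” pattern via multiple dispatch (the function name and the stand-in computation are illustrative):

```julia
# Core method works on floats; a second method accepts integers
# and converts them before re-dispatching to the core method.
myfit(x::AbstractVector{<:AbstractFloat}) = sum(x) / length(x)  # stand-in computation
myfit(x::AbstractVector{<:Integer})       = myfit(float.(x))    # convert for the user

@assert myfit([1.0, 2.0, 3.0]) == 2.0
@assert myfit([1, 2, 3]) == 2.0   # Ints are converted automatically
```

The same idea extends to orientation: one could add a method (or keyword) that transposes a variables-in-rows matrix before calling the core routine, so the user never sees a type error.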

4 Likes

What Julia has is quite a few nice optimization routines. GAMs are just one of many approaches to fitting shapes to data. A very nice alternative is radial basis functions: there’s a nice proof that, as the number of centers goes to infinity, they are dense in the space of smooth functions, and they are relatively trivial to implement yourself. I’ve thought about putting together an RBF package. One particularly useful form for Bayesian inference is compact radial basis functions using translations of the smooth bump function

exp(1-1/(1-x^2)) for x in (-1,1), 0 otherwise

The nice thing about this is that it’s a localized function and so the coefficients become decorrelated at a certain radius, reducing the complexity of sampling.
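A minimal sketch of that bump function and a compact RBF expansion built from translated, scaled copies (the names and the expansion form are illustrative, not a published package API):

```julia
# Smooth, compactly supported bump, scaled so bump(0) == 1.
bump(x) = abs(x) < 1 ? exp(1 - 1 / (1 - x^2)) : 0.0

# f(x) = Σᵢ wᵢ * bump((x - cᵢ) / r).  Because each term vanishes outside
# |x - cᵢ| < r, coefficients of centers farther apart than 2r decouple.
rbf(x, centers, weights, r) =
    sum(w * bump((x - c) / r) for (c, w) in zip(centers, weights))

@assert bump(0.0) == 1.0
@assert bump(1.5) == 0.0
@assert rbf(0.0, [0.0, 10.0], [2.0, 5.0], 1.0) == 2.0  # far center contributes nothing
```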

I think one of the differences between the people who are attracted to Julia and the people attracted to R/Stata/SAS etc is that Julia is more of a toolkit, and R/Stata/SAS etc are more of a pushbutton calculator kinda thing.

1 Like