Julia stats, data, ML: expanding usability

grantmcdermott · September 12, 2021, 5:10pm

This has been a fascinating thread and slide deck to read. I’m primarily an R user (applied econometrics and data science) who is also pretty jazzed about a lot of the features that Julia has to offer. So, I’d like to offer some thoughts coming from that background.

It’s already been mentioned above, but the documentation across many Julia packages remains really quite poor. There are some important exceptions to this (e.g. DataFrames.jl is excellent), but the problem extends to some key packages in the DS/econometrics stack and is extremely off-putting for new users. I would focus on fixing documentation before addressing any of the more abstract issues (e.g. row vs column orientation). As an aside, documentation in R was also quite poor and esoteric until about five years ago. Stata users would always point that out to me as a reason for not switching, despite other obvious advantages. I personally think some of the tidyverse benefits are oversold (compared to, say, data.table), but the tidyverse and RStudio team definitely deserve plaudits for moving the needle forward here for the R ecosystem as a whole.
Missing values. I understand the technical barriers and conceptual breakthroughs that were needed to handle missing values in a general purpose framework. I see a lot of Julia devs quite pushy and pleased with themselves about this. But from a user perspective, missing values in Julia were a real PITA when I first started experimenting with its DS ecosystem. Code that worked fine in any of the other major DS languages would fail in Julia because of an obscure missing values issue that needed to be handled explicitly. Maybe this has been sorted out since, but it ties in to my previous point about documentation. Missing values are the norm in any real-world dataset, and yet to find the necessary fix I had to consult the main Julia manual instead of (a) just having the package handle it for me, or (b) having an explicit example in the package README/docs.
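To make the "handled explicitly" point concrete, here is a minimal sketch of the two behaviours a user runs into. It is written in Python purely as a neutral illustration, not as Julia code; missing observations are modelled as `None`, and both function names (`strict_mean`, `mean_skip_missing`) are hypothetical helpers invented for this example.

```python
from typing import Iterable, List, Optional


def strict_mean(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean that propagates missingness: a single None makes the result None."""
    vals: List[Optional[float]] = list(values)
    if not vals or any(v is None for v in vals):
        return None
    return sum(vals) / len(vals)


def mean_skip_missing(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mean over the observed values only; the caller opts in to dropping gaps."""
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # nothing observed, so there is no mean to report
    return sum(observed) / len(observed)


if __name__ == "__main__":
    wages = [10.0, None, 14.0]       # one missing observation
    print(strict_mean(wages))        # missingness propagates to the result
    print(mean_skip_missing(wages))  # mean of the two observed values
```

In Julia the analogous opt-in is wrapping the data in `skipmissing(...)` before calling a reduction such as `mean`. A one-line example of that kind in a package README is what would have saved me the trip to the main manual.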
R has been able to overcome a fairly fragmented ecosystem and multiple OO paradigms (indeed, arguably actively exploit them) through a few key packages that provide standardization methods across model classes. To highlight two that make a big difference in my everyday workflow: 1) broom provides “tidiers” for extracting consistent model summaries and goodness-of-fit information in data.frame format. 2) sandwich provides variance-covariance matrix methods that make it easy to adjust standard errors for almost any model class (a big deal in econometrics). Packages like these lead to outsize downstream benefits, since, e.g., they make it easier to create packages for exporting regression tables and coefficient plots regardless of model object (which is what the also excellent modelsummary package does). I had hoped something similar could be done fairly easily in Julia because of multiple dispatch and would love to see it, regardless.
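As a rough sketch of the pattern I have in mind: one generic `tidy` function, with a method registered per model class, so that downstream tools only ever see one row format. This is written in Python with `functools.singledispatch`, which dispatches on the first argument only and is therefore just a stand-in for Julia's multiple dispatch. The model classes (`LinearFit`, `LogitFit`) and the `coef_table` helper are hypothetical, and none of this is how broom itself is implemented.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Dict, List


@dataclass
class LinearFit:
    """Hypothetical model class that stores results as parallel lists."""
    names: List[str]
    coefs: List[float]
    std_errors: List[float]


@dataclass
class LogitFit:
    """Hypothetical model class that stores results as dicts keyed by term."""
    params: Dict[str, float]
    se: Dict[str, float]


@singledispatch
def tidy(model) -> List[dict]:
    """Generic entry point: return one row per coefficient in a common format."""
    raise NotImplementedError(f"no tidy() method for {type(model).__name__}")


@tidy.register
def _(model: LinearFit) -> List[dict]:
    return [
        {"term": n, "estimate": b, "std_error": s}
        for n, b, s in zip(model.names, model.coefs, model.std_errors)
    ]


@tidy.register
def _(model: LogitFit) -> List[dict]:
    return [
        {"term": n, "estimate": b, "std_error": model.se[n]}
        for n, b in model.params.items()
    ]


def coef_table(models) -> List[dict]:
    """Downstream tool: stacks tidy() output without knowing any model class."""
    rows = []
    for i, m in enumerate(models, start=1):
        for row in tidy(m):
            rows.append({"model": i, **row})
    return rows


if __name__ == "__main__":
    fits = [LinearFit(["x"], [1.5], [0.2]), LogitFit({"x": 0.7}, {"x": 0.1})]
    for row in coef_table(fits):
        print(row)
```

The point is that `coef_table` never needs to change when a new model class appears. Whoever writes the model registers one `tidy` method, and every table or plotting package built on top picks it up for free.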
Earlier it was remarked that GLM.jl doesn’t offer anything beyond what can be done in equivalent routines in other languages. But for me this is a feature, not a bug! Wherever possible, I want exactly the same interface and results as I’ve come to experience in, say, R. I agree, however, that precompiling canned routines (which I thought was done by default in Julia 1.6?) is important to avoid sluggish TTFP/first-time performance. Speaking of which…
Personally, my immediate motivation for using Julia in a project is for some bespoke computation (e.g. a structural estimation). If I’m being brutally honest, there’s no gain to be had from switching out my applied econometrics stuff, for which the canned routines in R (via C and Fortran) are already at maximum performance and coverage. And… that’s fine. The interoperability between these languages is good enough that it’s no problem for me to switch between them for any one particular task. Taking a step back, I often find myself opening up Julia just to play around. It’s just an incredibly fun and performant language to work in. (Congratulations and thanks to everyone involved!) I’m particularly excited about the ease of GPU integration going forward. That stuff is much easier in Julia than in R or Python, and I think it could be a real source of comparative advantage in the years to come.
