Recent experience with Julia as the main data science driver

TLDR

Is Julia ready as a full-blown language for doing data science, like Python and R?

Hell yeah! But it’s got some rough edges still.

The long bits

I want to detail a recent experience I had with trying to use Julia as the only language for a data science project. In many ways, the project is a vanilla offline project: you are given a dataset and a column containing the target, and your task is to build a model.

So firstly, the data wrangling. As I have tweeted, I have been using the Trinity consisting of DataFrames.jl, Chain.jl, and DataFrameMacros.jl for manipulating tabular data. That works really well. I can write readable code and the experience is very pleasant except for a few rough edges in DataFrameMacros.jl.

I could unpack all sites into columns using this

transform(:sites => ByRow(onehot_sites) => ["site"*string(i) for i in 1:length(UNIQUE_SITES)])

which I ended up not doing as it made subsequent steps slower (I recall printing was slower). But the above showed me the power of DataFrames.jl. Once I defined onehot_sites to return a vector it’s just magic. I wouldn’t really know how to do this efficiently in R or Python.
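The post does not show onehot_sites itself, so here is a minimal sketch of what such a function could look like, assuming each row's sites value is a collection of site names. The contents of UNIQUE_SITES below are made up for illustration; only the names onehot_sites and UNIQUE_SITES come from the snippet above.

```julia
# Hypothetical stand-in for the post's UNIQUE_SITES.
UNIQUE_SITES = ["a.com", "b.com", "c.com"]

# Return one 0/1 indicator per entry of UNIQUE_SITES for a single row's sites.
onehot_sites(sites) = [Int(s in sites) for s in UNIQUE_SITES]

# One output column name per site, built the same way as in the post.
site_columns = ["site" * string(i) for i in 1:length(UNIQUE_SITES)]

# Encode a single row that visited two of the three sites.
row = onehot_sites(["c.com", "a.com"])
```

Because the function returns one vector per row and the target is a vector of column names, DataFrames.jl can spread each element into its own column, which is the behaviour the transform call above relies on.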

The data also comes with a column of JSON, so JSON3.jl came to the rescue. The JSON column contains websites the user has visited. So there could be multiple websites in one JSON. I found unpacking the JSON to be very easy to do in DataFrames.jl although the code readability may not be the best.

This is the one-liner I had used to extract all the sites from the data

all_sites = mapreduce(jsons->[json["site"] for json in jsons], vcat, dataset_post_first_set_of_filter1.sites)
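To see what that one-liner does without the original data, here is a small sketch where plain Dicts stand in for the parsed JSON3.jl objects. The sites_column data is invented for illustration, and the assumption is that each cell holds a list of objects with a "site" key, as the one-liner implies.

```julia
# Hypothetical rows: each element mimics one parsed JSON cell, i.e. a list of
# objects with a "site" key (plain Dicts stand in for JSON3 objects here).
sites_column = [
    [Dict("site" => "a.com"), Dict("site" => "b.com")],
    [Dict("site" => "b.com")],
    [Dict("site" => "c.com"), Dict("site" => "a.com")],
]

# Same shape as the one-liner: map each cell to its sites, then concatenate.
all_sites = mapreduce(jsons -> [json["site"] for json in jsons], vcat, sites_column)

# The distinct sites, in order of first appearance.
unique_sites = unique(all_sites)
```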

For other columns, I needed to do some one-hot encoding. So I looked around and ended up using MLJ.jl as I didn’t have anything else that was too convenient. MLJ.jl definitely has a learning curve, e.g. what’s the idea behind a machine? And also, if the column is of type Union{Missing, T} then it doesn’t process it until you get rid of the missing from the column and disallowmissing to change the type back to T. The error message could’ve been better too.
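The two steps described there (remove the missing values, then change the element type back to T) can be sketched in Base Julia alone. This is not the MLJ.jl or DataFrames.jl code from the project, just an illustration of why a column can still be typed Union{Missing, T} after the missing values are gone; the convert call plays the role that disallowmissing plays in the post.

```julia
# A column that allows missing values, with one value actually missing.
col = Union{Missing, Int}[1, missing, 3]

# Step 1: get rid of the missing entries (here by dropping them).
kept = col[.!ismissing.(col)]   # still typed Vector{Union{Missing, Int}}

# Step 2: narrow the element type back to Int. This throws an error if a
# missing value is still present, so step 1 has to come first.
clean = convert(Vector{Int}, kept)
```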

I didn’t know about FeatureTransforms.jl and had tried AutoMLPipeline.jl, but despite its problems, MLJ.jl was still OK. Obviously, Flux would be too hard-core for these simple things. Anyway, I would love to play around with a coherent, simple-to-use data processing library in pure Julia, something less heavy than MLJ.jl with all its heavy machinery.

The original file also came in Zip so I used ZipFile.jl and it was pleasant. I do recall having had issues with performance with ZipFile.jl in the past though so YMMV.

I ended up cataloging every website that appears in the dataset and created a column for each website. So I ended up with thousands of columns but DataFrames.jl handled it remarkably well! And I couldn’t be happier with the result.

However, I really struggled with IJulia.jl. It just flat out doesn’t work on my Windows machine. It kinda works in WSL2, but you can’t start it from Julia, or the browser tab for it will never appear. You actually need to go to bash and run jupyter notebook to start it. Thank god the Julia kernel worked in WSL2, though.

Modeling was another issue. It was actually pretty hard trying to find a decent modeling setup. MLJ.jl was too heavy for my liking so I ended up using EvoTrees.jl which I think is a decent implementation of boosting trees in pure Julia. But I ended up having to write my own CV which wasn’t too bad as I had made it simple in about 20 lines of code.
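The author's CV code is not shown, so the following is only a sketch of how a hand-rolled k-fold cross-validation can fit in about 20 lines using just the Random standard library. The names kfold_indices and cross_validate, and the fit/predict/score callbacks, are hypothetical; with EvoTrees.jl the callbacks would wrap its training and prediction functions.

```julia
using Random

# Split 1:n into k shuffled folds of near-equal size (k >= 2);
# returns a vector of (train, test) index pairs.
function kfold_indices(n::Integer, k::Integer; rng = Random.default_rng())
    perm = randperm(rng, n)
    folds = [perm[i:k:n] for i in 1:k]
    return [(reduce(vcat, folds[setdiff(1:k, j)]), folds[j]) for j in 1:k]
end

# Run the user-supplied fit/predict/score callbacks on each fold
# and collect one score per fold.
function cross_validate(fit, predict, score, X::AbstractMatrix, y::AbstractVector;
                        k = 5, rng = Random.default_rng())
    map(kfold_indices(length(y), k; rng = rng)) do (train, test)
        model = fit(X[train, :], y[train])
        score(y[test], predict(model, X[test, :]))
    end
end
```

Passing the model as three callbacks keeps the splitting logic independent of any particular modeling package, which matches the post's approach of feeding a plain Matrix to the learner.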

EvoTrees.jl relied on MLJ.jl to provide access to a wide variety of input types. But I found it inconvenient to use MLJ.jl so I just manually converted my DataFrame to Matrix type.

I decided to be lazy and tried to use the ROC computation from EvalMetrics.jl, though. That was my biggest mistake. I did a hyperparameter grid search thinking it would be over in an hour, but it took well over 10 hours, and I think the culprit is the inefficient ROC calculation in EvalMetrics.jl. The package worked really well otherwise, even if the doc site had broken formatting last time I checked :). In hindsight, I should have used random search and tried something like HyperOpt.jl.
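The post only suspects that the ROC computation was the bottleneck. If only the area under the curve is needed inside a grid search, one alternative is to compute AUC directly from ranks (the Mann-Whitney identity), which needs a single sort. The sketch below is a swapped-in technique in Base Julia, not what EvalMetrics.jl does internally; the function name auc is hypothetical.

```julia
# AUC via the rank-sum (Mann-Whitney U) identity: one sort, O(n log n),
# with tied scores sharing their average rank.
function auc(scores::AbstractVector{<:Real}, labels::AbstractVector{Bool})
    n = length(scores)
    n == length(labels) || throw(ArgumentError("scores and labels differ in length"))
    order = sortperm(scores)
    ranks = zeros(Float64, n)
    i = 1
    while i <= n
        j = i
        # Extend j to the end of the run of tied scores starting at i.
        while j < n && scores[order[j + 1]] == scores[order[i]]
            j += 1
        end
        ranks[order[i:j]] .= (i + j) / 2   # average rank of the tied run
        i = j + 1
    end
    n_pos = count(labels)
    n_neg = n - n_pos
    (n_pos == 0 || n_neg == 0) && throw(ArgumentError("need both classes"))
    return (sum(ranks[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
end
```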

Now to plotting. I decided to plot the CV results from all the folds, and Plots.jl worked beautifully. It’s quite intuitive compared to ggplot2. And I didn’t have to use something like {patchwork} to do a simple layout.

Next, I tried to do some optimization to find the best cut-off given a cost matrix, etc. I used Optim.jl, which was pleasant enough. But I do recall looking for ages for how to do optimization within bounds. The docs were just not very friendly IMO.
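The actual Optim.jl code is not shown in the post. For a single cut-off, the total misclassification cost is a step function that only changes at observed scores, so a plain scan over candidate cut-offs inside the bounds is one simple alternative. The sketch below uses that swapped-in approach in Base Julia rather than Optim.jl; total_cost, best_cutoff and the cost values are hypothetical.

```julia
# Total cost of classifying score >= cutoff as positive. cost_fp and cost_fn
# are hypothetical per-error costs (the off-diagonal of a cost matrix).
function total_cost(cutoff, scores, labels; cost_fp = 1.0, cost_fn = 5.0)
    fp = count(i -> scores[i] >= cutoff && !labels[i], eachindex(scores))
    fn = count(i -> scores[i] < cutoff && labels[i], eachindex(scores))
    return cost_fp * fp + cost_fn * fn
end

# Scan every candidate cut-off in [lower, upper] and keep the cheapest one
# (the lowest such cut-off when several share the minimum cost).
function best_cutoff(scores, labels; lower = 0.0, upper = 1.0, kwargs...)
    candidates = sort(unique(vcat(lower, scores, upper)))
    candidates = filter(c -> lower <= c <= upper, candidates)
    costs = [total_cost(c, scores, labels; kwargs...) for c in candidates]
    return candidates[argmin(costs)]
end
```

For smooth objectives, Optim.jl does have bounded solvers (a univariate optimize(f, lower, upper) call and Fminbox for the multivariate case), but the exact usage should be checked against its documentation.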

I then finished everything off by building a scoring function. I wish MLJ.jl had better guardrails, like how to handle missing categories for one-hot encoding at scoring/prediction time. But it’s a rough edge we can live with.
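One way to get that kind of guardrail by hand is to fix the category levels at training time and map anything unseen at scoring time to an all-zero row. The sketch below shows that policy in Base Julia; it is an assumption about what a reasonable guardrail could be, not MLJ.jl's behaviour, and fit_levels and onehot are hypothetical names.

```julia
# Learn the category levels from the training column only.
fit_levels(column) = sort(unique(column))

# Encode a column as an n-by-length(levels) 0/1 matrix. A value that was not
# seen at training time gets an all-zero row instead of raising an error.
function onehot(column, levels)
    out = zeros(Int, length(column), length(levels))
    for (row, value) in enumerate(column)
        col = findfirst(isequal(value), levels)
        col === nothing || (out[row, col] = 1)
    end
    return out
end

levels = fit_levels(["red", "blue", "red"])
scored = onehot(["red", "green"], levels)   # "green" was never seen in training
```

Keeping the levels as a separate object makes the number and order of columns identical between training and scoring, which is what a downstream model expects.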

Finally, I saved the scoring output as a CSV using CSV.jl. Pretty pleasant; nothing much to say except that CSV.jl is great.

But overall, everything just worked with some rough edges. I’d say Julia is definitely ready as a full-blown data science language. Especially if you are just doing the normal offline variety!

Side gripe: I really like notebooks less and less. If it wasn’t a requirement for this one, I would not have used Jupyter.

How does it compare with Python and R

Apart from the modeling part and the heaviness of MLJ.jl, I actually prefer doing data science in Julia! Cos it just feels right. The data manipulation is more intuitive without having to ham-fist vectorized patterns everywhere, and the overall feel is that because everything is more composable, I can be creative in how I approach a problem.

That’s odd about IJulia. Maybe worth starting a thread about it, as I use it almost every day on Windows and have with every version since about 1.2 (including 1.7beta), with multiple kernelspecs as well (for multiple threads), and never had any issues.

If you want something that is very light-weight and in pure Julia, maybe give SPGBox a try.

Nice write up!
I think these detailed user stories are very useful to evaluate current strengths and point out pain points and improvements.
Thanks for sharing.

The Insider version of the Julia extension for VS Code has native Jupyter notebook support that does not require installation of IJulia.jl, or Jupyter or anything Python. There is also zero configuration involved, it should just be enough to have Julia, VS Code and the Julia and Jupyter extension on your system and then everything should just work. At the moment there might still be some hiccups, i.e. we are in the middle of ironing out the last problems, but this will all ship very soon and will hopefully provide a very robust way to use Jupyter notebooks.

FYI & FWIW: IJulia not working on Julia 1.61. on Windows 10 · Issue #1002 · JuliaLang/IJulia.jl · GitHub

SPGBox seems to require the definition of a g! for the gradient, which is inconvenient. I wish g! could be inferred (via AD) or skipped.

You can of course compute the gradient with some of the automatic differentiation packages of the Julia ecosystem (a simple example is here: User guide · SPGBox.jl)

When I first saw Julia, I was absolutely excited with the idea that it could be the ultimate replacement for R and Python. Elegant, powerful, rigorous, consistent Julia. Every year I would evaluate it to watch its… not progress. Graphing packages came and went. Killer apps like Turing were only performant if you were really good at optimizing Julia. There were a couple of alternatives for Data Frames, neither of which was very polished. And so on.

This year I had given up. So thanks for some encouragement to evaluate it one more time.

At the same time, I have to say your caveat “Apart from the modeling part…” covers a lot of ground. That’s one of the three legs of the data science stool, so it feels like what you’re describing goes beyond some rough edges. True, data manipulation, cleaning, etc, can take the vast majority of your time if you’re starting from ground zero, and if that’s the case Julia is very nice. But…

I’ll evaluate it again, but the real test of Julia’s suitability for data science isn’t: are you able to do some data science in it? It’s: could you convincingly persuade the folks in a new data science engagement that Julia is amazing and there’s no risk to having it as your primary data science tool? I don’t like Python, but I have no hesitation about selling someone on it being our primary tool for a set of data science tasks. Same with R. (R’s a slightly harder sell nowadays because it doesn’t directly support deep neural networks and because so many recent grads from data science programs only know Python and have been told that R is old-school.)

I would be nervous persuading someone that our primary tool should be Julia. Maybe for a very specific niche like ODE + DNN. Or maybe in a Julia shop. But in an engagement that’s somewhat open-ended, where Julia would be something new, where exploration speed and flexibility is of the essence, I wouldn’t be willing to put my neck on the line to get it.

I wish that were not true… and maybe the last year has been so revolutionary that I’m way behind the times. You’ve inspired me to go explore, and I do so with some amount of hopefulness.

I think it’s now more or less DataFrames.jl

The other caveat is that my skills in this area are developing. I’ve only used {caret} and for most of my career, I didn’t have to touch many of these tools. So I suspect, it would be a very different experience if you were an MLJ.jl expert for example.

I’d echo that somewhat, as I found the basic data science toolkit wanting. E.g. GLM.jl and PCA don’t work as well as those in R.

The {torch} package looks alright.

Personally, for risk reasons, I would also choose something like Python with scikit-learn as it’s “safer”. But I am really not happy with scikit-learn. And Julia’s basic data science ecosystem doesn’t “feel” mature enough yet. But it’s a numbers game. Someone might come along and make it really easy to use.

It’s OK but it feels very strange for R programmers. I showed it to some students who mostly learn R and it was very hard to build on that. Mostly object orientation vs. functional, but OOP in R feels bad to me in any case.

What is your experience blending Julia with R or Python as needed using RCall and PyCall? Does that fit with your workflow or does it become too cumbersome?

It’s not bad. Although I haven’t used PyCall.jl, I have used RCall.jl. It’s pretty smooth as long as you don’t try to pass massive amounts of data between Julia and R.

Hi Wayne @wfolta , I was in a similar boat with Julia up to about mid 2019, at which point it became obvious that Julia was kicking into gear. By 2020 I was carrying out a number of analyses in Julia, and enjoying it. Now I think it’s very workable. But, I avoid all the neural networks and boosted trees and all that, I’m doing visualization and exploration as well as mechanistic Bayesian models and either Turing or Stan seems good for that. I’m also trying to build some samplers that are efficient in gradient free situations but that’s research level stuff not available yet.

A big part of all of it for me is that I can do real computing in Julia. For example I can design agent based models that are fast to execute. I can do ODEs, I can do PDEs etc.

Here’s another little point. I just tried Julia in VS Code again (I’d been using CLI and not graphing for a while.) Well, my plot comes out tiny, using StatsPlots. (The Plot Navigation thumbnails are significantly larger than the actual plots.) So I read somewhere about size=(x,y) and end up having to use size=(3200,1600), which gives reasonable-sized graphs but absolutely minuscule text and very fine lines. So that’s not really the answer.

That’s worse than Python in Jupyter. (Which is worse than R in RStudio.) The reason I dislike Python so much – besides poorly-designed software like Pandas – is the lack of a user-centric focus. Just trying to make a basic graph should be trivial. It’s essentially the “Hello World” of any data science tool in a GUI.

Is this a problem with Julia and Plots, or with VS Code? When I do:

using StatsPlots
histogram(randn(1000))

It pops up a window as follows:

image

Which doesn’t seem to be stupid in the way you mention.

I wonder why you experience this? Cos I’ve never experienced this myself.

I think this was due to an upstream issue with GR, which I believe has been fixed. So if you update your packages, it might work like normal again…

@wfolta

Currently I’d say Makie for plotting and DataFrames for dataframes.

As for “amazing”, I don’t think that’s especially realistic. It’s a new language with a small community. It has some unique characteristics like every language has, but it’s just another language. If you know a bunch of languages, it will be familiar from what you already know with some mostly small pluses and minuses.

Remember that Python and Julia aren’t data analysis DSLs but general purpose programming languages. They make different tradeoffs than RStudio. ggplot2 is still the best at what it does.
