TLDR
Is Julia ready as a full-blown language for doing data science, like Python and R?
Hell yeah! But it’s got some rough edges still.
The long bits
I want to detail a recent experience I had trying to use Julia as the only language for a data science project. In many ways, it's a vanilla offline project: you're given a dataset and a column containing the target, and your task is to build a model.
So firstly, the data wrangling. As I have tweeted, I have been using the Trinity consisting of DataFrames.jl, Chain.jl, and DataFrameMacros.jl for manipulating tabular data. That works really well. I can write readable code and the experience is very pleasant except for a few rough edges in DataFrameMacros.jl.
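For readers who haven't seen the combo, here's a minimal sketch of what such a pipeline looks like; the data and column names are made up for illustration:

```julia
# A minimal sketch of the DataFrames.jl + Chain.jl + DataFrameMacros.jl combo.
# The data and column names are made up for illustration.
using DataFrames, Chain, DataFrameMacros

df = DataFrame(group = ["a", "a", "b"], value = [1, 2, 3])

result = @chain df begin
    @subset(:value > 1)               # row-wise filter
    @transform(:double = 2 * :value)  # derived column, also row-wise
    groupby(:group)
    combine(:double => sum => :total)
end
```

The nice bit is that DataFrameMacros.jl macros are row-wise by default, so there's no broadcasting noise inside the pipeline.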
I could unpack all sites into columns using this
transform(:sites => ByRow(onehot_sites) => ["site"*string(i) for i in 1:length(UNIQUE_SITES)])
which I ended up not doing, as it made subsequent steps slower (I recall printing was slower). But the above showed me the power of DataFrames.jl: once I defined onehot_sites to return a vector, it's just magic. I wouldn't really know how to do this efficiently in R or Python.
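To make the pattern concrete, here's a hedged, self-contained reconstruction; `UNIQUE_SITES`, `onehot_sites`, and the data are my stand-ins, not the original project's code:

```julia
# Sketch of the pattern above: a row function that returns a vector gets
# spread into multiple output columns by DataFrames.jl.
# UNIQUE_SITES, onehot_sites, and the data are illustrative stand-ins.
using DataFrames

const UNIQUE_SITES = ["a.com", "b.com", "c.com"]

onehot_sites(sites) = [Int(s in sites) for s in UNIQUE_SITES]

df = DataFrame(sites = [["a.com"], ["b.com", "c.com"]])

wide = transform(df,
    :sites => ByRow(onehot_sites) => ["site" * string(i) for i in 1:length(UNIQUE_SITES)])
```

Because `ByRow(onehot_sites)` yields a vector per row and the output is a vector of column names, DataFrames.jl spreads the result across those columns.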
The data also comes with a column of JSON, so JSON3.jl came to the rescue. The JSON column contains websites the user has visited. So there could be multiple websites in one JSON. I found unpacking the JSON to be very easy to do in DataFrames.jl although the code readability may not be the best.
This is the one-liner I had used to extract all the sites from the data
all_sites = mapreduce(jsons->[json["site"] for json in jsons], vcat, dataset_post_first_set_of_filter1.sites)
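For illustration, a hedged reconstruction of that step with made-up data; only the `"site"` key comes from the original code:

```julia
# A hedged reconstruction of the JSON unpacking step with made-up data;
# only the "site" key comes from the original one-liner.
using JSON3

sites_json = [
    """[{"site": "a.com"}, {"site": "b.com"}]""",
    """[{"site": "c.com"}]""",
]

parsed = JSON3.read.(sites_json)  # each element parses to a JSON3.Array
all_sites = mapreduce(jsons -> [json["site"] for json in jsons], vcat, parsed)
```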
For other columns, I needed to do some one-hot encoding. So I looked around and ended up using MLJ.jl, as I couldn't find anything more convenient. MLJ.jl definitely has a learning curve, e.g. what's the idea behind a machine? Also, if a column is of type Union{Missing, T}, then it doesn't process it until you get rid of the missings from the column and use disallowmissing to change the type back to T. The error messages could've been better too.
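The workaround looks roughly like this, on toy data (the column name is made up):

```julia
# The Union{Missing, T} workaround described above, on toy data.
using DataFrames

df = DataFrame(site = ["a.com", missing, "b.com"])

df = df[.!ismissing.(df.site), :]  # get rid of the missings first...
disallowmissing!(df, :site)        # ...then narrow the eltype back to String
```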
I didn’t know about FeatureTransforms.jl and had tried AutoMLPipeline.jl, but despite its problems, MLJ.jl was still OK. Obviously, Flux would be too hard-core for these simple things. Anyway, I would love to play around with a coherent, simple-to-use data-processing library in pure Julia, something less heavy than MLJ.jl with all its machinery.
The original file also came zipped, so I used ZipFile.jl, and it was pleasant. I do recall having had performance issues with ZipFile.jl in the past though, so YMMV.
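A minimal ZipFile.jl round trip, just to show the API shape (file names are placeholders; the write step is only there to make the sketch self-contained):

```julia
# ZipFile.jl round trip; the write step just makes the sketch self-contained.
using ZipFile

path = joinpath(mktempdir(), "demo.zip")

w = ZipFile.Writer(path)               # create a tiny zip to read back
f = ZipFile.addfile(w, "dataset.csv")
write(f, "a,b\n1,2\n")
close(w)

r = ZipFile.Reader(path)               # the reading part is what I needed
content = read(only(r.files), String)  # hand this to CSV.jl via IOBuffer
close(r)
```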
I ended up cataloging every website that appears in the dataset and created a column for each website. So I ended up with thousands of columns but DataFrames.jl handled it remarkably well! And I couldn’t be happier with the result.
However, I really struggled with IJulia.jl. It just flat out doesn’t work on my Windows machine. It kinda works in WSL2, but you can’t start it from Julia, or the browser tab for it will never appear. You actually need to go to bash and jupyter notebook it to start it. Thank god the Julia kernel worked in WSL2 though.
Modeling was another issue. It was actually pretty hard to find a decent modeling setup. MLJ.jl was too heavy for my liking, so I ended up using EvoTrees.jl, which I think is a decent implementation of boosted trees in pure Julia. But I ended up having to write my own CV, which wasn’t too bad, as I kept it simple at about 20 lines of code.
EvoTrees.jl relies on MLJ.jl to provide access to a wide variety of input types. But I found it inconvenient to use MLJ.jl, so I just manually converted my DataFrame to a Matrix.
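A hand-rolled k-fold CV along those lines can be sketched like this; `fit_and_score` is a placeholder for the actual fit/predict/metric calls, which I'm deliberately not reconstructing:

```julia
# A hand-rolled k-fold CV sketch. `fit_and_score` is a placeholder for the
# actual fit/predict/metric calls (EvoTrees.jl in my case).
using Random

function kfold_indices(n, k; rng = Random.default_rng())
    perm = randperm(rng, n)
    folds = [perm[i:k:n] for i in 1:k]       # round-robin split of a shuffle
    [(setdiff(perm, f), f) for f in folds]   # (train, validation) index pairs
end

function cross_validate(fit_and_score, X, y; k = 5)
    [fit_and_score(X[tr, :], y[tr], X[va, :], y[va])
     for (tr, va) in kfold_indices(size(X, 1), k)]
end
```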
I decided to be lazy and used EvalMetrics.jl for the ROC computation though. That was my biggest mistake. I did a hyperparameter grid search thinking it would be over in an hour, but it took well over 10 hours, and I think the culprit is the inefficient ROC calculation in EvalMetrics.jl. Otherwise the package worked really well, even though the doc site had broken formatting last time I checked :). In hindsight, I should have used random search and tried something like Hyperopt.jl.
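For reference, ROC-AUC can be computed in O(n log n) via the rank-statistic formulation rather than a threshold sweep; a minimal sketch (ties not handled, to keep it short):

```julia
# ROC-AUC via the rank-statistic (Mann-Whitney) formulation: O(n log n)
# instead of sweeping thresholds. Ties are not handled, to keep it short.
function auc(scores::AbstractVector, labels::AbstractVector{Bool})
    ranks = similar(scores, Float64)
    ranks[sortperm(scores)] = 1:length(scores)  # rank of each score
    npos = count(labels)
    nneg = length(labels) - npos
    (sum(ranks[labels]) - npos * (npos + 1) / 2) / (npos * nneg)
end
```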
Now to plotting. I decided to plot the CV results from all the folds, and Plots.jl worked beautifully. It’s quite intuitive compared to ggplot2, and I didn’t have to use something like {patchwork} to do a simple layout.
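A toy example of the built-in layout support (the data here is made up):

```julia
# Plots.jl handles multi-panel layouts natively, no {patchwork} needed.
# The data here is made up.
using Plots

p1 = plot(1:10, rand(10), title = "fold 1", legend = false)
p2 = plot(1:10, rand(10), title = "fold 2", legend = false)
p3 = bar(["low", "high"], [3, 5], title = "summary", legend = false)

plot(p1, p2, p3, layout = (1, 3), size = (900, 300))
```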
Next, I tried to do some optimization around how best to choose a cut-off given a cost matrix, etc. I used Optim.jl, which was pleasant enough. But I do recall searching for ages on how to do optimization within bounds. The docs were just not very friendly IMO.
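For the record, the incantation I was after is Fminbox, which wraps an inner optimizer with box constraints; a toy sketch (the objective is a stand-in, not the actual cut-off/cost problem):

```julia
# Box-constrained optimization in Optim.jl: wrap an inner optimizer in Fminbox.
# The objective is a toy stand-in; its unconstrained optimum (2, -1) lies
# outside the box, so the solution should end up near the corner (1, 0).
using Optim

f(x) = (x[1] - 2.0)^2 + (x[2] + 1.0)^2
lower, upper = [0.0, 0.0], [1.0, 1.0]
x0 = [0.5, 0.5]

res  = optimize(f, lower, upper, x0, Fminbox(LBFGS()))
xmin = Optim.minimizer(res)
```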
I then finished everything off by building a scoring function. I wish MLJ.jl had better guardrails, e.g. around handling unseen categories in one-hot encoding at scoring/prediction time. But it’s a rough edge we can live with.
Finally, I saved the scoring output as a CSV using CSV.jl. Pretty pleasant, nothing much to say, except that CSV.jl is great.
But overall, everything just worked, with some rough edges. I’d say Julia is definitely ready as a full-blown data science language, especially if you are just doing the normal offline variety!
Side gripe: I like notebooks less and less. If Jupyter wasn’t a requirement for this one, I would not have used it.
How does it compare with Python and R?
Apart from the modeling part and the heaviness of MLJ.jl, I actually prefer doing data science in Julia! Cos it just feels right. The data manipulation is more intuitive without having to ham-fist vectorized patterns everywhere, and the overall feel is that, because everything is more composable, I can be creative in how I approach a problem.