Things that are easier in Julia than Python/R etc

For the past year or so, I’ve been trying to encourage my firm to give Julia a place in our tech stack. Now that the idea has finally gained some traction, I can present a more formal case with some code samples.

To that end, what are some things that Julia has saved you a lot of time on in comparison to say, Python?

I work in finance, so any applications in that regard are preferable, but all thoughts more than welcome.


DataFramesMeta.jl is designed to make something a lot easier to do in Julia than in dplyr: working with variables programmatically.

julia> using DataFramesMeta;

julia> df = DataFrame(x1 = [1, 2], x2 = [3, 4]);

julia> function clean_data!(df, var1, var2)
           @chain df begin
               @rtransform! $var1 = $var1 * 100
               @rtransform! $var2 = $var2 * 200
           end
       end
clean_data! (generic function with 1 method)

julia> v1 = "x1"; v2 = "x2";

julia> clean_data!(df, v1, v2)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │   100    600
   2 │   200    800

I am continuously baffled by how to do this in dplyr. The new curly-curly feature {{}} does not fix this: {{}} makes it easier to work with symbol literals representing columns, but that’s not ideal; we want people to be able to use variables to represent column names.


I second @pdeffebach. Pandas is a huge monolith, and honestly 90% of the things I want to do are there. But for me it’s more about the flexibility of Julia: I know I can write the code in a way that suits my needs using various packages, and not have to rely on one huge monolith where, if the exact function I want is not there, my code will be slow.

Other Julia niceties are in customized fast algorithms.


I’m not from the field. But I do simulations (of atoms :slight_smile: ), and simulations always have some things in common (not to mention DifferentialEquations.jl…). You can find an overview of the things I find amazing in Julia in this notebook: Particle simulations with Julia. I have no idea how to do many of those things in any other language, so I cannot really say that Julia saves time there; it simply allows things to be done that I could not do otherwise.


I also work in Finance (sustainable finance) and try to promote Julia. Basically, I use different arguments for different targets:

  • for my manager: Pluto is a really nice way to build interactive slide decks; I also use Stipple to share my modelling results; a lot of climate models are available entirely in Julia (the Mimi framework).

  • for my colleagues (who, as Pythonistas, are not really convinced): you don’t have to restructure your problem everywhere to avoid loops/iteration, and your code can be much faster.

Also for finance, you can show some examples of portfolio optimization in JuMP / or Convex.jl.


This is a really excellent point. I think to do this in R you end up having to write some pretty convoluted and unfriendly code, which seems to negate the whole point of dplyr.

I second both points.

When writing code in Python the “usual” way (with loops, objects, etc.) I often run into performance problems. There are ways to avoid them, e.g. by vectorizing with NumPy/Pandas, but this often makes the code more complex and difficult to understand.
With Julia you can write your code so that it resembles your problem best, without losing performance. This saves developer time and makes code maintenance easier.
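To make the trade-off concrete, here is a minimal Python sketch (the simple-returns helper is hypothetical, not from the thread): the loop version states the problem directly but is slow in CPython for large inputs, while the NumPy version is fast but restates the problem in array form.

```python
import numpy as np

def simple_returns_loop(prices):
    """Plain-Python loop: reads like the problem statement, but slow at scale."""
    out = []
    for i in range(1, len(prices)):
        out.append(prices[i] / prices[i - 1] - 1.0)
    return out

def simple_returns_vectorized(prices):
    """NumPy version: fast, but the intent is arguably less direct."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] / p[:-1] - 1.0).tolist()

prices = [100.0, 110.0, 99.0]
assert simple_returns_loop(prices) == simple_returns_vectorized(prices)
```

In Julia the loop version is also the fast version, so this choice never has to be made.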

With Jupyter it is easy to have a notebook in a state where it is not reproducible anymore. This happens when you execute your cells out of order and/or change a cell after execution / execute it multiple times.
To have reproducible notebooks with Jupyter requires discipline: write your notebook such that a “Restart Kernel/ Run all” does not change its results. Furthermore, you explicitly need to save the Project.toml / Manifest.toml of your notebook environment.
Pluto, in contrast, handles both execution order consistency and dependencies automatically for you.
In addition, it is super fast to build interactive notebook “applications” in Pluto, whereas interactivity in Jupyter is a pain (both for Python and Julia based notebooks).


You can avoid a lot of trouble by type constraining your function arguments in the beginning, until you’re sure that everything works and your input arguments are always what you expect. Then you can remove types to make the code generic if you need that. I find certainty about input types very reassuring compared to python and it makes it easier to reason about code. Multiple dispatch allows you to specify multiple entry points into the same code without having one big if else block to sort out your input arguments. And other people can extend your code easily, even if it’s just their own convenience entry points with their own types and not full blown pipelines on special types.
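For contrast, the closest thing in the Python standard library is functools.singledispatch, which dispatches on the first argument only (Julia dispatches on all arguments). A hedged sketch with a made-up describe function:

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # Fallback entry point, analogous to an untyped Julia method.
    return f"something: {x!r}"

@describe.register
def _(x: int):
    # Entry point selected when the first argument is an int.
    return f"integer {x}"

@describe.register
def _(x: list):
    # Entry point selected when the first argument is a list.
    return f"list of {len(x)} items"

assert describe(3) == "integer 3"
assert describe([1, 2]) == "list of 2 items"
assert describe("hi") == "something: 'hi'"
```

Third parties can also call `describe.register` on their own types, which is roughly the extension mechanism the post describes, restricted to one argument.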

1 Like

It’s actually very simple in R tbh…

df <- tibble(x1 = c(1, 2), x2 = c(3, 4))
clean_data <- function(df, var1, var2) {
	mul100 <- function(x) x * 100
	df %>% mutate_at(vars(var1, var2), mul100)
}
clean_data(df, "x1", "x2")
# A tibble: 2 × 2
     x1    x2
  <dbl> <dbl>
1   100   300
2   200   400

I love julia and its awesome composability, but I don’t think data wrangling is one of its strengths compared to R. Just my personal opinion of course.

I would focus on the freedom you have in julia to build really powerful and novel algorithms and ML models without the constraints of a specialized super optimized framework available.


I used two different multipliers! And I was able to keep the nice declarative syntax, i.e. actually writing out $var = $var * 100. I don’t think the R version is as easy. Plus, with DataFramesMeta.jl I can create new variables easily, not just overwrite existing ones.

julia> function clean_data!(df, var1, var2)
           @chain df begin
               @rtransform! $(var1 * "_cleaned") = $var1 * 100
               @rtransform! $(var2 * "_cleaned") = $var2 * 200
           end
       end
1 Like

Yeah, I don’t think you’ll easily convince a proficient tidyverse user that they should jump ship for better data wrangling. Not saying it’s impossible, as it really depends on what they value in a language. But that’s not going to be a generally easy sell to ppl who are basically happy with R.

1 Like

I fail to see your point.

clean_data <- function(df, var1, var2) {
	df[[paste0(var1, "_cleaned")]] <- df[[var1]] * 100
	df[[paste0(var2, "_cleaned")]] <- df[[var2]] * 100
	df
}
clean_data(df, "x1", "x2")

This may just be personal taste but I don’t see how this is any worse. I’m not beating on DataFramesMeta.jl, I’m just saying this is really not hard in R, and claiming that it is might not be the best way to promote julia. My two cents.


No offense. This looks much worse than R. The whole DataFrames ecosystem has much more verbose syntax than R.

While we’re playing devil’s advocate…

I find Python type annotations and checkers very helpful, but Julia’s annotations are inexpressive (e.g. no Iterable{T}) and people often warn against using them.
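As a small illustration of the kind of annotation meant here (the function name is made up): Python’s typing.Iterable[T] expresses “any iterable of T”, which has no direct counterpart among Julia type annotations; in Julia you would typically leave such an argument untyped.

```python
from typing import Iterable

def total(xs: Iterable[float]) -> float:
    # Accepts any iterable of floats: list, tuple, generator, ...
    # A checker like mypy verifies callers against this contract.
    return sum(xs)

assert total([1.0, 2.0]) == 3.0
assert total(x * 0.5 for x in (2, 4)) == 3.0
```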

1 Like

That’s not dplyr, though. You might prefer base R for those kinds of cleaning operations, but dplyr has certainly struck a chord, and lots of people like the piping verb structure.

My example above was very minimal, showing trivial transformations. In DataFramesMeta.jl you can construct much larger blocks of operations: subsetting, filtering, split-apply-combine, etc. dplyr provides great syntax for these large blocks, but you have to sacrifice some flexibility when working with things programmatically.

That’s fair criticism. Some of the things that R does to make syntax easier, like avoiding :x and @, aren’t possible in Julia.

I would argue there are some benefits to explicitness:

  • The @ lets the user know there is “magic” happening in the syntax.
  • The :x for column references is nice because it lets users know a name refers to a column in the data frame and not some existing variable.
  • @rtransform and @transform let users be explicit about row-wise vs. column-wise operations.

This comes at a cost of increased ugliness for sure, though.

1 Like

Now try to do that weird 10% of computation that can’t quite be done with the existing columnar (C/Fortran backend) functions. No big deal, you say, and you write a for loop, and the performance is horrible.

You don’t pay 80% of the cost dealing with the last 20% of niche/DIY tasks if you pick Julia.
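As a toy illustration of that pattern (the rule and numbers here are made up): a per-row calculation with branching has no single pandas verb, so in Python you either drop into a slow row-wise loop or restate the logic in columnar form.

```python
import pandas as pd

df = pd.DataFrame({"qty": [3, -1, 4], "price": [10.0, 20.0, 5.0]})

def cash_flow_loop(df):
    """Per-row Python loop: direct, but the slow path in pandas."""
    total = 0.0
    for row in df.itertuples():
        if row.qty >= 0:
            total += row.qty * row.price        # buys at full price
        else:
            total += row.qty * row.price * 0.5  # sells refunded at half price
    return total

def cash_flow_vectorized(df):
    """Columnar restatement of the same branching rule."""
    factor = df["qty"].where(df["qty"] >= 0, df["qty"] * 0.5)
    return float((factor * df["price"]).sum())

assert cash_flow_loop(df) == cash_flow_vectorized(df)
```

In Julia the loop form is already fast, so the niche 10% does not force a rewrite.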

1 Like

For sure, I prefer dplyr too for many things. And this is of course solvable there too. I’d just do this.

clean_data <- function(df, var1, var2) {
	df %>% mutate("{{var1}}_cleaned" := {{var1}} * 100,
		          "{{var2}}_cleaned" := {{var2}} * 100)
}
clean_data(df, x1, x2)

And it used to be an issue in R when working with data.frames programmatically, but I feel this is no longer the case.

You can also mix and match if you’d like

clean_data <- function(df, var1, var2) {
	df %>% mutate(across(all_of(var1), function(x) x * 100, .names = "{.col}_cleaned")) %>%
		  mutate(across({{var2}}, function(x) x * 100, .names = "{.col}_cleaned"))
}
clean_data(df, "x1", x2)

Either way there’s a lot of ways of working with this in modern days in R. You can read more about it by calling vignette("programming", "dplyr") in R.

1 Like

I think it’s fair to say that you can do everything in dplyr that you can do with DataFrames / DataFramesMeta, if you know how. Looking at your examples, there are definitely different underlying principles at play in how these libraries are designed. My personal preference leans a bit towards DataFramesMeta, because the macros are translatable to function calls, and those function calls then follow normal Julia rules. Your example shows a few different kinds of “magic”, which can both delight and confuse; for me it was usually more confusing when I was using R. But that seems mostly to be a matter of taste, which is hard to argue about.


Completely agree. It’s really about preference in the end. That’s why I normally advocate for julia in performance areas rather than data wrangling. In the end we’re all trying to advocate for julia the best way we can. :slight_smile: