Things that are easier in Julia than Python/R etc

For the past year or so, I’ve been trying to encourage my firm to give Julia a place in our tech stack. Now that the idea has finally gained some traction, I can present a more formal case with some code samples.

To that end, what are some things that Julia has saved you a lot of time on in comparison to say, Python?

I work in finance, so any applications in that regard are preferable, but all thoughts more than welcome.


DataFramesMeta.jl is designed to make something a lot easier to do in Julia than in dplyr: working with variables programmatically.

julia> using DataFramesMeta;

julia> df = DataFrame(x1 = [1, 2], x2 = [3, 4]);

julia> function clean_data!(df, var1, var2)
           @chain df begin
               @rtransform! $var1 = $var1 * 100
               @rtransform! $var2 = $var2 * 200
           end
       end
clean_data! (generic function with 1 method)

julia> v1 = "x1"; v2 = "x2";

julia> clean_data!(df, v1, v2)
2×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │   100    600
   2 │   200    800

I am continuously baffled by how to do this in dplyr. The new curly-curly feature {{}} does not fix this: {{}} makes it easier to work with symbol literals representing columns, but that’s not ideal; we want people to be able to use variables to represent column names.


I second @pdeffebach. Pandas is a huge monolith, and honestly 90% of the things I want to do are there. But for me it’s more about the flexibility of Julia: I know I can write the code in a way that suits my needs using various packages, and not have to rely on one huge monolith where, if the exact function I want is not there, my code will be slow.

Other Julia niceties are in customized fast algorithms.


I’m not from the field. But I do simulations (of atoms :slight_smile: ), and simulations always have some things in common (not to mention DifferentialEquations.jl…). You can find an overview of the things I find amazing in Julia in this notebook: Particle simulations with Julia. I have no idea how to do many of those things in any other language, so I cannot really say that Julia saves time there; it simply allows things to be done that I could not do otherwise.


I also work in Finance (sustainable finance) and try to promote Julia. Basically, I use different arguments for different targets:

  • for my manager: Pluto is a really nice way to build interactive slide decks; I also use Stipple to share my modelling results; a lot of climate models are available entirely in Julia (the Mimi framework).

  • for my colleagues (who, as Pythonistas, are not really convinced): you don’t have to restructure your problem everywhere to avoid loops/iteration, and your code can be much faster.

Also for finance, you can show some examples of portfolio optimization in JuMP / or Convex.jl.


This is a really excellent point. I think to do this in R you end up having to write some pretty convoluted and unfriendly code, which seems to negate the whole point of dplyr.

I second both points.

When writing code in Python the “usual” way (with loops, objects, etc.) I often run into performance problems. There are ways to avoid them, e.g. by vectorizing with NumPy/Pandas, but this often makes the code more complex and difficult to understand.
With Julia you can write your code so that it resembles your problem best, without losing performance. This saves developer time and makes code maintenance easier.
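To make the trade-off concrete, here is a minimal Python sketch (the simple-returns helper is hypothetical, not from the thread): the loop version states the problem directly but is slow in CPython for large inputs, while the NumPy version is fast but restates the problem in array form.

```python
import numpy as np

def simple_returns_loop(prices):
    """Plain-Python loop: reads like the problem statement, but slow at scale."""
    out = []
    for i in range(1, len(prices)):
        out.append(prices[i] / prices[i - 1] - 1.0)
    return out

def simple_returns_vectorized(prices):
    """NumPy version: fast, but the intent is arguably less direct."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] / p[:-1] - 1.0).tolist()

prices = [100.0, 110.0, 99.0]
assert simple_returns_loop(prices) == simple_returns_vectorized(prices)
```

In Julia the loop version is also the fast version, so this choice never has to be made.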

With Jupyter it is easy to have a notebook in a state where it is not reproducible anymore. This happens when you execute your cells out of order and/or change a cell after execution / execute it multiple times.
To have reproducible notebooks with Jupyter requires discipline: write your notebook such that a “Restart Kernel/ Run all” does not change its results. Furthermore, you explicitly need to save the Project.toml / Manifest.toml of your notebook environment.
Pluto, in contrast, handles both execution order consistency and dependencies automatically for you.
In addition, it is super fast to build interactive notebook “applications” in Pluto, whereas interactivity in Jupyter is a pain (both for Python and Julia based notebooks).


You can avoid a lot of trouble by type constraining your function arguments in the beginning, until you’re sure that everything works and your input arguments are always what you expect. Then you can remove types to make the code generic if you need that. I find certainty about input types very reassuring compared to python and it makes it easier to reason about code. Multiple dispatch allows you to specify multiple entry points into the same code without having one big if else block to sort out your input arguments. And other people can extend your code easily, even if it’s just their own convenience entry points with their own types and not full blown pipelines on special types.
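For contrast, the closest thing in the Python standard library is functools.singledispatch, which dispatches on the first argument only (Julia dispatches on all arguments). A hedged sketch with a made-up describe function:

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # Fallback entry point, analogous to an untyped Julia method.
    return f"something: {x!r}"

@describe.register
def _(x: int):
    # Entry point selected when the first argument is an int.
    return f"integer {x}"

@describe.register
def _(x: list):
    # Entry point selected when the first argument is a list.
    return f"list of {len(x)} items"

assert describe(3) == "integer 3"
assert describe([1, 2]) == "list of 2 items"
assert describe("hi") == "something: 'hi'"
```

Third parties can also call `describe.register` on their own types, which is roughly the extension mechanism the post describes, restricted to one argument.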

1 Like

It’s actually very simple in R tbh…

df <- tibble(x1 = c(1, 2), x2 = c(3, 4))
clean_data <- function(df, var1, var2) {
	mul100 <- function(x) x * 100
	df %>% mutate_at(vars(var1, var2), mul100)
}
clean_data(df, "x1", "x2")
# A tibble: 2 × 2
     x1    x2
  <dbl> <dbl>
1   100   300
2   200   400

I love julia and its awesome composability, but I don’t think data wrangling is one of its strengths compared to R. Just my personal opinion of course.

I would focus on the freedom you have in julia to build really powerful and novel algorithms and ML models without the constraints of a specialized super optimized framework available.


I used two different multipliers! And I was able to keep the nice declarative syntax, i.e. actually writing out $var = $var * 100. I don’t think the R version is as easy. Plus, with DataFramesMeta.jl I can create new variables easily, not just overwrite existing ones.

julia> function clean_data!(df, var1, var2)
           @chain df begin
               @rtransform! $(var1 * "_cleaned") = $var1 * 100
               @rtransform! $(var2 * "_cleaned") = $var2 * 200
           end
       end
1 Like

Yeah, I don’t think you’ll easily convince a proficient tidyverse user that they should jump ship for better data wrangling. Not saying it’s impossible, as it really depends on what they value in a language. But that’s not going to be a generally easy sell to ppl who are basically happy with R.

1 Like

I fail to see your point.

clean_data <- function(df, var1, var2) {
	df[[paste0(var1, "_cleaned")]] <- df[[var1]] * 100
	df[[paste0(var2, "_cleaned")]] <- df[[var2]] * 100
	df
}
clean_data(df, "x1", "x2")

This may just be personal taste but I don’t see how this is any worse. I’m not beating on DataFramesMeta.jl, I’m just saying this is really not hard in R, and claiming that it is might not be the best way to promote julia. My two cents.


No offense. This looks much worse than R. The whole DataFrames ecosystem has much more verbose syntax than R.

While we’re playing devil’s advocate…

I find Python type annotations and checkers very helpful, but Julia’s annotations are inexpressive (e.g. no Iterable{T}) and people often warn against using them.
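As a small illustration of the kind of annotation meant here (the function name is made up): Python’s typing.Iterable[T] expresses “any iterable of T”, which has no direct counterpart among Julia type annotations; in Julia you would typically leave such an argument untyped.

```python
from typing import Iterable

def total(xs: Iterable[float]) -> float:
    # Accepts any iterable of floats: list, tuple, generator, ...
    # A checker like mypy verifies callers against this contract.
    return sum(xs)

assert total([1.0, 2.0]) == 3.0
assert total(x * 0.5 for x in (2, 4)) == 3.0
```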

1 Like

That’s not dplyr, though. You might prefer base R for those kinds of cleaning operations, but dplyr has certainly struck a chord, and lots of people like the piping verb structure.

My example above was very minimal, showing trivial transformations. In DataFramesMeta.jl you can construct much larger blocks of operations: subsetting, filtering, split-apply-combine, etc. dplyr provides great syntax for these large blocks, but you have to sacrifice some flexibility when working with things programmatically.

That’s fair criticism. Some of the things that R does to make syntax easier, like avoiding :x and @, aren’t possible in Julia.

I would argue there are some benefits to explicitness:

  • The @ lets the user know there is “magic” happening in the syntax.
  • The :x for column references is nice because it lets users know a name refers to a column in the data frame and not some existing variable.
  • @rtransform and @transform let users be explicit about row-wise vs. column-wise operations.

This comes at a cost of increased ugliness for sure, though.

1 Like

Now try to do that weird 10% of computation that can’t quite be done with the existing columnar (C/Fortran backend) functions. No big deal, you say, and you write a for loop, and the performance is horrible.

You don’t pay 80% of the cost dealing with the last 20% of niche/DIY tasks if you pick Julia.
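As a toy illustration of that pattern (the rule and numbers here are made up): a per-row calculation with branching has no single pandas verb, so in Python you either drop into a slow row-wise loop or restate the logic in columnar form.

```python
import pandas as pd

df = pd.DataFrame({"qty": [3, -1, 4], "price": [10.0, 20.0, 5.0]})

def cash_flow_loop(df):
    """Per-row Python loop: direct, but the slow path in pandas."""
    total = 0.0
    for row in df.itertuples():
        if row.qty >= 0:
            total += row.qty * row.price        # buys at full price
        else:
            total += row.qty * row.price * 0.5  # sells refunded at half price
    return total

def cash_flow_vectorized(df):
    """Columnar restatement of the same branching rule."""
    factor = df["qty"].where(df["qty"] >= 0, df["qty"] * 0.5)
    return float((factor * df["price"]).sum())

assert cash_flow_loop(df) == cash_flow_vectorized(df)
```

In Julia the loop form is already fast, so the niche 10% does not force a rewrite.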

1 Like

For sure, I prefer dplyr too for many things. And this is of course solvable there too. I’d just do this.

clean_data <- function(df, var1, var2) {
	df %>% mutate("{{var1}}_cleaned" := {{var1}} * 100,
		          "{{var2}}_cleaned" := {{var2}} * 100)
}
clean_data(df, x1, x2)

And it used to be an issue in R when working with data.frames programmatically, but I feel this is no longer the case.

You can also mix and match if you’d like

clean_data <- function(df, var1, var2) {
	df %>% mutate(across(all_of(var1), function(x) x * 100, .names = "{.col}_cleaned")) %>%
		  mutate(across({{var2}}, function(x) x * 100, .names = "{.col}_cleaned"))
}
clean_data(df, "x1", x2)

Either way there’s a lot of ways of working with this in modern days in R. You can read more about it by calling vignette("programming", "dplyr") in R.

1 Like

I think it’s fair to say that you can do everything in dplyr that you can do with DataFrames / DataFramesMeta, if you know how. Looking at your examples, there are definitely different underlying principles at play in how these libraries are designed. My personal preference leans a bit towards DataFramesMeta, because the macros are translatable to function calls, and those function calls then follow normal Julia rules. Your example shows a few different kinds of “magic”, which can both delight and confuse; for me it was usually more confusing when I was using R. But that seems mostly to be a matter of taste, which is hard to argue about.


Completely agree. It’s really about preference in the end. That’s why I normally advocate for julia in performance areas rather than data wrangling. In the end we’re all trying to advocate for julia the best way we can. :slight_smile: