Things that are easier in Julia than Python/R etc

For sure, I prefer dplyr too for many things, and this is of course solvable there as well. I’d just do this:

clean_data <- function(df, var1, var2) {
	df %>% mutate("{{var1}}_cleaned" := {{var1}} * 100,
		          "{{var2}}_cleaned" := {{var2}} * 100)
}
clean_data(df, x1, x2)

Working with data.frames programmatically used to be an issue in R, but I feel this is no longer the case.

You can also mix and match if you’d like:

clean_data <- function(df, var1, var2) {
	df %>% mutate(across(all_of(var1), function(x) x*100, .names = "{.col}_cleaned")) %>%
		  mutate(across({{var2}}, function(x) x*100, .names = "{.col}_cleaned"))
}
clean_data(df, "x1", x2)

Either way, there are a lot of ways of working with this in modern R. You can read more about it by calling vignette("programming", "dplyr") in R.

1 Like

I think it’s fair to say that you can do everything with dplyr that you can do with DataFrames / DataFramesMeta, if you know how. Looking at your examples, there are definitely different underlying principles at play in how these libraries are designed. My personal preference leans a bit towards DataFramesMeta, because the macros are translatable to function calls, and those function calls then follow normal Julia rules. Your example shows a few different kinds of “magic” which can both delight and confuse; for me it was usually more confusing when I was using R. But that seems mostly to be a matter of taste, which is hard to argue about.

4 Likes

Completely agree. It’s really about preference in the end. That’s why I normally advocate Julia in performance areas rather than data wrangling. In the end we’re all trying to advocate Julia the best way we can. :slight_smile:

5 Likes

Sorry to keep harping on this, but I still don’t think this is 1:1.

To use variables representing column names, such as v1 = "x1", you have to use across, mutate_at, or some other construct, losing the nice var = f(var) syntax that is the selling point of dplyr. This is my understanding from the vignette and from the snippet below.

The first expression is the one you can use programmatically, i.e. with input v1 = "x1", but you have to use all_of instead of a keyword argument.

All of the examples in the programming vignette work with symbol literals.

I really like dplyr and base R for data wrangling. But I really think easy prototype-to-function switches are a highlight of DataFramesMeta.jl, and I hope it can get people into good programming habits quickly, since the change needed to support programmatic use of columns is so small: just add $.
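For concreteness, a minimal sketch of that “just add $” switch, assuming DataFramesMeta.jl and a toy DataFrame (the names clean_col, x1, x2 are mine, not from the posts above):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(x1 = [1, 2], x2 = [3, 4])

# Interactive prototype: column names are Symbol literals.
@rtransform(df, :x1_cleaned = :x1 * 100)

# Programmatic version: the same expression, with $ escaping a
# variable that holds the column name as a string.
function clean_col(df, var)
    @rtransform(df, $(var * "_cleaned") = $var * 100)
end

clean_col(df, "x1")  # adds an x1_cleaned column
```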

Your notebook is amazing. There is so much to learn from it. The animations of error propagation in a running simulation are fascinating.

3 Likes

To focus on something concrete, you could probably provide an example where broadcast loop fusion improves the performance of code compared to the equivalent code in R. You could steal the following example and add an R microbenchmark comparison:
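(The example being referred to is not reproduced in the thread.) As an illustrative sketch of what broadcast loop fusion buys, using only Base Julia (the function names are mine):

```julia
x = rand(10^6)

# A chain of dotted operations fuses into a single loop with
# one output allocation.
fused(x) = @. 2x^2 + 3x + 1

# The step-by-step version allocates a temporary array per line,
# which is roughly what vectorized R code does under the hood.
function unfused(x)
    t1 = x .^ 2
    t2 = 2 .* t1
    t3 = 3 .* x
    return t2 .+ t3 .+ 1
end

fused(x) == unfused(x)  # identical results; time both with @time or BenchmarkTools
```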

1 Like

Isn’t R::data.table the standard go-to package?

Download stats would suggest the opposite:

Source: R pkg download stats - ipub

Personally I use dplyr daily as part of the tidyverse, and I can’t bear the data.table syntax. Nowadays if I want speed in R, I drop in some Julia code via JuliaCall.

1 Like
using StatsPlots, Distributions
# Mathematical notation is used in code.
# It's much easier to follow along with equations from a
# paper or book when your coded functions look the same.
function Beta_params(μ, σ²)
    # This function takes a mean and variance and parametrises
    # a Beta distribution.
    ν = μ*(1 - μ)/σ² - 1
    α = μ*ν
    β = (1 - μ)*ν
    return Beta(α, β) # The function returns an object,
                      # which is a Beta distribution.
end
# Even though the function was defined for a single set
# of parameters, by using the "." notation, Julia
# knows to broadcast the function over
# my input (in this case two vectors).
Dists = Beta_params.(LinRange(0.4, 0.2, 20), LinRange(0.1, 0.05, 20))
# Plotting the distribution (or 20 of them in this case) is as simple as...
plot(Dists)

[plot: the 20 Beta distributions]

14 Likes

I disagree, and I would take the R syntax any day for data wrangling. I’m just letting you know that your argument about Julia’s strength in data wrangling would have pushed me further from the language, not closer. You are of course welcome to use this or not. :slightly_smiling_face:

Tidyverse is a giant hot mess thanks to its frequent reliance on what amounts to fexprs (non-standard evaluation), but the people who like it are usually completely unaware of the issue and also know various workarounds. It certainly isn’t a great way to “convince” anyone who enjoys the Tidyverse. For those of us who hate it with a bright burning white passion, Julia is an enormous breath of fresh air, so it seems like a good argument, but it would really only succeed in a discussion with like-minded people.

9 Likes

If you want to develop a decent package for Python, you need to solve a three-language problem: Python, C (or something similar), and the interoperability of C and Python. In Julia you need Julia alone!

Comparing languages’ syntax is silly (because you always like the one you are familiar with), but performance and features are critical, and in that respect Julia can be amazing.

Isn’t R::data.table the standard go-to package?

I like and use data.table. It’s done a lot to bring speed to R and is a much-needed substitute for data.frames. I also like and use ggplot2 and find it hard to beat at what it does. Other than ggplot2, I don’t much use the other tidyverse packages. I enjoy piping once in a while, but not all that much. When it was released a few years ago, I expected data.table to become the new standard, but that doesn’t seem to have happened (yet). Instead it seems R users love all the non-standard tidyverse tricks; the RStudio team’s enthusiasm has probably had a lot to do with it. So no, data.table is not the standard it arguably ought to be.

1 Like

For an average R or Python user, Julia gives you the ability to develop large-scale, memory hungry projects like multi-agent simulations or gigantic portfolio optimizations that would be harder to put together in R or Python (you’d probably have to use Rcpp and such).

I’m still using R for plots and Python for web-scraping, just because I have existing code I can copy-paste at great speed and get things done. But all my “serious” projects right now are in Julia.

It’s a bit like with your Ferrari: you take it to a circuit when you want speed, but you go shopping in the Mini. :grimacing:

1 Like

I’m a bit surprised by this, seeing the examples in this thread. Here’s a summary:

Example 1:

# R
clean_data <- function(df, var1, var2) {
    df %>% mutate_at(vars(var1), function(x) x*100) %>%
           mutate_at(vars(var2), function(x) x*200)
}

# Julia
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $var1 = $var1 * 100
        @rtransform! $var2 = $var2 * 200
    end
end

Example 2:

# R
clean_data <- function(df, var1, var2) {
	df %>% mutate("{{var1}}_cleaned" := {{var1}} * 100,
		          "{{var2}}_cleaned" := {{var2}} * 100)
}

# Julia
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $(var1 * "_cleaned") = $var1 * 100
        @rtransform! $(var2 * "_cleaned") = $var2 * 100
    end
end

Example 3:

# R
clean_data <- function(df, var1, var2) {
	df %>% mutate(across(all_of(var1), function(x) x*100, .names = "{.col}_cleaned")) %>%
		   mutate(across({{var2}}, function(x) x*100, .names = "{.col}_cleaned"))
}
clean_data(df, "x1", x2)

# Julia (edited, see comments below)
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $("$(var1)_cleaned") = $var1 * 100
        @rtransform! $("$(var2)_cleaned") = $var2 * 100
    end
end
clean_data!(df, "x1", :x2)

Example 4:

# R
df[[paste0(var1, "_cleaned")]] <- df[[var1]] * 100

# Julia
df[:, var1*"_cleaned"] = df[:, var1] * 100

Conclusion:

One thing that looks worse in Julia is the need for @chain df begin.

Otherwise, what strikes me is the amount of specialized functions and syntax needed in R. And every time the problem is a bit different, the code gets replaced with a very different solution. Look at the new things we need to learn as we move from one example to another (I don’t count standard language syntax, like the R function(x) ... lambda in the first example or Julia indexing in the last example).

Example 1:

R: mutate_at and vars

Julia: @rtransform!, = and escaping with $.

Example 2:

R: "{...}", {...} and :=.

Julia: -

Example 3:

R: across, all_of, .names and {.col}.

Julia: :col can be used instead of "col"

Example 4:

R: [[...]]

Julia: -

Of course there is a big thing to learn with DataFrames.jl which is not shown here: the => minilanguage. There’s a real learning curve, but it often gives my favorite solution:

transform(df, var1 => (x->100x),
              var2 => (x->200x), renamecols=false)
              
transform(df, var1 => (x->100x) => var1*"_cleaned",
              var2 => (x->200x) => var2*"_cleaned")

So clear and consistent :slight_smile:

17 Likes

Not sure if this would be the best analogy, given Julia’s versatility and open-sourceness :slight_smile:

The Lego analogy by @ElOceanografo seemed more accurate…

Agreed. I don’t really know what a finance workflow looks like, but if someone needs to wrangle data and do basic stats and they are already proficient and happy using R/Python then I’m not sure they need to learn another programming language.

1 Like

It is also a different thing to learn a programming language and to learn how to use a package that was written to/for that programming language. For example, I don’t know how to do any of the data frame stuff exemplified here in any of the languages mentioned, including Julia. Making a case for Julia is not the same thing as making a case for the DataFrames.jl package vs. other alternatives.

If one is only using the interfaces of a few packages, the underlying language is not really relevant.

(though, comparing syntaxes, the prettiest one is the one @sudete has shown :slight_smile: )

To make the case for Julia the point is to show how those functions and data structures compose with other code, possibly custom ones, while retaining performance and some consistency.

7 Likes

My two cents about this, being a relatively new Julia user and a relatively old R user.

The tidyverse developers have thought very well through the daily needs of a data scientist, and I enjoy working with their libraries and their very well designed functions. Fair enough, then, that a proficient user of the tidyverse ecosystem may be reluctant to adopt Julia for data cleaning and wrangling in general.

After getting a bit used to Julia’s respective functions and mindset, based mainly on the tools relevant to DataFrames.jl, I feel there is one thing Julia offers in a better way: transparency. The Pair-like input-output operations used while select()ing and transform()ing, together with the AsTable() and ByRow() functions, make the adoption process very smooth. Most importantly, they make you feel that you have better control over the input and the output data, and they mesh well with the brain’s process of “seeing” data. Why is that important? Instead of just thinking about what I want to do and then trying to use programming to accomplish it, as happens when I use R, what I want to do is somehow promoted by the tools I have in Julia. This kind of transparency helped me think of different ways of shaping the data, or of getting various (custom) types of data out of it.
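To make that concrete, a small sketch of the Pair-style operations described above, assuming DataFrames.jl (the toy column names are mine):

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [3, 4])

# source columns => function => output column, applied row by row:
select(df, [:a, :b] => ByRow(+) => :a_plus_b)

# AsTable passes the selected columns to the function as a named tuple:
transform(df, AsTable([:a, :b]) => ByRow(r -> r.a * r.b) => :product)
```

The input, the operation, and the output name are all visible in one Pair chain, which is the “transparency” described above.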

Another thing missing from an R user’s mindset, closely relevant to cleaning and analyzing data, is that in Julia there are many ways of doing the thing you want to do. That is something I had heard many times since the first tutorial I read, but I never really appreciated the power behind that mindset until I discovered how many ways there are to analyze categorical data: with Dictionaries, OrderedCollections, FreqTables or DataFrames.

4 Likes

While I agree with your premise, I would not condone the practice. In general my advice would be to advocate love for Julia rather than hatred for another language. NSE especially is quite useful to me, and I would think to many others, for some tasks. This in no way prevents me from using and appreciating Julia, as well as the design decision not to support it.

1 Like