Things that are easier in Julia than Python/R etc

Sorry to keep harping on this, but I still don’t think this is 1:1.

To use variables representing column names, such as v1 = "x1", you have to use across, mutate_at, or some other construct, rather than the plain var = f(var) syntax that is the selling point of dplyr. This is my understanding from the vignette and from the snippet below.

The first expression is the one which you can use programmatically, i.e. inputting v1 = "x1", but you have to use all_of instead of a keyword argument.

All of the examples in the programming vignette work with symbol literals.

I really like dplyr and base R for data wrangling. But I really think easy prototype-to-function switches are a highlight of DataFramesMeta.jl, and I hope it can get people into good programming habits quickly, since the change needed to support programmatic use of columns is so small: just add $.
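As a minimal sketch of that prototype-to-function switch (assuming DataFramesMeta.jl is installed; `scale!` and its arguments are made-up names for illustration):

```julia
using DataFrames, DataFramesMeta

df = DataFrame(x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# Prototype: the column name is written as a literal.
@rtransform!(df, :x1 = :x1 * 100)

# Function version: the column name arrives as a variable and is
# escaped with $; that is the only change from the prototype.
function scale!(df, var, factor)
    @rtransform!(df, $var = $var * factor)
end

scale!(df, "x2", 200)
```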

Your notebook is amazing. There is so much to learn from it. The animations of error propagation in a running simulation are fascinating.

3 Likes

To focus on something concrete, you could probably provide an example where broadcasting loop fusion improves the performance of code compared to the equivalent code in R. You could steal the following example and add an R microbenchmark comparison:
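Such a comparison might look something like the following Base-only sketch (illustrative only, not the snippet being referred to, and without the benchmark harness):

```julia
x = collect(0.0:0.001:1.0)
y = similar(x)

# Every dotted call below fuses into a single loop over x: no
# intermediate vectors are allocated, and .= writes into y in place.
y .= exp.(sin.(2 .* x)) .+ 1

# The vectorized R equivalent, exp(sin(2 * x)) + 1, instead
# materializes a fresh vector for each operation.
```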

1 Like

isn’t R::data.table the standard package to go?

Download stats would suggest the opposite:

Source: https://ipub.com/dev-corner/apps/r-package-downloads/

Personally I use dplyr daily as part of tidyverse, and I can’t bear the data.table syntax. Nowadays if I want speed in R I drop in some Julia code via JuliaCall

1 Like
using StatsPlots, Distributions
# Mathematical notation is used in code.
# It's much easier to follow along with equations from a
# paper or book when your coded functions look the same.
function Beta_params(μ, σ²)
    # This function takes a mean and variance and uses the method
    # of moments to parameterise a Beta distribution, with ν = α + β
    ν = μ*(1 - μ)/σ² - 1
    α = μ*ν
    β = (1 - μ)*ν
    return Beta(α, β) # The function returns an object,
                      # which is a Beta distribution.
end
# Even though the function was defined for a single set
# of parameters, by using the "." notation, Julia
# knows to broadcast the function over
# my input (in this case two vectors).
Dists = Beta_params.(LinRange(0.4, 0.2, 20), LinRange(0.1, 0.05, 20))
# Plotting the distribution (or 20 of them, in this case) is as simple as...
plot(Dists)

[plot of the 20 Beta distributions]

14 Likes

I disagree and I would take the R syntax any day for data wrangling. I’m just letting you know that your argument about Julia’s strength in data wrangling would have pushed me further from the language. Not closer. You are of course welcome to use this or not. :slightly_smiling_face:

Tidyverse is a giant hot mess thanks to its frequent reliance on what amounts to fexprs (non-standard evaluation), but the people who like it are usually completely unaware of the issue and also know various workarounds. It certainly isn't a great way to "convince" anyone who enjoys the Tidyverse. For those of us who hate it with a bright burning white passion, Julia is an enormous breath of fresh air, so it seems like a good argument, but it would really only be successful if you were discussing the issue with like-minded people.

9 Likes

If you want to develop a decent package for Python, you need to solve a three-language problem: Python, C (or something similar), and the interoperability of C and Python. In Julia you need Julia alone!

Comparing languages' syntax is silly (because you always prefer the one you are familiar with), but performance and features are critical, and in this respect Julia can be amazing.

isn’t R::data.table the standard package to go?

I like and use data.table. It's done a lot to bring speed to R and is a much-needed alternative to data.frames. I also like and use ggplot2 and find it hard to beat at what it does. Other than ggplot2, I don't use the other tidyverse packages much. I enjoy piping once in a while, but not all that much. When it was released a few years ago, I expected data.table to become the new standard, but that doesn't seem to have happened (yet). Instead it seems R users love all the non-standard tidyverse tricks. Probably the RStudio team's enthusiasm has had a lot to do with it. So no, data.table is not the standard it arguably ought to be.

1 Like

For an average R or Python user, Julia gives you the ability to develop large-scale, memory hungry projects like multi-agent simulations or gigantic portfolio optimizations that would be harder to put together in R or Python (you’d probably have to use Rcpp and such).

I’m still using R for plots and Python for web-scraping, just because I have existing code I can copy-paste at great speed and get things done. But all my “serious” projects right now are in Julia.

It’s a bit like your Ferrari: you take it to the circuit when you want speed, but you go shopping with the Mini. :grimacing:

1 Like

I’m a bit surprised by this, seeing the examples in this thread. Here’s a summary:

Example 1:

# R
clean_data <- function(df, var1, var2) {
    df %>% mutate_at(vars(var1), function(x) x*100) %>%
           mutate_at(vars(var2), function(x) x*200)
}

# Julia
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $var1 = $var1 * 100
        @rtransform! $var2 = $var2 * 200
    end
end

Example 2:

# R
clean_data <- function(df, var1, var2) {
    df %>% mutate("{{var1}}_cleaned" := {{var1}} * 100,
                  "{{var2}}_cleaned" := {{var2}} * 200)
}

# Julia
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $(var1 * "_cleaned") = $var1 * 100
        @rtransform! $(var2 * "_cleaned") = $var2 * 200
    end
end

Example 3:

# R
clean_data <- function(df, var1, var2) {
    df %>% mutate(across(all_of(var1), function(x) x*100, .names = "{.col}_cleaned")) %>%
           mutate(across({{var2}}, function(x) x*200, .names = "{.col}_cleaned"))
}
clean_data(df, "x1", x2)

# Julia (edited, see comments below)
function clean_data!(df, var1, var2)
    @chain df begin
        @rtransform! $("$(var1)_cleaned") = $var1 * 100
        @rtransform! $("$(var2)_cleaned") = $var2 * 200
    end
end
clean_data!(df, "x1", :x2)

Example 4:

# R
df[[paste0(var1, "_cleaned")]] <- df[[var1]] * 100

# Julia
df[:, var1*"_cleaned"] = df[:, var1] * 100

Conclusion:

One thing that looks worse in Julia is the need for @chain df begin.

Otherwise, what strikes me is the amount of specialized functions and syntax needed in R. And every time the problem is a bit different, the code gets replaced with a very different solution. Look at the new things we need to learn as we move from one example to another (I don't count standard language syntax, like the R function(x) lambda in the first example or Julia indexing in the last example).

Example 1:

R: mutate_at and vars

Julia: @rtransform!, = and escaping with $.

Example 2:

R: "{...}", {...} and :=.

Julia: -

Example 3:

R: across, all_of, .names and {.col}.

Julia: :col can be used instead of "col"

Example 4:

R: [[...]]

Julia: -

Of course there is a big thing to learn with DataFrames.jl which is not shown here: the => minilanguage. There’s a real learning curve but it often gives my favorite solution:

transform(df, var1 => (x->100x),
              var2 => (x->200x), renamecols=false)
              
transform(df, var1 => (x->100x) => var1*"_cleaned",
              var2 => (x->200x) => var2*"_cleaned")

So clear and consistent :slight_smile:

18 Likes

Not sure if this would be the best analogy, given Julia’s versatility and open-sourceness :slight_smile:

The Lego analogy by @ElOceanografo seemed more accurate…

Agreed. I don’t really know what a finance workflow looks like, but if someone needs to wrangle data and do basic stats and they are already proficient and happy using R/Python then I’m not sure they need to learn another programming language.

1 Like

It is also a different thing to learn a programming language versus learning how to use a package written for that language. For example, I don't know how to do any of the data-frame stuff exemplified here in any of the languages mentioned, including Julia. Making a case for Julia is not the same thing as making a case for the DataFrames.jl package vs. other alternatives.

If one is only using the interfaces of a few packages, the underlying language is not really relevant.

(though, comparing syntaxes, the prettiest is the one @sijo has shown :slight_smile: )

To make the case for Julia, the point is to show how those functions and data structures compose with other code, possibly custom code, while retaining performance and some consistency.

8 Likes

My two cents about this, being a relatively new Julia user and a relatively old R user.

The tidyverse developers have thought through the daily needs of a data scientist very well, and I enjoy working with their libraries and their well-designed functions. And fair enough: being a proficient user of the tidyverse ecosystem, one may be reluctant to adopt Julia for data cleaning and wrangling in general.

After getting used a bit to Julia's respective functions and mindset, mainly the tools relevant to DataFrames.jl, I feel that there is one thing Julia offers in a better way: transparency. What I mean is that the Pair-like input-output operations used while select()ing and transform()ing, together with the AsTable() and ByRow() functions, make the adoption process very smooth. Most importantly, they make you feel that you have better control over the input and output data, and they mesh well with the brain's process of "seeing" data. Why is that important? I realized that instead of just thinking about what I want to do and then trying to use programming to accomplish it, as happens when I use R, what you want to do is somehow promoted by the tools you have in Julia. This type of transparency helped me think of different ways of shaping the data, or of getting various (custom) types of data out of it.
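For instance, a minimal sketch of these Pair-style operations (assuming DataFrames.jl is installed; the toy columns are made up):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# ByRow lifts a scalar function to operate row by row:
df2 = select(df, :a => ByRow(x -> 2x) => :a2)

# AsTable passes several columns to the function as a named tuple,
# here summing across columns within each row:
df3 = transform(df, AsTable([:a, :b]) => ByRow(sum) => :total)
```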

Another thing that is missing from an R user's mindset, closely relevant to cleaning and analyzing data, is that in Julia there are many ways of doing the thing you want to do. That is something I had heard many times since the first tutorial I read, but I never really grasped the power behind that mindset until I discovered how many ways there are to analyze categorical data: with Dictionaries, OrderedCollections, FreqTables, or DataFrames.
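As one package-free illustration of those "many ways": tabulating categorical data with a plain Base Dict (toy data for illustration):

```julia
data = ["red", "blue", "red", "green", "red"]

# Count occurrences of each category with a plain Dict.
counts = Dict{String,Int}()
for x in data
    counts[x] = get(counts, x, 0) + 1
end
```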

4 Likes

While I agree with your premise, I would not condone the practice. In general my advice would be to advocate love for Julia rather than hatred for another language. NSE especially is quite useful to me, and I would think to many others, for some tasks. This in no way prevents me from using and appreciating Julia, as well as the design decision not to support it.

1 Like

Agreed. Consistency is definitely one of the strong points Julia has going for it. Your example 3 won't run in Julia though, since I used NSE combined with variable passing. But this was also super contrived; no R user would write code that way. :joy:

1 Like

Yes, small correction: Example 3 should read

clean_data!(df, "x1", :x2)

In DataFramesMeta, QuoteNodes (Symbol literals like :x2) are the equivalent of the bare column names used in dplyr.

1 Like

Exactly… which is why I said:

And I also agree with you when you say:

The fundamental HUGE advantage that Julia has is that the community has come together around some principles of clarity, transparency, and composability. In Julia, once you've put in the effort to learn the basic concepts, you are ready to do almost anything, and everyone can easily see what it is that you did without reading a couple hundred pages of manuals to understand all the special functions.

This is the perfect example… Tidyverse has kind of one function for each “common thing” that data wranglers need to do. It’s written by data wranglers so they happen to know what is very common, and if you just learn that “in order to screw in an orange 4 mm torx wood screw of 13mm length you need an orange 4mm torx screwdriver for 13mm length screws” then you’re set… Julia is more of a “here’s a reversible adjustable torque-clutched lightweight high power rechargeable screwdriver and a set of every bit you could ever want for every screw ever made” approach and you put together whatever combo you need. The interoperability of the ecosystem is fantastic.

I attribute this to Julia’s ecosystem having been made by people who have both domain experience in various scientific computing fields and knowledge of programming language design and implementation. This is a special group of people. Whereas R was basically S with mostly lexical scope, and S was invented by statisticians so they didn’t have to write too much Fortran. Essentially S was the original creator of the “two language problem” and grew by accretion into about a 7 language problem (C, Fortran, base R, R + S3 OOP, R + S4 OOP, R + early Hadleyverse, R + modern Tidyverse)

Also you ~left out the functional options in Julia~ (never mind, I see this was basically example 4):

df[:, var1*"_cleaned"] = map(myfun, df[:, var1])

which requires no special macros at all.

I think you underestimate how much everyone uses weird subsets of the tidyverse. I’ve seen some quite contrived code… it just kinda depends on which subset of the tidyverse people know, and they tend to stick to that corner of it because otherwise they spend their whole time searching the toolbox for the “proper” tool. I imagine the Tidyverse as like a full size warehouse full of drawers and in each drawer is the perfect tool for one particular job… and yet the vast majority of people only know what’s in about 12 drawers, and the intersection of any two people’s set of known drawers is about 3 drawers.

11 Likes