A data scientist's thoughts on R & Python

Gunter_Faes · January 15, 2020, 4:13pm

This blog post by Gordon Shotwell describes his comparative thoughts on R and Python. Maybe this article is helpful for understanding how the “ordinary” data scientist thinks and what he needs in his daily work? And perhaps it helps with designing Julia functions and packages, especially for the statistical environment?

Especially the 3rd part, “The glory of CRAN”, supports the arguments about package quality that I have already made here.

lostella · January 16, 2020, 9:17pm

R is a functional programming language, which means that the natural way to accomplish something in the language is to use functions.

Zach_Christensen · January 18, 2020, 2:22pm

When I see people describe R as functional or OOP it usually seems like they’re just trying to win an argument. There are 4 iterations of class systems in core R and others provided by packages. I don’t think you can really say it’s just functional or OOP.

mwsohn · January 18, 2020, 4:54pm

Very interesting reading. Is there a way to implement this R code in Julia?

fancyError <- function(df) {
  class <- class(df)
  var_name <- as.character(substitute(df))
  if (!inherits(df, "data.frame")) {
    warning(glue::glue("'{var_name}' is of class '{class}' when it needs to be a dataframe"))
  }
}
fancyError(my_var)

Zach_Christensen · January 18, 2020, 4:59pm

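using DataFrames  # provides AbstractDataFrame
# Data frames pass silently; every other type falls through to the error method.
fancy_error(::AbstractDataFrame) = nothing
fancy_error(t::T) where {T} = error("$t is of type $(nameof(T)) when it needs to be a dataframe")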
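
How the two methods choose between each other can be shown with a small dependency-free sketch. The names `AbstractTable`, `ToyTable`, and `check_table` are hypothetical stand-ins (assumptions made here so the example runs without DataFrames.jl); the point is only that Julia picks the most specific matching method.

```julia
# Hypothetical stand-in types, used instead of DataFrames.jl's AbstractDataFrame.
abstract type AbstractTable end
struct ToyTable <: AbstractTable end

# Most specific method: any subtype of AbstractTable is accepted silently.
check_table(::AbstractTable) = nothing

# Fallback method for every other type: report the offending type.
check_table(x) = error("got a $(typeof(x)) when a table was needed")

check_table(ToyTable())   # dispatches to the first method, returns nothing

try
    check_table(42)       # dispatches to the fallback, which throws
catch err
    println(err.msg)
end
```

No explicit type test is needed in the function body; the check lives entirely in the method signatures.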

mwsohn · January 18, 2020, 5:09pm

Thank you. It’s simpler to implement this function in Julia than in R. It seems that all the examples used in Gordon Shotwell’s blog to demonstrate the benefits of using R over Python can be implemented more simply or easily in Julia, IMHO.

joemiller · January 18, 2020, 6:11pm

This doesn’t quite work. The error message interpolates the value of t, not its name. I don’t see how you’d be able to do this in Julia without making fancy_error a macro.

Zach_Christensen · January 18, 2020, 7:30pm

If by “name” you mean the global variable name, you’re right. But even the provided example only works in the top-level function in R. Eventually you’d need to use stack tracing, or step into the function with a debugger, to track down the exact mapping from your global-level variable.
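
The macro route mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `@fancy_error` and `type_mismatch_message` and the two-argument form are hypothetical, and `AbstractDict` stands in for `AbstractDataFrame` so the sketch runs without DataFrames.jl. A macro receives its argument as written at the call site, so it can report the name rather than the value; the wrapper at the end illustrates the limitation just described, since inside a function only the local name is visible.

```julia
# Hypothetical helper: returns `nothing` when the type is acceptable,
# otherwise a message that mentions the name it was given.
function type_mismatch_message(name::AbstractString, value, expected::Type)
    value isa expected && return nothing
    return "'$name' is of type '$(typeof(value))' when it needs to be a $expected"
end

# Hypothetical macro: captures the argument's source text before evaluation.
macro fancy_error(var, expected)
    name = string(var)  # e.g. "my_var", taken from the call site
    return quote
        local msg = type_mismatch_message($name, $(esc(var)), $(esc(expected)))
        if msg !== nothing
            @warn msg
        end
        msg
    end
end

my_var = [1, 2, 3]
@fancy_error my_var AbstractDict          # warns and names `my_var`

# Inside a wrapper, the macro only sees the local name `df`.
check_inside(df) = @fancy_error df AbstractDict
check_inside(my_var)                      # warns and names `df`, not `my_var`
```

The macro returns the message (or `nothing`) so the result can be inspected as well as logged.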

joemiller · January 18, 2020, 8:53pm

Yeah. From the blog post, that’s what I thought the intention was, so users can know which familiar global object is causing the problem.

If you don’t need that, in R, you could just paste(df, "is of type...") without using any non standard evaluation.

Tamas_Papp · January 19, 2020, 7:42am

I disagree with almost everything in this blog post. Specifically,

R’s native data structures are seriously lacking: we are talking about lists, and vectors with very limited element types (boolean, int, double, complex, string, “raw”), to which you can attach metadata. Almost all of R’s “native” data structures are conventions about this metadata. This is indeed “stable”, but seriously constraining when it comes to writing organized, performant code.
Non-standard evaluation (basically, functions getting a bit of context) was a very appealing idea when originally introduced, but it turns out not to compose well, and it makes efficient compilation impossible.
In theory, R is eminently suitable for functional programming (a lot of parts were inspired by the Lisp family). But in practice, higher order functions and closures in native R code almost always imply a huge sacrifice in performance, so they are not used. People usually end up coding Fortran/C++ instead and calling it from R.

All of these points are of course well known. R users just work around them — this may be a reasonable choice when R has other advantages for some application.

Turning to

I am not sure this is desirable, or why it should happen. I am surprised that someone who considers himself a “professional programmer” calls the command line “bullshittery”, but it summarizes the attitude nicely.

Open source communities thrive when they have contributors who are not merely users. If people are reluctant to get their hands dirty, I am not sure others will be inclined to tailor the software they write to those people’s needs.

I think that Julia coders should write packages that they find useful and are proud of.

Gunter_Faes · January 19, 2020, 9:56am

I have been using R for many years and I know its weak points; the performance problem in particular is well known. On this I agree with all the posts, and it is my personal reason for using Julia. But on this point …

… I would like to add my impression that a functioning and successful community also includes those members who, through good example and daily use of Julia (perhaps as data scientists?), show that Julia is a very good tool for all challenges.

I think that Julia coders should write packages that they find useful and are proud of.

Yes of course, but that says nothing about the quality.

oheil · January 19, 2020, 10:09am

And I would like to add that the package system (CRAN and derived systems like Bioconductor) is in no way superior to, or more stable than, what Julia has. After years (>15) of R and Bioconductor, I had countless unresolvable issues with incompatible package versions.

But on the other side: I started to answer here at the very beginning but canceled it. The main reason is that those discussions (R vs Python, Julia vs R, Java vs C++ vs C#, …) are typically not very enriching or rewarding. They end with everybody having some valid points and nothing being learned. In the end it is all about taste.

Gunter_Faes · January 19, 2020, 11:54am

The main reason is that those discussions (R vs Python, Julia vs R, Java vs C++ vs C#, …) are typically not very enriching or rewarding. They end with everybody having some valid points and nothing being learned.

I agree with this if the discussion is held in this community. Nevertheless, I think that we should not close our eyes to such conversations, because these thoughts might broaden the acceptance and use of Julia. I think the use of Python and R is currently quite “overwhelming” (for data analysis). I am trying my best to change that…

Tamas_Papp · January 19, 2020, 4:09pm

I don’t think it is that much about taste — one’s decision is arguably subjective, but there are objective features of languages one can talk about meaningfully.

The problem is that these kinds of discussions are mostly meaningful if all participants are at least reasonably well-versed in both languages being compared, which is indeed rare. But when it happens it can be quite informative to read.

Certainly. OTOH, there are always people who feel they are entitled to high-quality, polished, and importantly free software tools, and are affronted if they are asked to contribute anything, or, heaven forbid, use a command line or isolate an MWE. We are lucky because this behavior is not very common in the Julia community. I hope this will remain so.

Gunter_Faes · January 19, 2020, 6:09pm

Oh, excuse me! I’m afraid I expressed myself clumsily and in a way that was easy to misunderstand.

I didn’t want to give the impression that anyone in the community is just using the output and making demands. My intention was to give my impression that a community and ultimately Julia can benefit from having community members, who may not be some of the top package developers, demonstrate that Julia is successfully and beneficially used in their daily work.

tbeason · January 19, 2020, 6:48pm

Are you worried they don’t exist or something? The number of “top” package developers / core contributors is probably no more than 100 (weak estimate). Based on the last Julia Computing newsletter, Julia was downloaded ~5.5 million times last year. There are a lot of Julia users who are “just users”. I would conjecture that a nontrivial fraction of those users live on the edge of “using” and “developing”. To a large extent they just use the major packages, but if they feel up to it they might collect some useful code into a package for their own use. That is not going to be super visible to the casual outside observer, and I think that’s ok.

alejandromerchan · January 21, 2020, 6:20pm

Every time I see these conversations on this forum I wonder if anybody here used R v 0.X (was that even available?) or Python. I first saw R at around version 2.4 sometime in the mid-2000s, I believe, and I remember having a lot of problems loading data; there were some tutorials, but not a ton, and the same went for books. Obviously, for someone getting into R today the situation is quite different and the information is everywhere.

So, we just need to keep growing as a community. We all know this. These discussions are interesting, but the reality is that a lot of those things are not going to magically change. Unless some company adopts Julia as its main language and starts pumping serious money into the ecosystem, most work will be done by people using Julia for personal projects, often on the side. So, progress will be piecemeal.

And if someone can remember R or Python in the time of versions 0.X or 1.X, I’ll appreciate any history, maybe in another thread.

complyue · January 23, 2020, 8:26am

Yes. I can add that for the giant Oracle, I had never heard of its usage before version 8; then version 8 went everywhere, making serious money for it.
