What types of values were in the data you were using for your benchmarks?
(This is of great interest to me, as I’d like to make sure that when it comes to string handling performance, Julia can at a minimum match the fastest in other languages, and I think with a bit of work, Julia’s performance can blow the others away)
What types of values were in the data you were using for your benchmarks?
I studied group-by strategies in my post and I have used 4 threads to beat data.table by 30% (could be my lack of threading skills also).
The hard case is strings. But I have yet to quantify how much is due to Julia not having “interned” strings by default, as R does. Strings are also an issue for Python/pandas, so Julia is not alone.
Other examples: data.table has fast merging algorithms, fast reading of CSVs and fast lag/lead operations, to name a few.
Yeah, but I think there might be “edge” cases. E.g. R and Python use a lot of C, so Julia vs R is really Julia vs C via R. There is no guarantee that Julia can be faster than C. But of course, observing from the outside, Python and R are hampered by issues like the lack of multithreading (a big deal for modern PCs with multi-core CPUs), the GIL, and type uncertainty (Julia has this as well, but ensuring type stability will speed up code in Julia).
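To make the type-stability remark concrete, here is a minimal sketch (the function names are made up for illustration, not from any package). The first version starts its accumulator as an Int and then adds Float64 values to it, so its return type depends on whether the input is empty; the second keeps one type throughout.

```julia
# Type-unstable: `total` starts as an Int and turns into a Float64
# after the first addition, so the return type depends on the input.
function unstable_sum(xs::Vector{Float64})
    total = 0
    for x in xs
        total += x
    end
    return total
end

# Type-stable: the accumulator has the element type from the start.
function stable_sum(xs::Vector{Float64})
    total = zero(eltype(xs))
    for x in xs
        total += x
    end
    return total
end

unstable_sum(Float64[]) isa Int      # true: empty input returns the Int 0
stable_sum(Float64[]) isa Float64    # true: always a Float64
```

In the REPL, `@code_warntype unstable_sum([1.0, 2.0])` should highlight the accumulator as a Union of the two types, which is the usual way to spot this kind of problem.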
From my research so far, having fast access to the underlying binary representation and an efficient memory layout for strings (& interning?) is key to unlocking fast string operations. Ideally, fast strings should be the default and something that a non-computer-scientist like myself shouldn’t have to think about; so I really appreciate the work you are doing in
BTW for really short strings (4 bytes), I got radixsort in Julia to be within a factor of 0.8 to 1.1 of R for length(string_vector) < 100_000_000. So it’s progress already.
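A rough sketch of why interning (or pooling) helps with group-by on strings: once each distinct string is mapped to a small integer code, grouping becomes integer indexing instead of repeated string hashing and comparison. `pool_strings` below is a hypothetical helper written for this illustration; it is not how R or any Julia package actually implements it.

```julia
# Hypothetical helper: replace each string by an integer code that
# indexes into a pool of the distinct values (a simple form of interning).
function pool_strings(strings::Vector{String})
    codes = Vector{UInt32}(undef, length(strings))
    lookup = Dict{String,UInt32}()
    pool = String[]
    for (i, s) in enumerate(strings)
        # get! runs the do-block only the first time `s` is seen
        codes[i] = get!(lookup, s) do
            push!(pool, s)
            UInt32(length(pool))
        end
    end
    return codes, pool
end

words = ["id1", "id2", "id1", "id3", "id2", "id1"]
codes, pool = pool_strings(words)

# Group-by count on the codes: plain array indexing, no string work.
counts = zeros(Int, length(pool))
for c in codes
    counts[c] += 1
end
```

The strings are hashed once while building the pool; every later group-by, sort or join on the same column can then work on the integer codes.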
That is true, but what won me over (2.75 years ago, evaluating Julia vs. R vs. Octave vs. Python (variants) for work) was that very quickly (after some helpful performance tips from the Julia community, when I’d only been trying Julia for a few weeks) I was able to get basically the same performance as C, in many fewer (and more readable) lines of code. The code was generic, and the compiler actually produced code optimized for the different types being passed (UInt32). In C, I would have had to write 3 functions, with different names, one for each type I wanted to handle, and hand-optimize each of them.
So the productivity gain from using Julia, while not losing the performance that I need, has been critical for me.
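As a small illustration of that point (not the code from the post, which isn't shown here), one generic definition can serve several unsigned integer types, and Julia compiles a specialized version for each concrete type it is called with. The function name and the choice of UInt8 and UInt16 alongside UInt32 are assumptions made for this sketch.

```julia
# One generic definition. Julia compiles a separate specialized version
# for each concrete element type the function is called with.
function count_set_bits(values::AbstractVector{T}) where {T<:Unsigned}
    total = 0
    for v in values
        total += count_ones(v)   # count_ones works for every integer type
    end
    return total
end

count_set_bits(UInt8[0x01, 0x03])    # specialized for UInt8
count_set_bits(UInt16[0x00ff])       # specialized for UInt16
count_set_bits(UInt32[0xffffffff])   # specialized for UInt32
```

In C this would typically be three separately named functions (or a macro), each maintained by hand.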
I might rephrase that:
There is no guarantee that C can be faster than Julia.
That’s true, and it’s also true that we still have to write the good algorithms too.
And to me, that’s one of the (many!) great things about Julia. I can spend more time creating better data structures and algorithms (for the most part) in Julia, instead of spending my time optimizing many different cases separately. The combination of the type system, multiple dispatch and the compiler JIT/AOT generating specialized versions of methods, is just soooooo wonderful.
Happy New Year!
Look, Julia is a fantastic language, but R has its own strengths.
Base R was written by a statistician, so base R has always been slow.
But people started building packages on top of it to make it faster and faster. If you know which package to call for speed, you would not say these things.
data.table is fastest for data frame computation.
Matter is fastest for matrix calculation
Base plot function is faster than ggplot2
Speedglm for regressions
Rfast for almost everything
Rcpp for system level programming
And so on and so forth…
On top of this, they have implemented ALTREP and byte compilation in R. With projects like FastR and Microsoft R, sooner or later speed won’t be a significant advantage.
Because 95% of the R users, including me, use it interactively and thus microseconds don’t even matter that much. It’s the number of packages and calculations and ease of use of the language that makes it different.
But I do understand it’s a statistical platform and it has limitations.
For example, I can’t build websites,
or apps, or hack a network, or do anything like system-level stuff and more.
I like Julia. I am trying to learn it. It’s a beautiful language. But R is something I have been using for about 5 years and it has served me well. It’s a very nice language with more than 14k packages, including Bioconductor and GitHub. Even Python doesn’t come close to its statistical prowess.
So please refrain from directly comparing it to R. It’s a niche language and is the best in its segment.
If all the performance-critical code you will ever need has already been written for you in the form of a package, then you’re right that you don’t need to care about the language performance so much. Programming would be a lot easier if that were true, but if you do enough technical computing then before long you run into a problem where performance matters and the existing libraries are inadequate, and then you hit a wall in a language like R.
The semantics of R (and its data structures and standard libraries) make it intrinsically difficult to write effective general-purpose compilers for it, so the situation is unlikely to improve any time soon. Python and Matlab face a similar issue; a huge amount of resources has been poured into compilers for them, but still only a narrow range of code can run fast.
Check how many of the packages that you mention are written in R. I assume the answer is zero.
Roughly 2/3 is written in C. I’d assume most of the 1/3 of R is actually glue code.
Also, note that you can call these packages (maybe with some overhead) from Julia as well.
You might also want to check Bio.jl, even though it’s probably not as complete as the R ecosystem and still WIP, I find it much nicer to work with for bioinformatics.
For example, in R, if you want to read a FASTQ file you need to install ShortRead and use the function readFastq; if you want to read a BAM you need to install Rsamtools and use the function scanBam; GFF? Go with read.gff. You get the idea.
In Bio.jl you have a single, generic interface to read and write files. E.g.
open(FASTA.Reader, file)
open(BAM.Reader, file)
I’m sure you can guess how you open a GFF file. And of course it’s also faster, easier to understand and extend, has some proper structs with generic methods to hold your data instead of lists of lists of lists, etc.
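For readers wondering how one generic interface can cover many formats, here is a toy sketch of the idea using multiple dispatch. It is not Bio.jl's implementation: `ToyReader`, `ToyFastaReader`, `ToyLinesReader`, `open_reader` and `read_records` are all made-up names, and the FASTA parsing is deliberately simplistic.

```julia
abstract type ToyReader end

struct ToyFastaReader <: ToyReader
    io::IO
end

struct ToyLinesReader <: ToyReader
    io::IO
end

# One generic entry point: the reader type selects the format.
open_reader(::Type{R}, io::IO) where {R<:ToyReader} = R(io)

# FASTA-like parsing: a ">" line starts a record, later lines are sequence.
function read_records(r::ToyFastaReader)
    records = Pair{String,String}[]
    name = ""
    seq = IOBuffer()
    started = false
    for line in eachline(r.io)
        if startswith(line, ">")
            started && push!(records, name => String(take!(seq)))
            name = String(strip(line[2:end]))
            started = true
        elseif started
            write(seq, strip(line))
        end
    end
    started && push!(records, name => String(take!(seq)))
    return records
end

# Plain text: every line is a record.
read_records(r::ToyLinesReader) = collect(eachline(r.io))

fasta = open_reader(ToyFastaReader, IOBuffer(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"))
fasta_records = read_records(fasta)

lines = open_reader(ToyLinesReader, IOBuffer("a\nb\n"))
line_records = read_records(lines)
```

Adding a new format means adding one struct and one `read_records` method; the calling code stays the same.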
I also find the same thing is true for statistics, which is the other domain where R is supposedly very good.
Julia has never really positioned itself against all other languages at all costs — instead, it bootstrapped itself through strong interop to stand on the shoulders of giants where giants be stood.
First, you can use R to make websites using Shiny.
Second, other implementations of R are still slower than Julia, see for example:
Third, the speed difference between R and Julia is far more than microseconds…
I really like data.table, hopefully it will add out-of-memory support in the near future so that R can kick out SAS for good. Julia is aiming at a different niche, but that does not mean we should not say it has some better language features over R…
Thanks for replying on the post but I am not trying to offend either of the groups. I am just trying to give a perspective which is pro R.
@stevengj I totally agree with your point. But on those occasions Rcpp is a great candidate. And it’s fast too…
Speed matters only when you are trying to build something for the use of other people. When you just need interaction, it doesn’t matter much. I hope Julia becomes the next default for stats.
@crstnbr you are thinking like a programmer and not an analyst. We need to get the job done. It doesn’t matter if it’s SAS or Excel or Python underneath. Even Node.js is written in C++. It’s the default for languages to go to C++. Julia has some code written in Python too. Things like this don’t matter. Different programming languages have to find a sweet spot to live together.
@jonathanBieler thanks for this wonderful reply. I would surely go through it.
@mbauman I love both languages and I just replied because I think no one defended R well in any of the posts.
@Yifan_Liu my friend,
Firstly, you can’t write websites like Facebook, Twitter or Amazon in R.
Secondly, speed mostly doesn’t matter in an interactive environment, where you wouldn’t even notice a 2 to 3 second difference. For anything more than that, you would find a way to solve it rather than start learning a new language.
Thirdly, today R acts as a glue language and I can throw all my computation to Apache Spark, Hadoop, Postgres, H2O and others. There are packages like modeldb, which pushes regression computation to databases, and dbplot, which pushes graphics calculation to databases, and many more. This means the speed of R itself matters less than that of the databases and computation engines.
Now please tell me Julia is a good language, an easy language, and I would accept all your arguments. Just stop emphasizing speed so much.
I love both languages and I wrote these answers just to be a devil’s advocate. Hope you don’t mind.
Rcpp and Cython and similar still force you to drop down to a lower level language, which still entails a huge jump in technical complexity, a loss of programmer productivity, and a loss of ability to write generic code.
Speed matters only when you are trying to build something for the use of other people
No. Many people doing technical computing are trying to push the boundaries of problem size and complexity, running calculations that take hours or days. It’s not all interactive exploration with toy problems.
This is why I switched to Julia. I kept having experiences where a graduate student would prototype something in Matlab or Python, and then once it was working we wanted to scale it up to realistic problems (3d PDEs in my case). At that point we would have to rewrite the code in C, a process often requiring many months and a huge learning curve for students unused to working in low-level languages. With Julia, performance optimization still requires some deeper understanding, but can typically be carried out with localized tweaks.
But my point is that you are overemphasizing this. Every R user knows it’s slow. But if your dataset and computation can fit within RAM, it’s a great tool.
I am trying to tell you that it is a great programming language and has far more benefits than shortcomings.
Think about all the people who, like me, are not from a CS background and just want to analyze some numbers. People from stats, finance, economics, biology, business, engineering and so on… It’s easier for them.
In the R community speed is not the main concern. People still use dplyr instead of data.table.
All I am saying is that Julia is an excellent choice, but R is not bad either…
That’s obviously true, since it is a very successful language, and has a large number of users. I don’t think anyone, even on these forums, would dispute that. I personally am very fond of R, and even still use it occasionally.
That said, R has a very long history (if you trace its origins from S), and has accumulated quite a lot of baggage and quirks over the years, to the point where 10 years ago, one of its original authors suggested starting over again from scratch.
These choices aren’t just about user facing functionality which can be improved by packages (such as the tidyverse), but cut to the core of the language and affect the sorts of optimisations that are possible. If you’re interested, I recommend the talk “Making R run fast” by Jan Vitek, in particular the various efforts that have been attempted over the years.
The arguments for Julia aren’t just about performance. I have found that things like parametric types, multiple dispatch and metaprogramming do make it easier and clearer to solve certain problems (in fact my first introduction to multiple dispatch was via R’s S4 classes, though it is much more cumbersome to use than in Julia).
I couldn’t agree more.
I have read those articles before. And I am well aware of the shortcomings too. This is the main reason I am on this forum: trying to learn new languages like Julia, Go and C++ so that I can use them where necessary. I know we all might have to switch from R at some point and I am trying to invest in that.
But deep down R is my first love and I can’t stand somebody berating it.
In fact I came to this forum to see if Julia has any decent framework for building dashboards that is half as easy as Shiny. And I have been following this community since then.
And I am waiting to switch my Shiny code to Julia, because hosting Shiny Server costs a lot more money and ShinyProxy is very hard to configure.
So far I am just waiting. I agree with everything you said. But I love R too… I know it is slow, but in return it gives you great flexibility. I can change the = or <- operator in R to do something different.
So I am like on both sides.
I think you are missing the fact that despite some comments which are inevitable in a context like this, berating one language or another is hardly the point of this discussion.
This topic is mostly about the various advantages and disadvantages of Julia vs other languages. If you can’t stand the discussion of R’s shortcomings, you should not be reading it.
This is very wrong. Speed is important to many people, even those who only develop code for their own use. Speed could be the difference between a usable and an unusable simulator or data processing tool.
In the data browser I have developed for my own use, 2-3 tenths of a second would make it hard to use. 2-3 seconds would be completely unusable. Would you watch a tennis match with a framerate of 0.3 frames per second?
I don’t have a CS background, and sometimes I just want to analyze some numbers. But if it’s gigabytes and terabytes of numbers (and beyond) then speed matters crucially.
It’s fine that speed doesn’t matter to you. To many it hardly matters much. But you cannot generalize that to “speed shouldn’t really matter to anyone”, which you are doing a bit, actually.
Actually I made disk.frame, which has out-of-memory functionality, and I only made it while waiting for JuliaDB.jl to be ready! I will definitely benchmark it against JuliaDB.jl and Dask a little bit.
I think I started off on the wrong foot, even though I tried my best to control my tone.
Look, if I am working with terabytes of data I would throw the calculation to the database or Spark. I would never use R for that.
I won’t reply to this anymore. You are missing the main point here. What about people like me who don’t care much about speed? In academia, people don’t use terabytes of data. You should have enough packages and other points to convince those folks too. You shouldn’t just rely on the speed argument all the time. Languages like Haskell and Wolfram are not used by the majority even though they are fast. I am already convinced that Julia is good, but if somebody else asks you in the future, please don’t just sell it on speed.
It was a nice discussion. I learned a lot more things. Thanks a lot