I’m putting together an Intro to Julia Jupyter notebook, and in my section on why you should learn Julia I’m emphasizing Julia’s performance. The notebook is for colleagues of mine who primarily use R, plus a few who use Python. I’m including a few simple benchmarks that look like this:
using BenchmarkTools
using PyCall
using RCall
a = rand(10^7)
@btime pybuiltin("sum")($a)   # Python's built-in sum, called via PyCall
@btime R"sum($a)"             # R's sum, called via RCall
@btime sum($a)                # Julia's native sum
When you execute this code, Julia absolutely obliterates Python and dramatically outperforms R as well. However, the question I will undoubtedly get is, “How can I be sure that it doesn’t take longer to execute Python/R code via the PyCall/RCall packages?” or something along those lines. People will obviously want to know whether this is a fair way to compare speeds, and I simply don’t know enough about how these packages work to answer those questions.
Does anyone here know whether this is a fair comparison, or whether there is indeed additional work taking place given that a was instantiated in Julia but is being operated on in the other language (or for some other reason)? Aside from telling them to measure the speeds themselves in their normal working environments, is there a good way to convince a skeptical crowd that these are legitimate comparisons?
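The best idea I’ve had so far is to convert the data ahead of time and benchmark only the foreign call itself, so any Julia-to-Python/R conversion cost is excluded from the timing. A rough sketch of what I mean, assuming PyCall (with NumPy available) and RCall are set up; npsum, np_a, and r_a are just illustrative names:

using BenchmarkTools, PyCall, RCall

a = rand(10^7)

# Convert the data once, outside the benchmark, so the timings below
# measure only the call itself and not any Julia-to-Python/R conversion.
npsum = pyimport("numpy").sum   # NumPy's sum as a Python function object
np_a  = PyObject(a)             # a converted to a Python object (a NumPy array when NumPy is available)
r_a   = RObject(a)              # a copied into R once

@btime $npsum($np_a)            # NumPy sum, no per-call conversion
@btime R"sum($r_a)"             # R's sum, no per-call conversion
@btime sum($a)                  # Julia's native sum, for reference

If those numbers line up with the original benchmark, the conversion cost is presumably negligible for an array of this size, but I’d still like confirmation from people who know these packages.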
Julia
julia> using BenchmarkTools
julia> a = rand(10^7);
julia> @benchmark sum($a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.706 ms (0.00% GC)
  median time:      4.215 ms (0.00% GC)
  mean time:        4.229 ms (0.00% GC)
  maximum time:     5.408 ms (0.00% GC)
  --------------
  samples:          1180
  evals/sample:     1
R
> library(microbenchmark)
> a <- runif(1e7)
> microbenchmark(sum(a))
Unit: milliseconds
    expr      min      lq     mean   median       uq      max neval
  sum(a) 8.633446 8.64609 8.781826 8.700741 8.792563 10.75872   100
Python
In [10]: import numpy as np
In [11]: np_a = np.random.rand(10**7)
In [12]: a = np_a.tolist()
In [13]: %timeit np.sum(np_a)
4.08 ms ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [14]: %timeit sum(a)
35.8 ms ± 157 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance is nice, but it’s not the primary reason I use Julia. Multiple dispatch, the design of the type system, the support for functional programming, and the ecosystem of numerical and scientific computing packages are what draw me to the language.
Converting Julia arrays to Python lists takes some time because they have completely different memory layouts. Passing Julia arrays to Python as NumPy arrays, however, is usually very fast.
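As a rough illustration (PyCall with NumPy available is assumed; pylist is just an illustrative name), you can time the two paths directly:

using BenchmarkTools, PyCall

a = rand(10^7)
pylist = pybuiltin("list")        # Python's built-in list constructor

# Passing the Julia Vector{Float64} to Python yields a NumPy array,
# which is far cheaper than building a Python list element by element.
@btime PyObject($a)               # Julia array -> Python/NumPy array
@btime $pylist($a)                # Julia array -> Python list

The list constructor has to walk all 10^7 elements and box each one as a Python object, which is where the time goes.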
In my experience, the PyCall overhead is well under 1 ms when no significant amount of data is transferred. To be on the safe side, I suggest cross-checking the Python and R benchmarks in native Python/R notebooks.
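To get a feel for that fixed per-call overhead, you can time a Python call that does essentially nothing, so only the round-trip cost is measured. A minimal sketch; the no-op lambda is purely illustrative:

using BenchmarkTools, PyCall

# A Python function that does no work: timing it isolates the
# Julia -> Python -> Julia round-trip cost of a PyCall call.
py_noop = py"lambda: None"
@btime $py_noop()

Whatever the result, it can be compared directly against the millisecond-scale sum timings above.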
I did a comparison of Julia to Python for DataFrames; maybe it is useful for you: