Julia vs R vs Python

r3tex · July 21, 2017, 11:00am

Hi! I’ve written a small blog post for my acquaintances here in Stochkolm highlighting that the performance difference across Julia, R, and Python might not be negligible. The problem is that for the example I chose, Python and R are not just a bit slower, they’re ridiculously slow in comparison, and this isn’t even a “showcase” algorithm for Julia. I just want to know if there’s some glaring error in the Python and R versions.

https://github.com/LakeTide/share/blob/master/juliaperf/JuliaVsPythonVsR.ipynb

Per · July 21, 2017, 12:05pm

Well, of course Julia is going to be the fastest when you’re rendering the Julia set. To be fair to the other languages you should try rendering the Python set or the R set…

cstjean · July 21, 2017, 12:48pm

All fundamentally-iterative algorithms are “showcase” for Julia, mostly because they perform terribly in interpreted languages. To be fair to Python numerical users, you might try rewriting fractal() with Numba. Or pyjulia

stevengj · July 21, 2017, 1:38pm

The Julia version can even be made a lot faster (more than a factor of 3 on my machine) by replacing the abs(z) > 2 check with abs2(z) > 4 to avoid the hypot call (which involves expensive square root and overflow checks).

(There are a few other opportunities for performance improvement in the Julia code, e.g. using z = Complex(3*(x*winv-0.5), 3*(y*hinv-0.5)) to avoid the divisions, but I only gained an additional 7%, so I wouldn’t bother.)

Note that your fractal function should probably return lod+1 at the end (or return i-1 in the loop, or even better loop from 0:lod-1). Otherwise, it returns lod if either abs(z) > 2 on the last iteration or if the loop completes.

Tamas_Papp · July 21, 2017, 1:57pm

To be fair: most R users who dipped their toes into some basic programming (as opposed to using canned algorithms) know this. R was never sold on its speed.

stevengj · July 21, 2017, 2:01pm

A lot of people vaguely know that writing inner loops in Python/R/Matlab is slow, but they aren’t aware that they are often sacrificing 1–2 orders of magnitude. I remember showing a colleague’s group this benchmark graph, and after a few seconds his jaw dropped and he said … “Wait, that’s a log scale?!?”

(It’s also important to know why Python etc. loops are so slow. See this notebook for benchmarks and discussion of a simple sum function comparison and discussion of the role of types.)

stevengj · July 21, 2017, 2:17pm

There’s a big difference between vaguely knowing that something is suboptimal and knowing how slow it really is. In my experience, the vast majority of users/programmers don’t know the latter, because they never compare performance against a truly well-optimized implementation in a high-performance language.

(If they had a decent library implementation of a function, most users would never write their own. And if they do optimize their code, most users are happy if they get a factor of 2 or 4, and never suspect that they might be leaving another factor of 10 on the table. They often assume incorrectly that a vectorized version of their code is optimal. Comparing to library routines is difficult anyway since you don’t necessarily understand what algorithm the library uses. Only a tiny fraction of R/Python/Matlab users ever experience rewriting a critical algorithm in C or similar.)

At the root of this problem is that most programmers have no idea how a computer actually works, and have no idea why Python loops are slow. They think “it’s interpreted, not compiled” is the beginning and end of the story. Or worse, I often find that scientists’ mental model of a CPU is a box that does floating-point arithmetic at a fixed rate, and so they don’t understand how code could run much faster without reducing the number of floating-point operations.

Tamas_Papp · July 21, 2017, 2:36pm

You make a valid point, but there are always trade-offs. Many scientists would rather devote their attention to doing science, instead of learning about computer architecture. The fact that nowadays you need to program computers to do certain (most?) kinds of science is already a large mental cost. Members of the Julia community already like programming computers and fiddling with code, so our view is distorted, but for many scientists it is a nuisance.

If Julia can bring “within 10x of C with a few simple rules” to the masses, and leave “1-2x C” to the experts who are willing to dig deeper (and learn about things you mentioned), it will have advanced scientific programming considerably.

RoyiAvital · July 21, 2017, 2:55pm

I wrote this code on MATLAB (R2017a):

dtlLvl  = 1000;
numRows = 6000;
numCols = 6000;
mBitMap = zeros([numRows, numCols]);

paramGoldenRatio = (1 + sqrt(5)) / 2;
paramC = (paramGoldenRatio - 2) + (paramGoldenRatio - 1) * 1i;

tic();
mBitMap = JuliaFractal(mBitMap, numRows, numCols, paramC, dtlLvl);
toc();

figure();
imagesc(mBitMap);


function [ mBitMap ] = JuliaFractal( mBitMap, numRows, numCols, paramC, dtlLvl )
%JULIAFRACTAL Summary of this function goes here
%   Detailed explanation goes here

for jj = 1:numCols
    for ii = 1:numRows
        valZ = complex((3 * (jj - numRows / 2) / numRows), (3 * (ii - numRows / 2) / numRows));
        mBitMap(ii, jj) = CalcFractal(valZ, paramC, dtlLvl) / dtlLvl;
    end
end


end


function [ dtlLvl ] = CalcFractal( valZ, paramC, dtlLvl )
%JULIAFRACTAL Summary of this function goes here
%   Detailed explanation goes here

for ii = 1:dtlLvl
    if(abs(valZ) > 2)
        dtlLvl = ii;
        return;
    else
        valZ = (valZ * valZ) + paramC;
    end
end


end

On my Computer (Intel i7 6800K @3.4 GHz) it takes ~40 [Sec].
So it is ~x10 Slower than Julia.

MATLAB is getting better and better on doing this kind of things (Looping) leaving Python and R behind and getting closer to Julia.

@stevengj, Does Julia vectorize the code posted by @r3tex?
If it does, it explains what MATLAB doesn’t do (But to the least doesn’t add overhead).

stevengj · July 21, 2017, 3:34pm

No. Vectorization has absolutely nothing to do with the performance differences here. (One of the bad results of programming too much in Matlab etc. is that it falsely teaches one to view “vectors good, scalars bad” as the central “fact” of performance.)

Matlab’s compiler is indeed getting better, and Python compilers are getting better too (e.g. it would be interesting to try PyPy with the OP’s example). But the semantics of those languages make them intrinsically hard to compile to fast code, and even harder to compile reliably (i.e. so that you can look at a piece of code and have a good sense of how well the compiler can do on it), especially for type-generic code (i.e. code not limited to working on a few “built-in” array and scalar types), so it will be hard for them to ever fully catch up to Julia.

RoyiAvital · July 21, 2017, 3:44pm

@stevengj, How far down the road will Julia automatically vectorize the above code?
Are we far from there?

Is there a simple way to vectorize it now in Julia?

stevengj · July 21, 2017, 3:45pm

Why do you want to vectorize it? Vectorizing it in the Matlab sense (i.e. trying to compute the whole Julia set at once using array operations) will almost inevitably make this particular algorithm slower. Or are you referring to SIMD?

RoyiAvital · July 21, 2017, 3:51pm

Yes, from the beginning I was talking about generating vectorized CPU code not vectorized in the sense of vextorized MATLAB Code.

So is the code of @r3tex is vectorized by Julia for the CPU?
If it is not, when will Julia be doing it automatically?
Is there a way in Julia 0.6 to make Julia generate vectorized code?

Thank You.

stevengj · July 21, 2017, 4:07pm

This is mainly an LLVM issue and not a Julia issue. Yes, LLVM can automatically SIMD-vectorize code in some cases. You can enable SIMD for a particular loop by using the @simd macro, and you can also turn on SIMD and other optimizations by running julia -O3.

However, it isn’t always successful. For example, on the fractal code above, @simd made no difference and julia -O3 actually makes it slower on my machine, not faster. (Again, an LLVM issue not a Julia one.) This particular loop is going to be hard to automatically SIMD vectorize because of the dependencies in the innermost loop, and the fact that different z require different numbers of iterations in the innermost loop.

(SIMD vectorization has absolutely nothing to do with the language performance differences in this thread!)

To get better SIMD vectorization of this kind of fractal code is tricky, I’m guessing that for the next few years you would have to do things manually e.g. via SIMD.jl. (I suppose you could restructure the loops to speculatively execute multiple iterations at once, or iterate for multiple z at once, and then backtrack once escape is detected. I seriously doubt that any compiler will ever be able to do this for you in the foreseeable future.)

r3tex · July 21, 2017, 7:53pm

Thanks for all your answers! I’ll incorporate many of the learnings from here and the included links, and am especially happy for the big performance improvement from using abs2.

FYI, I just tested using PyPy and it runs in about 11 seconds. ~5x slower than Julia

stevengj · July 21, 2017, 8:52pm

(Note that PyPy is, unfortunately, pretty hard to use for technical work because it is incompatible with many of the key scientific libraries, from SciPy to TensorFlow. One of the key difficulties for Python, Matlab, and R compilers is that it is hard for them to make things faster without breaking backwards compatibility.)

RoyiAvital · July 22, 2017, 6:56am

11 Seconds?
That’s really nice!

I think it is only ~2.5 slower than Julia (As I think PyPy is single threaded).

@stevengj, I wish someone would reconsider the compatibility thing.
Really, move forward, make things move forward.
Compatibility isn’t the most important thing as long as you keep a way to run old code.

For instant, Mathworks have MCR of all versions.
Since it is there, I would be happy if they break compatibility in the name of performance.

John_Hearns · July 22, 2017, 10:27am

Steven, Bullseye.
Panasas (storage company) haev an excellent term - Time to Solution. That is what counts - how fast you can get the results for your bank’s overnight risk analysis run/ the simulation which will win your Nobel Prize/rendering your Youtube cat movies. (Delete as appropriate).
I will counter with ‘Good enough for Jazz’ - I guess most of us mortals start by writing a ‘quick and dirty’ script which is only meant to be run once, then ends up taking on a life of its own.
(A cat just came to sit on my keyboard - cat kind are taking revenge for my remark above)

I guess though - stretching the Jazz analogy, the piano IS important and Julia gives is a good piano to start with.

StefanKarpinski · July 22, 2017, 8:48pm

I can’t tell from this if you’re encouraging the Python community to drop compatibility and move forward with PyPy (this is very unlikely to happen since the core developers remain committed to CPython as the implementation of Python), or if you’re exhorting us (the Julia community) to reconsider some compatibility thing.

On the subject of compatibility and how language design choices make it hard for some languages to run fast, it seems to be generally very poorly understood just how hard the designs of languages like Python and R are for really high performance implementations. If you’re interested, here are two talks on the subject:

Armin Ronacher (one of the creators of PyPy), “How Python was Shaped by leaky Internals”:

Jan Vitek, “Making R run Fast”:

This talk contains the following key quotation by Jan:

I haven’t seen a language that is as hard to optimize as R. Somebody should give the R designer that particular prize as probably the hardest language to optimize… Compared to R, JavaScript is a beauty of elegance and good design.

The title of this talk is largely aspirational (they haven’t made R run much faster), although Jan’s research lab has a variety of projects trying to improve R’s performance.

RoyiAvital · July 22, 2017, 9:03pm

I meant that compatibility is over rated.
Sometimes it gets in the way of the real object of the project and becomes the object by itself.

For instance you can see Intel is now starting to pay for its fanatic compatibility religion when new architecture which doesn’t hold on their shoulders ~35 years of compatibility and taking advantage of lessons learned are kicking it.

So my point is that I hope, in the future, if compatibility issues stand in the way of Julia to achieve great performance improvements the choice would be to freeze a version which is compatible for anything until this day and create a new point.

I understand it should be something which promise great improvement (No one would break is for meaningless improvement), I just hope compatibility won’t become religion and the main objective of the project.

If Julia objective is to break the 2 languages problem, it means it should do what’s needed to achieve that.

Topic		Replies	Views
Julia motivation: why weren't Numpy, Scipy, Numba, good enough? Community history	123	83432	September 21, 2018
Results regarding Julia from HackerRank developer skills report Community	26	3386	January 28, 2018
Comparing with Python General Usage	56	4831	May 30, 2018
R and Python together in RStudio Offtopic	31	4548	December 24, 2019
Ensure Julia is used to its full power New to Julia question , first-steps	100	7040	December 11, 2020

Julia vs R vs Python

Related topics