Hi! I’ve written a small blog post for my acquaintances here in Stochkolm highlighting that the performance difference across Julia, R, and Python might not be negligible. The problem is that for the example I chose, Python and R are not just a bit slower, they’re ridiculously slow in comparison, and this isn’t even a “showcase” algorithm for Julia. I just want to know if there’s some glaring error in the Python and R versions.
Well, of course Julia is going to be the fastest when you’re rendering the Julia set. To be fair to the other languages you should try rendering the Python set or the R set…
All fundamentally-iterative algorithms are “showcase” for Julia, mostly because they perform terribly in interpreted languages. To be fair to Python numerical users, you might try rewriting
fractal() with Numba. Or pyjulia
The Julia version can even be made a lot faster (more than a factor of 3 on my machine) by replacing the
abs(z) > 2 check with
abs2(z) > 4 to avoid the
hypot call (which involves expensive square root and overflow checks).
(There are a few other opportunities for performance improvement in the Julia code, e.g. using
z = Complex(3*(x*winv-0.5), 3*(y*hinv-0.5)) to avoid the divisions, but I only gained an additional 7%, so I wouldn’t bother.)
Note that your
fractal function should probably return
lod+1 at the end (or
return i-1 in the loop, or even better loop from
0:lod-1). Otherwise, it returns
lod if either
abs(z) > 2 on the last iteration or if the loop completes.
To be fair: most R users who dipped their toes into some basic programming (as opposed to using canned algorithms) know this. R was never sold on its speed.
A lot of people vaguely know that writing inner loops in Python/R/Matlab is slow, but they aren’t aware that they are often sacrificing 1–2 orders of magnitude. I remember showing a colleague’s group this benchmark graph, and after a few seconds his jaw dropped and he said … “Wait, that’s a log scale?!?”
(It’s also important to know why Python etc. loops are so slow. See this notebook for benchmarks and discussion of a simple
sum function comparison and discussion of the role of types.)
There’s a big difference between vaguely knowing that something is suboptimal and knowing how slow it really is. In my experience, the vast majority of users/programmers don’t know the latter, because they never compare performance against a truly well-optimized implementation in a high-performance language.
(If they had a decent library implementation of a function, most users would never write their own. And if they do optimize their code, most users are happy if they get a factor of 2 or 4, and never suspect that they might be leaving another factor of 10 on the table. They often assume incorrectly that a vectorized version of their code is optimal. Comparing to library routines is difficult anyway since you don’t necessarily understand what algorithm the library uses. Only a tiny fraction of R/Python/Matlab users ever experience rewriting a critical algorithm in C or similar.)
At the root of this problem is that most programmers have no idea how a computer actually works, and have no idea why Python loops are slow. They think “it’s interpreted, not compiled” is the beginning and end of the story. Or worse, I often find that scientists’ mental model of a CPU is a box that does floating-point arithmetic at a fixed rate, and so they don’t understand how code could run much faster without reducing the number of floating-point operations.
You make a valid point, but there are always trade-offs. Many scientists would rather devote their attention to doing science, instead of learning about computer architecture. The fact that nowadays you need to program computers to do certain (most?) kinds of science is already a large mental cost. Members of the Julia community already like programming computers and fiddling with code, so our view is distorted, but for many scientists it is a nuisance.
If Julia can bring “within 10x of C with a few simple rules” to the masses, and leave “1-2x C” to the experts who are willing to dig deeper (and learn about things you mentioned), it will have advanced scientific programming considerably.
I wrote this code on MATLAB (R2017a):
dtlLvl = 1000; numRows = 6000; numCols = 6000; mBitMap = zeros([numRows, numCols]); paramGoldenRatio = (1 + sqrt(5)) / 2; paramC = (paramGoldenRatio - 2) + (paramGoldenRatio - 1) * 1i; tic(); mBitMap = JuliaFractal(mBitMap, numRows, numCols, paramC, dtlLvl); toc(); figure(); imagesc(mBitMap); function [ mBitMap ] = JuliaFractal( mBitMap, numRows, numCols, paramC, dtlLvl ) %JULIAFRACTAL Summary of this function goes here % Detailed explanation goes here for jj = 1:numCols for ii = 1:numRows valZ = complex((3 * (jj - numRows / 2) / numRows), (3 * (ii - numRows / 2) / numRows)); mBitMap(ii, jj) = CalcFractal(valZ, paramC, dtlLvl) / dtlLvl; end end end function [ dtlLvl ] = CalcFractal( valZ, paramC, dtlLvl ) %JULIAFRACTAL Summary of this function goes here % Detailed explanation goes here for ii = 1:dtlLvl if(abs(valZ) > 2) dtlLvl = ii; return; else valZ = (valZ * valZ) + paramC; end end end
On my Computer (Intel i7 6800K @3.4 GHz) it takes ~40 [Sec].
So it is ~x10 Slower than Julia.
MATLAB is getting better and better on doing this kind of things (Looping) leaving Python and R behind and getting closer to Julia.
No. Vectorization has absolutely nothing to do with the performance differences here. (One of the bad results of programming too much in Matlab etc. is that it falsely teaches one to view “vectors good, scalars bad” as the central “fact” of performance.)
Matlab’s compiler is indeed getting better, and Python compilers are getting better too (e.g. it would be interesting to try PyPy with the OP’s example). But the semantics of those languages make them intrinsically hard to compile to fast code, and even harder to compile reliably (i.e. so that you can look at a piece of code and have a good sense of how well the compiler can do on it), especially for type-generic code (i.e. code not limited to working on a few “built-in” array and scalar types), so it will be hard for them to ever fully catch up to Julia.
@stevengj, How far down the road will Julia automatically vectorize the above code?
Are we far from there?
Is there a simple way to vectorize it now in Julia?
Why do you want to vectorize it? Vectorizing it in the Matlab sense (i.e. trying to compute the whole Julia set at once using array operations) will almost inevitably make this particular algorithm slower. Or are you referring to SIMD?
Yes, from the beginning I was talking about generating vectorized CPU code not vectorized in the sense of vextorized MATLAB Code.
So is the code of @r3tex is vectorized by Julia for the CPU?
If it is not, when will Julia be doing it automatically?
Is there a way in Julia 0.6 to make Julia generate vectorized code?
This is mainly an LLVM issue and not a Julia issue. Yes, LLVM can automatically SIMD-vectorize code in some cases. You can enable SIMD for a particular loop by using the
@simd macro, and you can also turn on SIMD and other optimizations by running
However, it isn’t always successful. For example, on the fractal code above,
@simd made no difference and
julia -O3 actually makes it slower on my machine, not faster. (Again, an LLVM issue not a Julia one.) This particular loop is going to be hard to automatically SIMD vectorize because of the dependencies in the innermost loop, and the fact that different
z require different numbers of iterations in the innermost loop.
(SIMD vectorization has absolutely nothing to do with the language performance differences in this thread!)
To get better SIMD vectorization of this kind of fractal code is tricky, I’m guessing that for the next few years you would have to do things manually e.g. via SIMD.jl. (I suppose you could restructure the loops to speculatively execute multiple iterations at once, or iterate for multiple
z at once, and then backtrack once escape is detected. I seriously doubt that any compiler will ever be able to do this for you in the foreseeable future.)
Thanks for all your answers! I’ll incorporate many of the learnings from here and the included links, and am especially happy for the big performance improvement from using abs2.
FYI, I just tested using PyPy and it runs in about 11 seconds. ~5x slower than Julia
(Note that PyPy is, unfortunately, pretty hard to use for technical work because it is incompatible with many of the key scientific libraries, from SciPy to TensorFlow. One of the key difficulties for Python, Matlab, and R compilers is that it is hard for them to make things faster without breaking backwards compatibility.)
That’s really nice!
I think it is only ~2.5 slower than Julia (As I think PyPy is single threaded).
@stevengj, I wish someone would reconsider the compatibility thing.
Really, move forward, make things move forward.
Compatibility isn’t the most important thing as long as you keep a way to run old code.
For instant, Mathworks have MCR of all versions.
Since it is there, I would be happy if they break compatibility in the name of performance.
Panasas (storage company) haev an excellent term - Time to Solution. That is what counts - how fast you can get the results for your bank’s overnight risk analysis run/ the simulation which will win your Nobel Prize/rendering your Youtube cat movies. (Delete as appropriate).
I will counter with ‘Good enough for Jazz’ - I guess most of us mortals start by writing a ‘quick and dirty’ script which is only meant to be run once, then ends up taking on a life of its own.
(A cat just came to sit on my keyboard - cat kind are taking revenge for my remark above)
I guess though - stretching the Jazz analogy, the piano IS important and Julia gives is a good piano to start with.
I can’t tell from this if you’re encouraging the Python community to drop compatibility and move forward with PyPy (this is very unlikely to happen since the core developers remain committed to CPython as the implementation of Python), or if you’re exhorting us (the Julia community) to reconsider some compatibility thing.
On the subject of compatibility and how language design choices make it hard for some languages to run fast, it seems to be generally very poorly understood just how hard the designs of languages like Python and R are for really high performance implementations. If you’re interested, here are two talks on the subject:
Armin Ronacher (one of the creators of PyPy), “How Python was Shaped by leaky Internals”:
Jan Vitek, “Making R run Fast”:
This talk contains the following key quotation by Jan:
The title of this talk is largely aspirational (they haven’t made R run much faster), although Jan’s research lab has a variety of projects trying to improve R’s performance.
I meant that compatibility is over rated.
Sometimes it gets in the way of the real object of the project and becomes the object by itself.
For instance you can see Intel is now starting to pay for its fanatic compatibility religion when new architecture which doesn’t hold on their shoulders ~35 years of compatibility and taking advantage of lessons learned are kicking it.
So my point is that I hope, in the future, if compatibility issues stand in the way of Julia to achieve great performance improvements the choice would be to freeze a version which is compatible for anything until this day and create a new point.
I understand it should be something which promise great improvement (No one would break is for meaningless improvement), I just hope compatibility won’t become religion and the main objective of the project.
If Julia objective is to break the 2 languages problem, it means it should do what’s needed to achieve that.