Comparing Python, Julia, and C++

broadcast
python

#1

I am going to present Julia at the next ADASS (http://adass2018.astro.umd.edu/), and I would like to show its
ability to fuse broadcasted operations like .+ and .*. I have found some weird results, so I would like to ask if you can help me in understanding what’s going on.

I created three codes in Julia 1.0, Python3+NumPy, and C++. In each code, I run simple computations on large arrays, with an increasing number of parameters. Here are the functions as defined in Julia:

f(x1, x2, x3, x4, x5, x6) = @. x1 + x2
g(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3
h(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4
i(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5
j(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5 + x6

Python functions are defined similarly, using NumPy arrays:

def f(x1, x2, x3, x4, x5, x6):
    return x1 + x2
    
# and so on

When each function is called, the six parameters are large arrays with 1M of elements. Each function is executed many times, and the minimum value is saved in a text file. The source codes of the Python, Julia, and C++ versions are available at https://github.com/ziotom78/python-julia-c-

Results are shown in this plot:

On the x axis, I report the number of parameters that have been actually used in the computation. So x=2 corresponds to function f, x=3 to function g, etc. On the y axis I report the minimum elapsed time as measured on my laptop (Lenovo Thinkpad Edge E540), a 64-bit system with 16 GB of RAM running Linux Mint 19 and GCC 7.3.0.

There are a few things that seem reasonable:

  • Julia is faster than Python
  • Python scales badly, as it does not fuse loops
  • Julia and C++ scale similarly

However, I cannot understand these features:

  • Julia is significantly faster than C++, even when using -O3 with g++. In order to help C++, I cheated and modified the C++ code so that functions f, g, etc. no longer allocate the vector containing the result, which is instead allocated before the benchmark starts (see the code on GitHub). However, as you can see from the plot, Julia is still the best!
  • For n = 2, C++ is the slowest solution
  • The cases for n = 2 and n = 3 show equal times in Julia; this is not a statistical fluctuation, as I repeated the test many times with varying number of runs. I wonder how this is possible.

Before showing this plot at ADASS, I would really like to understand everything. Can anybody give me some clue?


#2

Not sure about the original question, but for the Julia code you’re not actually initializing the input arrays:

This may result in subnormal numbers in your test inputs, which can adversely affect performance. See e.g. 50x speed difference in gemv for different values in vector.


#3

After initializing the arrays, it seems that Julia is actually closer to C++. On my machine:

C++:

Terms Speed [ms]
2 0.962909
3 1.80888
4 2.7132
5 3.42612
6 4.05574

Julia before:

Terms Speed [ms] Memory [MB]
2 0.82 7.63
3 0.83 7.63
4 0.93 7.63
5 1.21 7.63
6 1.57 7.63

Julia after:

Terms Speed [ms] Memory [MB]
2 1.35 7.63
3 2.21 7.63
4 2.87 7.63
5 3.55 7.63
6 4.18 7.63

Also:

  • I’d change things so that the Julia version also preallocates the result vector, if that’s what’s done in C++.
  • I’d use -march=native for C++ (maybe, depending on the goal of the benchmark)
  • I’d use size_t to index into the vectors in C++ (or use an iterator)
  • Julia could probably be made significantly faster using @simd for. While not as short as the current implementation, it’d be pretty much the same amount of code as C++.
  • Doing size checks once at the beginning and then using @inbounds will be a lot faster in Julia.

As always, the question is what the objective of the benchmark is, and what’s fair game in terms of tradeoff between optimizations, code readability, and other factors.


#4

I also tested this but I have faster python code than julia code:

Python

Terms Speed [ms]
2 0.52
3 0.92
4 1.29
5 1.71
6 2.22

Julia 1.0

Terms Speed [ms]
2    1.77
3    2.18
4    2.56
5    2.95
6    3.39

How can we have such a difference between @Maurizio_Tomasi results and the ones I post?


#5

I think you have multirheading turned on https://software.intel.com/en-us/forums/intel-distribution-for-python/topic/567669


#6

Sure I have it, don’t I also have it in julia by default? I was expecting matrix/vector operations to be using multithreading already in Julia as well.


#7

Broadcasting is not multithreaded in Julia.


#8

:blush: You’re right, this was so stupid! I even included Random, but then I forgot to use it in the initialization of x. I’m redoing the benchmark, I’ll post soon the new plot.


#9

Thanks for the suggestions, I implemented them and updated the code on GitHub. I also preallocated the vector in the Python code.

These were excellent suggestions, thanks a lot: this improves the speed of the Julia code and puts it on par with its C++ counterpart (which I compiled with -msse3, in order to use SIMD). However, I feel that the older version has still some value, as it is as succinct as the NumPy code, which is the term of comparison the audience will likely use. Therefore, I have produced two implementations of the Julia benchmark. Here are the updated results:

It seems that in this kind of calculations plain, naïve Julia scales better than NumPy, and with a bit of effort can be as performant as C++.


#10

If preallocation is a fair strategy, you can use the out parameter from Numpy like numpy.add(x1, x2, out=r) to eliminate all the intermediate arrays: https://docs.scipy.org/doc/numpy/reference/generated/numpy.add.html


#11

I don’t see @simd making any difference. But you can easily multithread your ‘simd’ code, by replacing @simd with Threads.@threads. Just remember to enable threading first.

That makes a big difference for me.


#12

Would be interesting to see the result of this, but it still won’t fuse the loops


#13

I wouldn’t go through the Threads route, as I feel that the test would compare apples with oranges. The purpose of the plot is to compare some reasonably simple code written in three languages used by astronomers. Complicating it too much would make it less understandable to the audience.

I have added @inbounds and @simd together, without trying to isolate the behaviour of both. Will try to run some more tests.


#14

This is a nice suggestion, I wasn’t aware of numpy.add's out parameters. But the purpose of my exercise was to show how easy-to-write code in NumPy can scale badly, while the same code in Julia behaves better. I fear that the code would become too complicated for the audience I’m targeting.


#15

But are you certain that numpy isn’t multithreading your computation?


#16

With even less effort (just add @numba.jit decorator to functions) I could get better than C++ performance. (it could be just my computer - I am curious about your results!)

It could be better scientific methodology to haven’t results before experiment! :wink:


#17

You could probably compare whole solution too. If you run your codes before audience people could see this:

$ time (g++ `gsl-config --cflags` -O3 -march=native -msse3 `gsl-config --libs` c++-speed.cpp && ./a.out)
...
real    0m4,467s
user    0m4,397s
sys     0m0,070s

$ time python python-speed-numba.py 
...
real    0m11,956s
user    0m11,879s
sys     0m0,057s

$ time julia julia-simd-speed.jl 
...
real    1m17,210s
user    1m17,062s
sys     0m0,344s

It suprised me that if I “hacked” BenchmarkTools.DEFAULT_PARAMETERS.samples = 1 !!! it still took long time:

real    0m42,249s
user    0m42,111s
sys     0m0,376s

#18

Looks more like a constant factor improvement to me, so both seem to scale equally (in contrast to the flatter curves of optimized Julia and C++). In fact, I wouldn’t have expected optimized Julia and C++ to scale better on identical algorithms (SIMD should only improve a constant factor), so maybe there are subtle algorithmic differences. Not sure if pre-allocation alone can cause this difference.

Sounds like comparing g++ to LLVM and possibly different compiler options? The only way I can see numba be faster in a single-thread setting.


#19

Or it’s just one of the performance bugs for broadcast on v1.0.


#20

Ah, sorry :confused: it was faster only by “epsilon” so I would rather say that all 3 versions (C++, python-numba and Julia-simd) showed equal (or very comparable) performance.

Julia is comparable with julia-simd-speed.jl which use for cycle:

function f(r, x1, x2, x3, x4, x5, x6, x7, x8)
    @inbounds @simd for i in eachindex(x1)
        r[i] = x1[i] + x2[i]
    end
end