Comparing Python, Julia, and C++

Maurizio_Tomasi · October 31, 2018, 6:03pm

I am going to present Julia at the next ADASS (http://adass2018.astro.umd.edu/), and I would like to show its ability to fuse broadcasted operations like .+ and .*. I have found some weird results, so I would like to ask if you can help me understand what’s going on.

I wrote three programs, in Julia 1.0, Python 3 + NumPy, and C++. In each one, I run simple computations on large arrays, using an increasing number of parameters. Here are the functions as defined in Julia:

f(x1, x2, x3, x4, x5, x6) = @. x1 + x2
g(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3
h(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4
i(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5
j(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5 + x6

Python functions are defined similarly, using NumPy arrays:

def f(x1, x2, x3, x4, x5, x6):
    return x1 + x2
    
# and so on

When each function is called, the six parameters are large arrays with 1M elements each. Each function is executed many times, and the minimum elapsed time is saved in a text file. The source code of the Python, Julia, and C++ versions is available on GitHub, in the repository ziotom78/python-julia-c- (Speed comparison among Python/NumPy, Julia, and C++).
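
The measurement described above (call each function many times, keep the fastest run) can be sketched in Python roughly as follows. This is only an illustration and not the code in the repository: the helper name `best_time_ms` and the repeat count are assumptions.

```python
import time
import numpy as np

def best_time_ms(func, *args, repeat=20):
    """Call func(*args) `repeat` times and return the fastest run in ms."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings) * 1e3

def f(x1, x2, x3, x4, x5, x6):
    return x1 + x2

N = 1_000_000
x = [np.random.rand(N) for _ in range(6)]  # six initialized input arrays
print(f"2\t{best_time_ms(f, *x):.2f}")     # terms, best time in ms
```

Taking the minimum rather than the mean is a common choice for this kind of microbenchmark, since it is the value least affected by interference from other processes.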

Results are shown in this plot:

On the x axis, I report the number of parameters that have been actually used in the computation. So x=2 corresponds to function f, x=3 to function g, etc. On the y axis I report the minimum elapsed time as measured on my laptop (Lenovo Thinkpad Edge E540), a 64-bit system with 16 GB of RAM running Linux Mint 19 and GCC 7.3.0.

There are a few things that seem reasonable:

Julia is faster than Python
Python scales badly, as it does not fuse loops
Julia and C++ scale similarly
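
To make the point about loop fusion concrete: NumPy evaluates an expression such as x1 + x2 - x3 one operator at a time, and each operator makes a full pass over the data, with the intermediate result stored in a temporary array. A conceptual sketch (the names `g_unfused` and `t1` are illustrative only):

```python
import numpy as np

def g(x1, x2, x3):
    # What one writes in NumPy
    return x1 + x2 - x3

def g_unfused(x1, x2, x3):
    # What is conceptually executed: one pass per operator
    t1 = np.add(x1, x2)         # first pass, result stored in a temporary
    return np.subtract(t1, x3)  # second pass, reads the temporary

x1, x2, x3 = (np.random.rand(1000) for _ in range(3))
print(np.array_equal(g(x1, x2, x3), g_unfused(x1, x2, x3)))
```

A fused version, like the one Julia produces for @. x1 + x2 - x3, computes x1[i] + x2[i] - x3[i] in a single loop, so adding a term adds arithmetic but no extra pass over memory.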

However, I cannot understand these features:

Julia is significantly faster than C++, even when using -O3 with g++. In order to help C++, I cheated and modified the C++ code so that functions f, g, etc. no longer allocate the vector containing the result, which is instead allocated before the benchmark starts (see the code on GitHub). However, as you can see from the plot, Julia is still the best!
For n = 2, C++ is the slowest solution
The cases for n = 2 and n = 3 show equal times in Julia; this is not a statistical fluctuation, as I repeated the test many times with a varying number of runs. I wonder how this is possible.

Before showing this plot at ADASS, I would really like to understand everything. Can anybody give me some clue?

tkoolen · October 31, 2018, 6:15pm

Not sure about the original question, but for the Julia code you’re not actually initializing the input arrays:

github.com
ziotom78/python-julia-c-/blob/d6a5a1faa3350498b321c9719293d25752112a1d/julia-speed.jl#L13

using Statistics: minimum
using Printf

f(x1, x2, x3, x4, x5, x6) = @. x1 + x2
g(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3
h(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4
i(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5
j(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5 + x6

N = 1000000
x = [Array{Float64}(undef, N) for i in 1:6]

function print_result(num, b)
    best = minimum(b)
    # The member "time" is in ns
    @printf("%d\t%8.2f\t%.2f\n", num, best.time / 1e6, best.memory / 1024^2)
end

println("Terms\tSpeed [ms]\tMemory [MB]")
print_result(2, @benchmark f(x...))
print_result(3, @benchmark g(x...))

This may result in subnormal numbers in your test inputs, which can adversely affect performance. See e.g. the thread “50x speed difference in gemv for different values in vector” (post #3, by StefanKarpinski).
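
For readers unfamiliar with the term: a subnormal (denormal) number is a nonzero float whose magnitude lies below the smallest normal value, and arithmetic on such values can be much slower on many CPUs. A minimal sketch in Python, using only the standard library (the helper `is_subnormal` is a hypothetical name introduced here for illustration):

```python
import struct
import sys

smallest_normal = sys.float_info.min   # about 2.2e-308 for float64
subnormal = smallest_normal / 2**10    # below the normal range, still > 0

def is_subnormal(v):
    """True if v is a nonzero float64 with magnitude below the normal range."""
    return v != 0.0 and abs(v) < sys.float_info.min

# Arbitrary memory contents reinterpreted as a float64: all exponent bits
# zero and a nonzero mantissa give a subnormal value.
raw = struct.pack("<Q", 0x0000_0000_0000_0001)
(from_bits,) = struct.unpack("<d", raw)

print(is_subnormal(1.0), is_subnormal(subnormal), is_subnormal(from_bits))
```

An uninitialized array holds whatever bits happened to be in memory, so bit patterns like the last one can end up in the inputs.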

tkoolen · October 31, 2018, 6:55pm

After initializing the arrays, it seems that Julia is actually closer to C++. On my machine:

C++:

Terms	Speed [ms]
2	0.962909
3	1.80888
4	2.7132
5	3.42612
6	4.05574

Julia before:

Terms	Speed [ms]	Memory [MB]
2	0.82	7.63
3	0.83	7.63
4	0.93	7.63
5	1.21	7.63
6	1.57	7.63

Julia after:

Terms	Speed [ms]	Memory [MB]
2	1.35	7.63
3	2.21	7.63
4	2.87	7.63
5	3.55	7.63
6	4.18	7.63

Also:

I’d change things so that the Julia version also preallocates the result vector, if that’s what’s done in C++.
I’d use -march=native for C++ (maybe, depending on the goal of the benchmark)
I’d use size_t to index into the vectors in C++ (or use an iterator)
Julia could probably be made significantly faster using @simd for. While not as short as the current implementation, it’d be pretty much the same amount of code as C++.
Doing size checks once at the beginning and then using @inbounds will be a lot faster in Julia.

As always, the question is what the objective of the benchmark is, and what’s fair game in terms of tradeoff between optimizations, code readability, and other factors.

davidbp · November 1, 2018, 1:55am

I also tested this, but my Python code is faster than my Julia code:

Python

Terms Speed [ms]
2 0.52
3 0.92
4 1.29
5 1.71
6 2.22

Julia 1.0

Terms Speed [ms]
2    1.77
3    2.18
4    2.56
5    2.95
6    3.39

How can there be such a difference between @Maurizio_Tomasi’s results and the ones I posted?

datnamer · November 1, 2018, 2:16am

I think you have multithreading turned on: Developer Software Forums - Intel Community

davidbp · November 1, 2018, 2:23am

Sure, I have it. Don’t I also have it in Julia by default? I was expecting matrix/vector operations to be using multithreading already in Julia as well.

kristoffer.carlsson · November 1, 2018, 2:42am

Broadcasting is not multithreaded in Julia.

Maurizio_Tomasi · November 1, 2018, 4:07am

You’re right, this was so stupid! I even included Random, but then I forgot to use it in the initialization of x. I’m redoing the benchmark, and I’ll post the new plot soon.

Maurizio_Tomasi · November 1, 2018, 4:43am

Thanks for the suggestions, I implemented them and updated the code on GitHub. I also preallocated the vector in the Python code.

These were excellent suggestions, thanks a lot: this improves the speed of the Julia code and puts it on par with its C++ counterpart (which I compiled with -msse3, in order to use SIMD). However, I feel that the older version still has some value, as it is as succinct as the NumPy code, which is the point of comparison the audience will likely use. Therefore, I have produced two implementations of the Julia benchmark. Here are the updated results:

It seems that for this kind of calculation, plain, naïve Julia scales better than NumPy, and with a bit of effort it can be as performant as C++.

tkf · November 1, 2018, 5:52am

If preallocation is a fair strategy, you can use NumPy’s out parameter, as in numpy.add(x1, x2, out=r), to eliminate all the intermediate arrays; see the numpy.add page of the NumPy v1.23 Manual.
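
As a rough sketch of what this could look like for the six-term function (the name `j_prealloc` and the array size are assumptions made for illustration, not code from the repository):

```python
import numpy as np

def j_prealloc(r, x1, x2, x3, x4, x5, x6):
    """Compute x1 + x2 - x3 + x4 - x5 + x6 into the preallocated array r."""
    np.add(x1, x2, out=r)      # r = x1 + x2
    np.subtract(r, x3, out=r)  # r = r - x3, updated in place
    np.add(r, x4, out=r)
    np.subtract(r, x5, out=r)
    np.add(r, x6, out=r)
    return r

N = 1000
x = [np.random.rand(N) for _ in range(6)]
r = np.empty(N)  # allocated once, before any timing starts
j_prealloc(r, *x)
```

This avoids allocating intermediate arrays, but each call is still a separate pass over the data, so the five operations are not fused into one loop.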

DNF · November 1, 2018, 7:39am

I don’t see @simd making any difference. But you can easily multithread your ‘simd’ code, by replacing @simd with Threads.@threads. Just remember to enable threading first.

That makes a big difference for me.

pkofod · November 1, 2018, 8:42am

It would be interesting to see the result of this, but it still won’t fuse the loops.

Maurizio_Tomasi · November 1, 2018, 9:05am

I wouldn’t go through the Threads route, as I feel that the test would compare apples with oranges. The purpose of the plot is to compare some reasonably simple code written in three languages used by astronomers. Complicating it too much would make it less understandable to the audience.

I have added @inbounds and @simd together, without trying to isolate the effect of each. I will try to run some more tests.

Maurizio_Tomasi · November 1, 2018, 9:08am

This is a nice suggestion, I wasn’t aware of numpy.add’s out parameter. But the purpose of my exercise was to show how easy-to-write code in NumPy can scale badly, while the same code in Julia behaves better. I fear that the code would become too complicated for the audience I’m targeting.

DNF · November 1, 2018, 9:37am

But are you certain that numpy isn’t multithreading your computation?

Liso · November 1, 2018, 9:47am

With even less effort (just add the @numba.jit decorator to the functions) I could get better performance than C++. (It could be just my computer. I am curious about your results!)

It would be better scientific methodology not to have the results before the experiment!

Liso · November 1, 2018, 10:19am

You could also compare the whole workflow. If you run your programs in front of the audience, people would see this:

$ time (g++ `gsl-config --cflags` -O3 -march=native -msse3 `gsl-config --libs` c++-speed.cpp && ./a.out)
...
real    0m4,467s
user    0m4,397s
sys     0m0,070s

$ time python python-speed-numba.py 
...
real    0m11,956s
user    0m11,879s
sys     0m0,057s

$ time julia julia-simd-speed.jl 
...
real    1m17,210s
user    1m17,062s
sys     0m0,344s

It surprised me that even after I “hacked” BenchmarkTools.DEFAULT_PARAMETERS.samples = 1, it still took a long time:

real    0m42,249s
user    0m42,111s
sys     0m0,376s

jtackm · November 1, 2018, 10:44am

Looks more like a constant factor improvement to me, so both seem to scale equally (in contrast to the flatter curves of optimized Julia and C++). In fact, I wouldn’t have expected optimized Julia and C++ to scale better on identical algorithms (SIMD should only improve a constant factor), so maybe there are subtle algorithmic differences. Not sure if pre-allocation alone can cause this difference.

Sounds like comparing g++ to LLVM, possibly with different compiler options? That is the only way I can see numba being faster in a single-threaded setting.

ChrisRackauckas · November 1, 2018, 11:13am

Or it’s just one of the performance bugs for broadcast on v1.0.

github.com/JuliaLang/julia

Broadcasting is much slower than a for loop

opened 08:13PM - 15 Jul 18 UTC

YingboMa

performance regression broadcast simd

Here is a minimal working example.

```julia
julia> using BenchmarkTools

julia> function foo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @. a = b + 0.1 * (0.2c + 0.3d + 0.4e)
           nothing
       end
foo (generic function with 1 method)

julia> function goo(a::Vector{T}, b::Vector{T}, c::Vector{T}, d::Vector{T}, e::Vector{T}) where T
           @assert length(a) == length(b) == length(c) == length(d) == length(e)
           @inbounds for i in eachindex(a)
               a[i] = b[i] + 0.1 * (0.2c[i] + 0.3d[i] + 0.4e[i])
           end
           nothing
       end
goo (generic function with 1 method)

julia> a,b,c,d,e=(rand(1000) for i in 1:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##9#10"))}(getfield(Main, Symbol("##9#10"))(), 1:5)

julia> @btime foo($a,$b,$c,$d,$e)
  1.277 μs (0 allocations: 0 bytes)

julia> @btime goo($a,$b,$c,$d,$e)
  345.568 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0-beta2.12
Commit a878341 (2018-07-15 15:57 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_PKG3_PRECOMPILE = 1
```

github.com/JuliaLang/julia

LICM for pure functions

opened 12:58PM - 20 Sep 18 UTC

SimonDanisch

optimizer feature

Julia 1.0: I just realized, that this is a gotcha one easily runs into, especially when using the `@.` macro:

```Julia
a = rand(1000, 1000)
c = 1.0
out = similar(a)
julia> @btime $(out) .= $a .- sin.($c);
  5.695 ms (0 allocations: 0 bytes)
julia> @btime $(out) .= $a .- sin($c);
  495.669 μs (0 allocations: 0 bytes)
```

This likely happens because the compiler can't infer that sin is pure. I realize, with having access to the call tree in the new lazy broadcast, we could solve this for a predefined set of functions. First trick could be to just overload `broadcasted` for known signatures:

```Julia
Base.Broadcast.broadcasted(::typeof(sin), x::Number) = sin(x)
@btime $out .= $a .+ sin.($c)
```

this solves the problem for a chosen set of functions. We could also consider, if we introduce a purity trait to make this easier for multiple argument functions:

```Julia
broadcasted(f, args...) = broadcasted(IsPure(f), f, args...)
broadcasted(::Pure{true}, f, args...) = f(args...) # should probably not get applied to arrays
broadcasted(::Pure{false}, f, args...) = Broadcasted(f, args...)
```

I guess this has been discussed before, but I couldn't really find an issue about it...

Liso · November 1, 2018, 12:06pm

Ah, sorry, it was faster only by “epsilon”, so I would rather say that all 3 versions (C++, python-numba and Julia-simd) showed equal (or very comparable) performance.

The comparable Julia result is the one from julia-simd-speed.jl, which uses a for loop:

function f(r, x1, x2, x3, x4, x5, x6, x7, x8)
    @inbounds @simd for i in eachindex(x1)
        r[i] = x1[i] + x2[i]
    end
end
