I am going to present Julia at the next ADASS (http://adass2018.astro.umd.edu/), and I would like to show its ability to fuse broadcasted operations like `.+` and `.*`. I have found some weird results, so I would like to ask if you can help me understand what's going on.

I created three programs in Julia 1.0, Python 3 + NumPy, and C++. In each program I run the same simple computations on large arrays, with an increasing number of parameters. Here are the functions as defined in Julia:

```julia
f(x1, x2, x3, x4, x5, x6) = @. x1 + x2
g(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3
h(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4
i(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5
j(x1, x2, x3, x4, x5, x6) = @. x1 + x2 - x3 + x4 - x5 + x6
```

Python functions are defined similarly, using NumPy arrays:

```python
def f(x1, x2, x3, x4, x5, x6):
    return x1 + x2

# and so on
```
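As far as I understand NumPy's evaluation model (this is my own reading, not something encoded in the benchmark scripts), each binary operator allocates a full temporary array, so `x1 + x2 - x3` makes two separate passes over memory. A minimal sketch of what happens under the hood:

```python
import numpy as np

x1, x2, x3 = (np.ones(1_000_000) for _ in range(3))

# What `x1 + x2 - x3` does: two passes, one temporary array.
tmp = x1 + x2       # first pass, allocates a 1M-element temporary
result = tmp - x3   # second pass, allocates the final array

# One way to avoid the temporary (still two passes over memory,
# but no extra allocation) is to reuse a buffer via `out=`:
out = np.empty_like(x1)
np.add(x1, x2, out=out)
np.subtract(out, x3, out=out)

assert np.array_equal(result, out)
```

This is why I expect Python's running time to grow with the number of operands: every extra operand adds both a pass and an allocation.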

When each function is called, the six parameters are large arrays with one million elements each. Each function is executed many times, and the minimum elapsed time is saved to a text file. The source code of the Python, Julia, and C++ versions is available on GitHub at https://github.com/ziotom78/python-julia-c- (speed comparison among Python/NumPy, Julia, and C++).
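For reference, the timing procedure can be sketched like this (a hedged reconstruction of the Python side; the actual scripts on GitHub may differ in details such as the number of repetitions):

```python
import timeit
import numpy as np

N = 1_000_000
arrays = [np.random.rand(N) for _ in range(6)]

def f(x1, x2, x3, x4, x5, x6):
    return x1 + x2

# Run the function many times and keep the minimum, which filters
# out noise from the OS scheduler and CPU frequency scaling.
times = timeit.repeat(lambda: f(*arrays), number=10, repeat=5)
best = min(times) / 10  # seconds per call

assert best > 0.0
```

Taking the minimum rather than the mean is a deliberate choice: the minimum is the measurement least contaminated by interference from other processes.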

Results are shown in this plot:

On the x axis I report the number of parameters actually used in the computation: `x=2` corresponds to function `f`, `x=3` to function `g`, etc. On the y axis I report the minimum elapsed time as measured on my laptop (Lenovo ThinkPad Edge E540), a 64-bit system with 16 GB of RAM running Linux Mint 19 and GCC 7.3.0.

There are a few things that seem reasonable:

- Julia is faster than Python
- Python scales badly, as it does not fuse loops
- Julia and C++ scale similarly

However, I cannot understand these features:

- Julia is significantly faster than C++, even when compiling with `g++ -O3`. To help C++, I cheated and modified the C++ code so that functions `f`, `g`, etc. no longer allocate the vector containing the result, which is instead allocated before the benchmark starts (see the code on GitHub). However, as you can see from the plot, Julia is still the fastest!
- For `n = 2`, C++ is the slowest solution.
- The cases `n = 2` and `n = 3` show equal times in Julia; this is not a statistical fluctuation, as I repeated the test many times with a varying number of runs. I wonder how this is possible.
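One diagnostic I have been considering for the last point (my own suggestion, not part of the benchmark) is to convert the timings into effective memory bandwidth: if the figures for `n = 2` and `n = 3` both sit near the machine's RAM bandwidth, the benchmark is memory-bound and the extra arithmetic costs essentially nothing. A sketch of the conversion, with placeholder timings rather than my measured values:

```python
N = 1_000_000
BYTES_PER_FLOAT = 8  # Float64 / double

def effective_bandwidth_gbs(elapsed_s, n_input_arrays):
    """GB/s moved, counting each input array read once and the
    output array written once (ignores cache effects)."""
    total_bytes = (n_input_arrays + 1) * N * BYTES_PER_FLOAT
    return total_bytes / elapsed_s / 1e9

# Hypothetical timings (placeholders, not measured values):
t2 = 1.0e-3  # seconds per call of f (2 inputs)
t3 = 1.0e-3  # seconds per call of g (3 inputs)

print(effective_bandwidth_gbs(t2, 2))  # 24.0 GB/s for these placeholders
print(effective_bandwidth_gbs(t3, 3))  # 32.0 GB/s
```

If the `n = 3` case implied a bandwidth above what the memory subsystem can deliver, that would hint the simple "one read per input" model is wrong (e.g. some inputs stay in cache).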

Before showing this plot at ADASS, I would really like to understand everything. Can anybody give me a clue?