Nbabel nbody integrator speed up

Oh, I had a problem installing nightly Rust with rustup, because it does not seem compatible with the Rust installed via apt, and I’d like to keep that Rust for other stuff. And then I was too lazy to do it in a Docker container :slight_smile:

1 Like

The greatest difference between the “original” Julia version and the others is that the loop that computes the accelerations was not performed in the same way as in all the other versions. In this post I fixed that. In that version I also use a Julia package called StaticArrays, which is useful when one is dealing with many small arrays and wants them to be stack-allocated. That is not really comparable, but it is just using a package, much as one has to use numpy properly. The rest is just the natural way to write code in Julia. Of course, each of us, being more familiar with one language or another, will find some things more natural than others. In the final version I don’t use the StaticArrays package and manually define the 4-field struct, but the natural way to go would be to use it.

In the end, the function that computes the accelerations (and takes most of the time) is very easy to read and does not contain anything fancy, I think:

using LinearAlgebra: norm  # provides norm(dr)

function compute_acceleration!(positions, masses, accelerations)
    N = length(positions)

    # reset the accumulators (the zero of whatever element type is stored)
    for i in eachindex(accelerations)
        accelerations[i] = zero(eltype(accelerations))
    end

    # visit each pair (i, j) only once, updating both particles
    for i = 1:N - 1
        for j = i + 1:N
            dr = positions[i] - positions[j]
            rinv3 = 1 / norm(dr)^3
            accelerations[i] -= masses[j] * rinv3 * dr  # pull of j on i
            accelerations[j] += masses[i] * rinv3 * dr  # equal and opposite
        end
    end
    return nothing
end

The “strange” things there are maybe the eachindex, which can be replaced by 1:length(accelerations), and the zero(eltype(accelerations)), which could be replaced by something syntactically simpler. Those are not for performance but for generality: that same function can be used for “points” of any dimension, for example, provided the appropriate function defining the zero of such a point is defined. Actually, it can be used for any kind of object (the term does not really apply here) for which the programmer defines the zero, the norm, subtraction, and multiplication by a scalar.
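For readers more at home in Python, here is a plain-Python sketch of the same triangular loop (my own illustration, not code from the repository; names are mine):

```python
import math

def compute_accelerations(positions, masses):
    """Gravitational accelerations (G = 1), visiting each pair once.

    Uses the antisymmetry of the pairwise force to do N*(N-1)/2
    iterations instead of N*(N-1), mirroring the Julia loop above.
    """
    n = len(positions)
    accelerations = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            dr = [positions[i][k] - positions[j][k] for k in range(3)]
            rinv3 = 1.0 / math.sqrt(sum(c * c for c in dr)) ** 3
            for k in range(3):
                accelerations[i][k] -= masses[j] * rinv3 * dr[k]  # pull of j on i
                accelerations[j][k] += masses[i] * rinv3 * dr[k]  # equal and opposite
    return accelerations
```

Note that particle i is accelerated by the mass of j, and vice versa; mixing up those two indices is an easy bug to introduce when writing the symmetric update.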

2 Likes

This isn’t an example of advanced high-performance Julia code, except for the @simd macros and knowing to use length 4 rather than 3. It reads pretty clean and simple to me (I mean, it’s good code!): just functions and for-loops. This is also the kind of simple code you should be able to get running fast with numpy, as there isn’t much that the compiler would do differently.
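The length-4 trick mentioned above is just padding each 3-vector with a zero fourth component (a small sketch of my own, assuming float64 data): distances are unchanged, but each particle then occupies a power-of-two number of bytes, which vectorizes more cleanly.

```python
import numpy as np

pos3 = np.array([[1.0, 2.0, 2.0]])   # one particle, 3 components
pos4 = np.zeros((pos3.shape[0], 4))  # padded copy with a zero 4th component
pos4[:, :3] = pos3

# The zero pad does not change any norm or distance...
assert np.linalg.norm(pos3[0]) == np.linalg.norm(pos4[0]) == 3.0
# ...but each particle now spans 32 bytes instead of 24.
print(pos3[0].nbytes, pos4[0].nbytes)  # 24 32
```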

In more advanced high-performance Julia you will see much more use of the type system to pass information to the compiler, maybe @generated functions, and things like the LoopVectorization.jl @avx macro. But as someone who also writes high-performance C++: Julia is much easier.

5 Likes

Concerning the benchmarks: as with every comparison of this kind, one has to be careful not to compare things improperly. It does not make much sense that a C++ implementation of this is 40% slower than a Fortran implementation of the same thing, and it likewise does not make much sense that any Julia implementation is faster than both.

I have myself fallen into that trap recently, thinking that a Julia code was faster than my own Fortran implementation of the same thing, until an advanced Julia user pointed out that I was simply using the wrong compiler flags when compiling the Fortran code. With that fixed, the codes ran at the same speed.

There is not much that can be improved in these implementations without going to parallel or CUDA versions, thus it does not make much sense that any of the alternatives (Julia, C++, Fortran, Java) is much slower than the others.

It is also natural that the experienced Python developer who implemented the Python version produced a better implementation than the sub-optimal implementations in the other languages. I could not implement the same thing in Python as of today.

That being said, of course one would expect the Python version to be slower, and the efficiency that @paugier gets with Transonic shows what a good job they are doing in bringing fast computing to Python. Kudos to them! That certainly puts pressure on all the other languages that want to occupy that space! :slight_smile:

1 Like

It is also natural that the experienced Python developer who implemented the Python version produced a better implementation than the sub-optimal implementations in the other languages. I could not implement the same thing in Python as of today.

@lmiq I didn’t write the C++ and Fortran implementations. These implementations have been written by C++ and Fortran developers and used in the paper mentioned in the README (GitHub - paugier/nbabel). They come from the website https://www.nbabel.org/.

It would be really simple, even for someone like me who is not a great C++ dev, to improve the C++ version. But that would not be very interesting. Of course, C++ can be very fast.

If a Python implementation using Transonic-Pythran is faster than a C++ implementation, it just means that the C++ implementation is not optimal, because, behind the scenes, a pure C++ implementation (without interaction with CPython) is generated automatically from the Python code by Pythran. It is just much more convenient to write it in Python-Numpy than in C++.

Regarding the compiler flags, they are all the same for the different implementations.

I vaguely remember that the C++ implementation uses N^2 operations in the compute_acceleration function, while the current Julia and Python implementations use only N^2/2. Since compute_acceleration dominates these calculations, it is no surprise that the C++ and Fortran versions perform worse than the Julia and Python ones.
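The difference is easy to quantify (a plain-Python illustration of my own): the full double loop visits every ordered pair, while the triangular loop visits each unordered pair once, exactly half the work per call.

```python
def pair_visits_full(n):
    # full double loop: every ordered pair (i, j) with i != j  ->  n*(n-1)
    return sum(1 for i in range(n) for j in range(n) if i != j)

def pair_visits_triangular(n):
    # triangular loop: each unordered pair once  ->  n*(n-1)/2
    return sum(1 for i in range(n - 1) for j in range(i + 1, n))

print(pair_visits_full(4), pair_visits_triangular(4))  # 12 6

# For a 16384-particle input: 16384 * 16383 = 268419072 ordered pairs
# versus 134209536 unordered pairs per acceleration evaluation.
```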

I just took a look at the Fortran implementation, and it uses some sort of “large” integer and float representations (no idea why). Changing to standard precision appears to have improved the time a bit. Also, the acceleration function there computes the total potential energy as well, while ours (the Python and Julia versions) do not. So that is “unfair” as well; commenting out that line brings the time down another 10%.

Anyway, as I understand it, the purpose of the repository is to show that Transonic-Pythran can make a Python implementation, which is expected to be slow, run on the same order as the implementations in languages which are expected to be fast. That is clearly demonstrated.

What I do not think is that the remaining differences can be attributed to the languages.

ps. I still cannot run the Python version here; I get these errors now:

conda create -n env_py python pythran transonic pandas
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - pythran
  - transonic

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

1 Like

It could be

conda create -n nbabel
conda activate nbabel
conda install python=3
conda install -c conda-forge pythran transonic pandas

No offense, but I miss Manifest.toml already :slight_smile:

2 Likes

Interestingly enough, I’ve got opposite results.

julia nbabel3_16.jl ../data/input16k  131.16s user 0.54s system 100% cpu 2:11.20 total

python bench.py ../data/input16k 0.2  180.51s user 0.41s system 100% cpu 3:00.36 total

so YMMV. I used another Julia implementation, though: NBabel with 4th dimension for simd calculations. · GitHub

1 Like

Now I get:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run...

I followed the steps mentioned in the error message, but still nothing… Python is too complicated for me :rofl:

It should be something like

source <path to your conda installation>

In my case for example

source ~/miniconda3/etc/profile.d/conda.sh

P.S.: I hope admins of this forum wouldn’t be too angry with this python discussion.

2 Likes

I would run conda config --add channels conda-forge once, and then:

conda create -n nbabel python pythran transonic pandas
conda activate nbabel

We have an environment.yaml for conda and a pyproject.toml, but I didn’t use them for such a simple setup.

Don’t worry. I also struggle with Julia installations! In particular, there are many Julia scripts/snippets on the web without installation instructions.

1 Like

I see 4 possibilities:

  • I use clang for Pythran and I guess you didn’t use that.
$ cat ~/.pythranrc                                                                                                                                                          
[pythran]
complex_hook = True
[compiler]
CXX=clang++-10
CC=clang-10
blas=openblas
  • your Julia version is faster than the one in the repository. If that is the case, a PR is really welcome!

  • Hardware dependency

  • My computer prefers Python and yours prefers Julia just because they know their user.

That is why double-blind, placebo trials are necessary.

No luck there, still the same error (the other two commands installed a bunch of things and did not report anything wrong). This one keeps saying CommandNotFoundError: Your shell has not been properly configured to use 'conda activate', but running conda init bash as indicated does not work (even after restarting the shell, as suggested). :frowning:

Hello,

I wrote a very simple and naive Julia implementation: https://github.com/paugier/nbabel/blob/master/julia/nbabel_naive.jl

The goal is to have something as simple and directly comparable to https://github.com/paugier/nbabel/blob/master/py/bench_numpy_highlevel.py

Can anyone tell me if there is something completely silly/bad in this Julia code?

I don’t see anything clearly bad, although I cannot test it now. (The thing we test for is type instabilities, which can make a huge difference, but I don’t see any at first sight.)

Whether the codes are comparable depends on what exactly the data structure is when you do this:

   positions = df.loc[:, ["x", "y", "z"]].values.copy()

In the Julia code you are using mutable arrays to represent all the data. If the data there is an array of immutable vectors, it is likely that performance will be better, and that can be done in Julia with minor changes as well.

(I am not sure whether the simd and fast-math flags improve anything there; I would leave them out for simplicity, keeping only the inbounds one. The effectiveness of SIMD operations in particular is very dependent on the data structure, and won’t be great with (3, N) arrays.)
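The (3, N) versus (N, 3) layout point can be made concrete with a small NumPy check (my illustration, not code from the repository): in a C-ordered (N, 3) array one particle’s coordinates are adjacent in memory, while in a (3, N) array they sit a full row apart.

```python
import numpy as np

# C-ordered (N, 3): the x, y, z of one particle are consecutive float64s,
# the friendly layout for per-particle vectorized operations.
pos_rows = np.zeros((8, 3))
assert pos_rows.flags["C_CONTIGUOUS"]
print(pos_rows[0].strides)     # (8,): consecutive 8-byte doubles

# (3, N): particle 0's coordinates are one row (N * 8 bytes) apart.
pos_cols = np.zeros((3, 8))
print(pos_cols[:, 0].strides)  # (64,): strided access
```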

Thanks a lot!

the thing we test is if there are type-instabilities

Is there a page where I can learn how to do that in Julia?

positions = df.loc[:, ["x", "y", "z"]].values.copy()

creates a C-order contiguous N×3 array (mutable; in memory layout equivalent to Array{Float64}(undef, 3, N), I guess).

The main tool is @code_warntype; there’s a discussion of basic usage within the performance tips.

Is that the same as an np.array of shape (N, 3)? Why not use this?

In the python version you do, for example,

position0 = positions[index_p0]

What is going on there? Is that identical to copying the data into a mutable np.array of length 3, and is it implicit that you are copying along the other dimension? Why do this copy at all in this case, instead of referring to the 3 indices inside the loop?

I think I should have used this (more explicit, but same result):

positions = np.ascontiguousarray(df.loc[:, ["x", "y", "z"]].values)

Regarding

position0 = positions[index_p0]

it creates a “view” without copying. positions[index_p0] is equivalent to positions[index_p0, :]; position0 and positions share the same data:

In [1]: import numpy as np                                                      

In [2]: a = np.arange(6).reshape((3, 2))                                        

In [3]: a                                                                       
Out[3]: 
array([[0, 1],
       [2, 3],
       [4, 5]])

In [4]: b = a[0]                                                                

In [5]: b[:] = 1                                                                

In [6]: a                                                                       
Out[6]: 
array([[1, 1],
       [2, 3],
       [4, 5]])

Why would you do this copy at all in this case instead of referring to the 3 indexes inside the loop?

It is shorter, more readable, more general (it works for N-dimensional arrays), and there is no copy. I guess Python-Numpy is designed with the belief that too many loops and indices can decrease readability and maintainability.

But now I realize that the behavior is different in Julia and that one needs the @view macro to do that…

julia> a = reshape(collect(0:5), (2, 3))
2×3 Array{Int64,2}:
 0  2  4
 1  3  5

julia> b = a[:, 1];

julia> b[:] .= 1;

julia> a
2×3 Array{Int64,2}:
 0  2  4
 1  3  5

julia> v = @view a[:, 1];

julia> v[:] .= 1;

julia> a
2×3 Array{Int64,2}:
 1  2  4
 1  3  5

Good to know! :slight_smile:

1 Like