Julia not 220 times faster than C++

I have a small piece of code that I actually use in my simulation, and just compared the performance between C++ and Julia. One function call needs about 7ns with Julia and 1550 ns with C++ and Eigen3.
Why is C++ so slow? Am I doing something wrong?

using StaticArrays, Parameters, BenchmarkTools, LinearAlgebra

@with_kw mutable struct KPS{T}
    v_wind_tether::T =    zeros(3)
    v_apparent::T =       zeros(3)
    last_tether_drag::T = zeros(3)  
end

const kps = KPS{MVector{3, Float64}}()

function calc_drag!(s::KPS, v_segment, unit_vector, rho, v_app_perp, area)
    s.v_apparent .= s.v_wind_tether - v_segment
    v_app_norm = norm(s.v_apparent)
    v_app_perp .= s.v_apparent .- s.v_apparent â‹… unit_vector .* unit_vector
    s.last_tether_drag .= -0.5 * rho * norm(v_app_perp) * area .* v_app_perp
    v_app_norm
end

@benchmark calc_drag!(kps, v_segment, unit_vector, rho, v_app_perp, area) setup=(
     v_segment=MVector(0.0, 0, 0); unit_vector=MVector(0.0,1,0); rho=1.0; 
     v_app_perp=MVector(10.0,0,0); area=0.3)

And the C++ version:

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

using namespace std;
using Eigen::Vector3d;
using namespace std::chrono;

struct KPS {
    Vector3d v_wind_tether;
    Vector3d v_apparent;
    Vector3d last_tether_drag;
} ;

KPS kps;

double calc_drag(KPS s, Vector3d v_segment, Vector3d unit_vector, double rho, Vector3d v_app_perp, double area)
{
    s.v_apparent = s.v_wind_tether - v_segment;
    double v_app_norm = s.v_apparent.norm();
    v_app_perp = s.v_apparent - s.v_apparent.dot(unit_vector) * unit_vector;
    s.last_tether_drag = -0.5 * rho * v_app_perp.norm() * area * v_app_perp;
    return v_app_norm;
}

int main()
{
    // std::cout << kps.v_wind_tether << std::endl;
    Vector3d v_segment(0,0,0); Vector3d unit_vector(0,1,0); double rho = 1.0;
    Vector3d v_app_perp(10,0,0); double area = 0.3;
    auto start = high_resolution_clock::now();
    for(int i=0; i < 1000; i++) {
        calc_drag(kps, v_segment, unit_vector, rho, v_app_perp, area);
    }
    auto stop = high_resolution_clock::now();
    cout << "calc_drag needs "
         << ((stop - start).count())/1000
         << " ns." << endl;
}

And the file CMakeLists.txt:

cmake_minimum_required(VERSION 3.0)

project(Tutorial)

find_package (Eigen3 3.3 REQUIRED NO_MODULE)

add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries (${PROJECT_NAME} Eigen3::Eigen)

My problem is, if I tell this on a scientific conference people won’t believe me.

6 Likes

I have never used CMakeLists.txt for long (started disliking it very soon), do it guarantees that the C++ code is being compiled with -O3?

I think that you could improve the C++ code by passing const ref vec3d Eigen vectors.

If I remember well you can also avoid temp vector (vec_apparent) by applying norm method to the difference expression. Something like (v1-v2).norm()

You have to configure with -DCMAKE_BUILD_TYPE=Release, which I believe is not the default.

1 Like

Some observations:

  • The Julia and C++ versions of calc_drag() are not equivalent. The Julia version receives s, v_segment and friends by reference, while the C++ version receives them by value. Hence, the modifications within calc_drag() for the C++ version have no effect and it merely computes and returns s.v_apparent.norm().

  • Since the code is a single source file without any libraries to link (Eigen is a header-only library) you don’t really need CMake to build it. If you know the location of the Eigen headers it’s a single line of calling g++ to compile it to optimized code: g++ -o t2 -O3 -march=native -I/usr/include/eigen3 t2.cc.

  • The C++ time you report seems to indicate it was compiled without optimization, which is the default for GCC if you don’t pass any -O flags (as CMake won’t do by default I believe). On my PC here (Core™ i5-4460 CPU @ 3.20GHz, GCC 11.2.0) turning on optimization makes the time per step go to zero:

    melis@juggle 08:56:~$ g++ -o t2o -I/usr/include/eigen3  t2o.cc 
    melis@juggle 08:57:~$ ./t2o
    calc_drag needs 3769 ns.
    melis@juggle 08:57:~$ g++ -o t2o -O3 -march=native -I/usr/include/eigen3  t2o.cc 
    melis@juggle 08:57:~$ ./t2o
    calc_drag needs 0 ns.
    

    That is goes to zero is probably due to most of the code in calc_drag() having no effect (due to passing by value) and getting optimized out.

  • From what I read here std::chrono::high_resolution_clock isn’t the best clock source for these kind of measurements, as it is system-dependent which source is used and std::chrono::steady_clock is more appropriate. I have no experience with these classes, but assume that that info is correct.

All in all, if I update your original code to pass the parameters by reference and switch to the other clock class (see full code below) I get these timings:

melis@juggle 09:02:~$ ./t3
calc_drag needs 6.27421 ns.
melis@juggle 09:02:~$ ./t3
calc_drag needs 4.50141 ns.
melis@juggle 09:02:~$ ./t3
calc_drag needs 6.71724 ns.
melis@juggle 09:02:~$ ./t3
calc_drag needs 6.04979 ns.
melis@juggle 09:03:~$ ./t3
calc_drag needs 5.79616 ns.
melis@juggle 09:03:~$ ./t3
calc_drag needs 5.04113 ns.
melis@juggle 09:03:~$ ./t3
calc_drag needs 6.7191 ns.

So quite a bit of variance, but these at least appear to be in the same ballpark as what you report for the Julia version, which is what I would expect for such simple code.

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

using namespace std;
using Eigen::Vector3d;
using namespace std::chrono;

struct KPS {
    Vector3d v_wind_tether;
    Vector3d v_apparent;
    Vector3d last_tether_drag;
} ;

KPS kps;

double calc_drag(KPS& s, Vector3d& v_segment, Vector3d& unit_vector, double rho, Vector3d& v_app_perp, double area)
{
    s.v_apparent = s.v_wind_tether - v_segment;
    double v_app_norm = s.v_apparent.norm();
    v_app_perp = s.v_apparent - s.v_apparent.dot(unit_vector) * unit_vector;
    s.last_tether_drag = -0.5 * rho * v_app_perp.norm() * area * v_app_perp;
    return v_app_norm;
}

int main()
{
    // std::cout << kps.v_wind_tether << std::endl;
    Vector3d v_segment(0,0,0); Vector3d unit_vector(0,1,0); double rho = 1.0;
    Vector3d v_app_perp(10,0,0); double area = 0.3;
    auto start = steady_clock::now();
    const int N = 100000;
    double res;
    for(int i=0; i < N; i++) {
        res += calc_drag(kps, v_segment, unit_vector, rho, v_app_perp, area);
    }
    auto stop = steady_clock::now();
    auto duration = stop - start;
    cout << "calc_drag needs "
         << (1.0f * duration_cast<nanoseconds>(duration).count() / N)
         << " ns." << endl;
}
9 Likes

Thank you very much!

But should’t it be

kps&,

instead of

?

Uh, no? The C++ syntax for passing by reference is (<type>& <variable>, ...), e.g. KPS& s. Or do you mean something else?

1 Like

OK, thank you for freshening up my rusted C++ knowledge!
Was programming too much in Julia in the last time…

My latest CPP code:

#include <iostream>
#include <Eigen/Dense>
#include <chrono>

using namespace std;
using Eigen::Vector3d;
using namespace std::chrono;

struct KPS {
    Vector3d v_wind_tether;
    Vector3d v_apparent;
    Vector3d last_tether_drag;
} ;

KPS kps;

double calc_drag(KPS& s, Vector3d& v_segment, Vector3d& unit_vector, double rho, Vector3d& v_app_perp, double area)
{
    s.v_apparent = s.v_wind_tether - v_segment;
    double v_app_norm = s.v_apparent.norm();
    v_app_perp = s.v_apparent - s.v_apparent.dot(unit_vector) * unit_vector;
    s.last_tether_drag = -0.5 * rho * v_app_perp.norm() * area * v_app_perp;
    return v_app_norm;
}

int main()
{
    // std::cout << kps.v_wind_tether << std::endl;
    Vector3d v_segment(0.1,0.2,0.3); Vector3d unit_vector(0.0,1,0); double rho = 1.0;
    Vector3d v_app_perp(10,0,0); double area = 0.3;
    double v_app_norm;
    auto start = high_resolution_clock::now();
    for(int i=0; i < 100000; i++) {
       v_app_norm = calc_drag(kps, v_segment, unit_vector, rho, v_app_perp, area);
    }
    auto stop = high_resolution_clock::now();
    cout << "calc_drag needs "
         << ((stop - start).count())/100000
         << " ns." << endl;
}

And my CMakeLists.txt:

cmake_minimum_required(VERSION 3.0)

project(Tutorial)

find_package (Eigen3 3.3 REQUIRED NO_MODULE)

add_executable(${PROJECT_NAME} "main.cpp")
target_link_libraries (${PROJECT_NAME} Eigen3::Eigen)

if(NOT CMAKE_BUILD_TYPE)
  set(CMAKE_BUILD_TYPE Release)
endif()

set(CMAKE_CXX_FLAGS "-Wall -Wextra")
set(CMAKE_CXX_FLAGS_DEBUG "-g")
set(CMAKE_CXX_FLAGS_RELEASE "-O2")

Which gives an execution time of 5ns with -O2 optimization level, the same level I used for Julia.

But measuring these very short times is tricky on C++.

2 Likes

But what I do not understand: Why is :

   s.last_tether_drag = -0.5 * rho * v_app_perp.norm() * area * v_app_perp;

not allocating any memory? In Julia we need .= to achieve an non-allocating assignment. Is this not needed in C++ ?

C++ supports overloading = and Eigen takes advantage of that to avoid allocations.

3 Likes