Trying to understand low performance compared to C++

With the toy problem below of approximating Pi by sampling a square I get an almost 4x lower performance with Julia than wit the C++ version. The code is very simple, just two nested 2 loops, no indexing, no type instabilities and any allocations come from the @printf’s at the end.

function go(n::Int32)

    S::Float64 = 1.0 / n
    inside::Int64 = 0

    t0 = time()

    i::Int32 = 0
    while i < n
        x = (i+0.5)*S
        x2 = x*x
        j::Int32 = 0
        while j < n
            y = (j+0.5)*S
            y2 = y*y
            if x2 + y2 <= 1.0
                inside += 1
            end
            j += 1
        end
        i += 1
    end

    t1 = time()
    tdiff = t1 - t0

    @printf "sampled %d points in %.3fs\n" n*n tdiff
    @printf "avg %.1f Mpoints/s\n" (n*n)/(tdiff*1000000)
    @printf "%ld inside\n" inside
    @printf "pi ≈ %.16f\n" 4*inside/(n*n)

end

For comparison the C++ version (which came first):

// g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp 
#include <cstdio>
#include <cstdlib>
#include <stdint.h>
#include <sys/time.h>

double time()
{
    timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec/1000000.0;
}

int 
main(int argc, char *argv[])
{
    const int N = atoi(argv[1]);
    double S = 1.0 / N;

    double t0 = time();

    int inside = 0;
    for (int64_t i = 0; i < N; i++)
    {
        double x = (i+0.5)*S;
        double x2 = x*x;
        for (int j = 0; j < N; j++)
        {
            double y = (j+0.5)*S;
            double y2 = y*y;
            if (x2 + y2 <= 1.0)
                inside++;
        }
    }

    double t1 = time();
    double tdiff = t1 - t0;

    printf("sampled %ld points in %.3fs\n", (uint64_t)N*N, tdiff);
    printf("avg %.1f Mpoints/s\n", ((uint64_t)(N)*N)/(tdiff*1000000));
    printf("%ld inside\n", inside);
    printf("pi ≈ %.16lf\n", (4.0*inside)/((uint64_t)(N)*N));
}

Performance comparison:

julia> @btime go(100_000)
sampled 10000000000 points in 6.463s
avg 1547.2 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.461s
avg 1547.8 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.433s
avg 1554.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
  6.433 s (25 allocations: 1.73 KiB)
melis@blackbox 10:13:~/concepts/pi-burn$ g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp
melis@blackbox 10:13:~/concepts/pi-burn$ ./pi-sample-square 100000
sampled 10000000000 points in 1.840s
avg 5433.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

On a hunch I compiled the C++ code with clang and that gives very similar performance to the Julia results, so it might have something to do with LLVM?

melis@blackbox 10:17:~/concepts/pi-burn$ clang++ -march=native -O3 -o pi-sample-square-clang pi-sample-square.cpp 
melis@blackbox 10:16:~/concepts/pi-burn$ ./pi-sample-square-clang 100000
sampled 10000000000 points in 6.571s
avg 1521.8 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

Version info:

julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

julia> Sys.CPU_NAME
"skylake"

GCC 14.2.1, Clang 18.1.8

Edit: it turns out I screwed up the C++ code I originally included, I updated the version above to use int for i and j, instead of int64_t as originally

2 Likes

Just my 2c: the Julia version has while loops, which are less prone to be simd’d. Have you tried with for loops instead?

Could be because of the -ffp-contract compiler option. For GCC it defaults to fast, while Clang is more conservative. Try setting the same option for both compilers, e.g., -ffp-contract=off or -ffp-contract=fast.

They look pretty similar on my laptop:

julia> go(100_000)
sampled 10000000000 points in 9.325s
avg 1072.4 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

shell> g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp

shell> ./pi-sample-square 100000
sampled 10000000000 points in 9.874s
avg 1012.8 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

shell> clang++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp

shell> ./pi-sample-square 100000
sampled 10000000000 points in 9.399s
avg 1064.0 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, haswell)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

The factor of 4 you mention sounds like autovectorisation kicking in or not, which will depend on the specific CPU and compiler.

Yeah, I screwed up the C++ code I included, after I noticed I could not reproduce the 4x faster run anymore. By changing the i and j loop variables to int (instead of int64_t) I get back the performance I originally reported. The C++ code above has been updated.

I’ll write a cleaner followup, as I’m still seeing a difference with Julia that I can’t get rid of.

For what it’s worth, rewriting your code in a more standard julia style gave me a ~4x performance improvement on my machine:

function go2(n)
    S² = 1/n^2
    inside::Int = 0
    tdiff = @elapsed begin
        for i ∈ 1:n
            x² = (i + 0.5)^2*S²
            for j ∈ 1:n
                y² = (j+0.5)^2*S²
                inside += (x² + y² ≤ 1)
            end
        end
    end
    @printf "sampled %d points in %.3fs\n" n*n tdiff
    @printf "avg %.1f Mpoints/s\n" (n*n)/(tdiff*1000000)
    @printf "%ld inside\n" inside
    @printf "pi ≈ %.16f\n" 4*inside/(n*n)
end
julia> go(Int32(100_000))
sampled 1410065408 points in 6.071s
avg 232.2 Mpoints/s
7853981733 inside
pi ≈ 22.2797657142440855

julia> go2(100_000)
sampled 10000000000 points in 1.296s
avg 7715.9 Mpoints/s
7853781734 inside
pi ≈ 3.1415126936000002

Wouldn’t be surprised if there were other performance improvemnts on the table here, but that’s what I see just from casting this in a more standard form.

1 Like

First, the C++ version. As mentioned above, changing i and j to int (instead of int64_t) has a big impact:

melis@blackbox 14:08:~/concepts/pi-burn$ diff -ur pi-sample-square.cpp pi-sample-square2.cpp
--- pi-sample-square.cpp	2024-10-02 14:08:37.523744725 +0200
+++ pi-sample-square2.cpp	2024-10-02 14:04:35.754410457 +0200
@@ -1,4 +1,4 @@
-// g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp
+// g++ -march=native -O3 -o pi-sample-square2 pi-sample-square2.cpp
 #include <cstdio>
 #include <stdint.h>
 #include <stdlib.h>
@@ -14,18 +14,18 @@
 int 
 main(int argc, char *argv[])
 {
-    const int64_t N = atoi(argv[1]);
+    const int N = atoi(argv[1]);
     double S = 1.0 / N;
 
     double x, x2, y, y2;
     double t0 = time();
     int64_t inside = 0;
 
-    for (int64_t i = 0; i < N; i++)
+    for (int i = 0; i < N; i++)
     {
         x = (i+0.5)*S;
         x2 = x*x;
-        for (int64_t j = 0; j < N; j++)
+        for (int j = 0; j < N; j++)
         {
             y = (j+0.5)*S;
             y2 = y*y;

melis@blackbox 14:08:~/concepts/pi-burn$ g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp
melis@blackbox 14:08:~/concepts/pi-burn$ g++ -march=native -O3 -o pi-sample-square2 pi-sample-square2.cpp

melis@blackbox 14:08:~/concepts/pi-burn$ ./pi-sample-square 100000
sampled 10000000000 points in 6.833s
avg 1463.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

melis@blackbox 14:09:~/concepts/pi-burn$ ./pi-sample-square2 100000
sampled 10000000000 points in 1.863s
avg 5367.0 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

There is indeed a difference, but it does not appear to be that large (with all the disclaimers of needing repeated runs to get an average and estimate of variance, etc):

# Using the fast C++ version above as basis
melis@blackbox 14:09:~/concepts/pi-burn$ g++ -ffp-contract=off -march=native -O3 -o pi-sample-square2-off pi-sample-square2.cpp
melis@blackbox 14:11:~/concepts/pi-burn$ g++ -ffp-contract=fast -march=native -O3 -o pi-sample-square2-fast pi-sample-square2.cpp

melis@blackbox 14:11:~/concepts/pi-burn$ ./pi-sample-square2-off 100000
sampled 10000000000 points in 2.209s
avg 4526.4 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
melis@blackbox 14:11:~/concepts/pi-burn$ ./pi-sample-square2-off 100000
sampled 10000000000 points in 2.190s
avg 4565.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

melis@blackbox 14:11:~/concepts/pi-burn$ ./pi-sample-square2-fast 100000
sampled 10000000000 points in 1.893s
avg 5281.7 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
melis@blackbox 14:11:~/concepts/pi-burn$ ./pi-sample-square2-fast 100000
sampled 10000000000 points in 1.891s
avg 5287.3 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998

# Default for GCC is indeed fast
melis@blackbox 14:12:~/concepts/pi-burn$ sha256sum pi-sample-square2 pi-sample-square2-off pi-sample-square2-fast
232065962a932dfe00581a1459fd805e1114ce4715f4a3968a97fa4ae0c3e39e  pi-sample-square2
df606b1e932abf4e30dc8d78e25813a1f479586748a55f6eb3d745477d461233  pi-sample-square2-off
232065962a932dfe00581a1459fd805e1114ce4715f4a3968a97fa4ae0c3e39e  pi-sample-square2-fast

So the question now becomes if I can get similar performance from the Julia code?

Trying 3 different versions:

melis@blackbox 14:15:~/concepts/pi-burn$ cat pi-sample-square.jl
using Printf

function go(n::Int)

    S::Float64 = 1.0 / n
    inside::Int64 = 0

    t0 = time()

    i = 0
    while i < n
        x = (i+0.5)*S
        x2 = x*x
        j = 0
        while j < n
            y = (j+0.5)*S
            y2 = y*y
            if x2 + y2 <= 1.0
                inside += 1
            end
            j += 1
        end
        i += 1
    end

    t1 = time()
    tdiff = t1 - t0

    @printf "sampled %d points in %.3fs\n" n*n tdiff
    @printf "avg %.1f Mpoints/s\n" (n*n)/(tdiff*1000000)
    @printf "%ld inside\n" inside
    @printf "pi ≈ %.16f\n" 4*inside/(n*n)

end

N = parse(Int, ARGS[1])

go(N)
@time go(N)

melis@blackbox 14:15:~/concepts/pi-burn$ diff -ur pi-sample-square.jl pi-sample-square2.jl
--- pi-sample-square.jl	2024-10-02 13:59:29.326125401 +0200
+++ pi-sample-square2.jl	2024-10-02 14:00:06.778844803 +0200
@@ -7,11 +7,11 @@
 
     t0 = time()
 
-    i = 0
+    i::Int32 = 0
     while i < n
         x = (i+0.5)*S
         x2 = x*x
-        j = 0
+        j::Int32 = 0
         while j < n
             y = (j+0.5)*S
             y2 = y*y
melis@blackbox 14:15:~/concepts/pi-burn$ diff -ur pi-sample-square.jl pi-sample-square3.jl
--- pi-sample-square.jl	2024-10-02 13:59:29.326125401 +0200
+++ pi-sample-square3.jl	2024-10-02 14:00:40.494958856 +0200
@@ -7,12 +7,10 @@
 
     t0 = time()
 
-    i = 0
-    while i < n
+    for i in 0:n-1
         x = (i+0.5)*S
         x2 = x*x
-        j = 0
-        while j < n
+        for j in 0:n-1
             y = (j+0.5)*S
             y2 = y*y
             if x2 + y2 <= 1.0

Performance:

melis@blackbox 14:16:~/concepts/pi-burn$ julia pi-sample-square.jl 100000
sampled 10000000000 points in 6.542s
avg 1528.6 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.868s
avg 1455.9 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
  6.868481 seconds (25 allocations: 1.727 KiB)

melis@blackbox 14:16:~/concepts/pi-burn$ julia pi-sample-square2.jl 100000
sampled 10000000000 points in 9.827s
avg 1017.6 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 9.815s
avg 1018.9 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
  9.814950 seconds (25 allocations: 1.727 KiB)

melis@blackbox 14:17:~/concepts/pi-burn$ julia pi-sample-square3.jl 100000
sampled 10000000000 points in 6.484s
avg 1542.2 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.460s
avg 1548.1 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
  6.459681 seconds (25 allocations: 1.727 KiB)

The slowdown of pi-sample-square2.jl is quite interesting, as that was the change that caused the C++ version to get the speedup.

Geh, and now I have to check what @mason posted while I wrote this :slight_smile:

1 Like

Interesting, I’m not seeing the speedup you see here:

melis@blackbox 14:21:~/concepts/pi-burn$ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.5 (2024-08-27)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using Printf

julia> function go2(n)
           S² = 1/n^2
           inside::Int = 0
           tdiff = @elapsed begin
               for i ∈ 1:n
                   x² = (i + 0.5)^2*S²
                   for j ∈ 1:n
                       y² = (j+0.5)^2*S²
                       inside += (x² + y² ≤ 1)
                   end
               end
           end
           @printf "sampled %d points in %.3fs\n" n*n tdiff
           @printf "avg %.1f Mpoints/s\n" (n*n)/(tdiff*1000000)
           @printf "%ld inside\n" inside
           @printf "pi ≈ %.16f\n" 4*inside/(n*n)
       end
go2 (generic function with 1 method)

julia> go2(100_000)
sampled 10000000000 points in 6.630s
avg 1508.3 Mpoints/s
7853781734 inside
pi ≈ 3.1415126936000002

julia> go2(100_000)
sampled 10000000000 points in 6.675s
avg 1498.1 Mpoints/s
7853781734 inside
pi ≈ 3.1415126936000002

julia> go2(Int32(100_000))
sampled 1410065408 points in 6.532s
avg 215.9 Mpoints/s
1107387651 inside
pi ≈ 3.1413795267006508

julia> go2(Int32(100_000))
sampled 1410065408 points in 6.548s
avg 215.3 Mpoints/s
1107387651 inside
pi ≈ 3.1413795267006508

This is maybe a bit off topic since I think you just want to better understand the differences in code generation at hand here, but another easy performance win here can be had with multithreading:

using OhMyThreads

function go3(n; scheduler=StaticScheduler())
    S² = 1/n^2
    tdiff = @elapsed begin
        inside = tmapreduce(+, 1:n; scheduler, init=0) do i
            _inside::Int = 0
            x² = (i + 0.5)^2*S²
            for j ∈ 1:n
                y² = (j+0.5)^2*S²
                _inside += (x² + y² ≤ 1)
            end
            _inside
        end
    end
    @printf "sampled %d points in %.3fs\n" n*n tdiff
    @printf "avg %.1f Mpoints/s\n" (n*n)/(tdiff*1000000)
    @printf "%ld inside\n" inside
    @printf "pi ≈ %.16f\n" 4*inside/(n*n)
end
julia> go3(100_000)
sampled 10000000000 points in 0.202s
avg 49597.8 Mpoints/s
7853781734 inside
pi ≈ 3.1415126936000002

which is a ~6.5x speedup over go2 and a ~30x speedup over go on my 8 core laptop CPU.

1 Like

I know, that was going to be my next step :slight_smile:

Small note: the ranges for i and j need to be 0:n-1 for correct results

1 Like

Didn’t you already do a massive sampling-pi competition at your work a couple of years ago where you got timings down to faster than speed of light (roughly speaking)?

That was using Leibniz formula for π, with a twist:

Ah, I misremembered, sorry!