With the toy problem below of approximating Pi by sampling a square I get almost 4x lower performance with Julia than with the C++ version. The code is very simple, just two nested loops, no indexing, no type instabilities, and the only allocations come from the @printf's at the end.
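(For reference, the estimator is just the midpoint rule over the unit square: the fraction of the n × n grid points with x² + y² ≤ 1 approximates the area of the quarter disc, so inside / n² ≈ π/4, hence the 4*inside/(n*n) at the end.)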
using Printf

function go(n::Int32)
    S::Float64 = 1.0 / n
    inside::Int64 = 0

    t0 = time()

    i::Int32 = 0
    while i < n
        x = (i+0.5)*S
        x2 = x*x

        j::Int32 = 0
        while j < n
            y = (j+0.5)*S
            y2 = y*y
            if x2 + y2 <= 1.0
                inside += 1
            end
            j += 1
        end

        i += 1
    end

    t1 = time()
    tdiff = t1 - t0

    # Widen to Int64 so n*n doesn't overflow Int32 for n = 100_000
    @printf "sampled %d points in %.3fs\n" Int64(n)*n tdiff
    @printf "avg %.1f Mpoints/s\n" (Int64(n)*n)/(tdiff*1000000)
    @printf "%ld inside\n" inside
    @printf "pi ≈ %.16f\n" 4*inside/(Int64(n)*n)
end
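For completeness, a minimal way to time it from the REPL (assuming BenchmarkTools is installed); the explicit Int32 conversion is needed because of the method signature:

using BenchmarkTools

# go() prints its own results, so @btime's repeated evaluations show up
# as several result blocks before the final timing line.
@btime go(Int32(100_000))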
For comparison the C++ version (which came first):
// g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp
#include <cstdio>
#include <cstdlib>
#include <stdint.h>
#include <sys/time.h>
double time()
{
    timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec/1000000.0;
}
int
main(int argc, char *argv[])
{
    const int N = atoi(argv[1]);
    double S = 1.0 / N;

    double t0 = time();

    // 64-bit counter: inside goes up to ~N*N = 10^10, which overflows int
    int64_t inside = 0;

    for (int64_t i = 0; i < N; i++)
    {
        double x = (i+0.5)*S;
        double x2 = x*x;

        for (int j = 0; j < N; j++)
        {
            double y = (j+0.5)*S;
            double y2 = y*y;
            if (x2 + y2 <= 1.0)
                inside++;
        }
    }

    double t1 = time();
    double tdiff = t1 - t0;

    printf("sampled %ld points in %.3fs\n", (uint64_t)N*N, tdiff);
    printf("avg %.1f Mpoints/s\n", ((uint64_t)(N)*N)/(tdiff*1000000));
    printf("%ld inside\n", inside);
    printf("pi ≈ %.16lf\n", (4.0*inside)/((uint64_t)(N)*N));
}
Performance comparison:
julia> @btime go(100_000)
sampled 10000000000 points in 6.463s
avg 1547.2 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.461s
avg 1547.8 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
sampled 10000000000 points in 6.433s
avg 1554.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
6.433 s (25 allocations: 1.73 KiB)
melis@blackbox 10:13:~/concepts/pi-burn$ g++ -march=native -O3 -o pi-sample-square pi-sample-square.cpp
melis@blackbox 10:13:~/concepts/pi-burn$ ./pi-sample-square 100000
sampled 10000000000 points in 1.840s
avg 5433.5 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
On a hunch I compiled the C++ code with clang and that gives very similar performance to the Julia results, so it might have something to do with LLVM?
melis@blackbox 10:17:~/concepts/pi-burn$ clang++ -march=native -O3 -o pi-sample-square-clang pi-sample-square.cpp
melis@blackbox 10:16:~/concepts/pi-burn$ ./pi-sample-square-clang 100000
sampled 10000000000 points in 6.571s
avg 1521.8 Mpoints/s
7853981733 inside
pi ≈ 3.1415926931999998
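A way to check whether it's a vectorization difference (just a sketch, I haven't gone through the output in detail) is to dump the native code Julia generates for the hot loop and compare it against objdump -d of the GCC and Clang binaries:

using InteractiveUtils   # provides @code_native outside the REPL

# If the inner loop is vectorized we should see packed AVX compares/adds
# (vcmppd, vpaddq, ...) instead of a scalar ucomisd plus conditional add.
@code_native debuginfo=:none go(Int32(1_000))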
Version info:
julia> versioninfo()
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 8 × Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
julia> Sys.CPU_NAME
"skylake"
GCC 14.2.1, Clang 18.1.8
Edit: it turns out I screwed up the C++ code I originally included; I updated the version above to use int for i and j, instead of int64_t as originally.