I like the visuals in this one. As usual, Julia compilation seems to be included in the results:
P.S.: we don't have an HPC category on Discourse. Should it be created? The GPU and Julia at Scale categories could be subcategories.
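On the point about compilation being included in the timings, here is a minimal sketch of the effect using only Base. The function `sum_mod` is a hypothetical stand-in workload, not code from the benchmark repository. The first call to a method pays for JIT compilation; the second call measures run time only.

```julia
# Hypothetical workload for illustration; not taken from the benchmark repository.
function sum_mod(u::Int)
    s = 0
    for j in 1:10^6
        s += j % u
    end
    return s
end

t_first  = @elapsed sum_mod(10)   # includes JIT compilation of sum_mod(::Int)
t_second = @elapsed sum_mod(10)   # method already compiled: run time only

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

A harness that launches a fresh process per measurement and times the whole process would see something closer to the first number, plus Julia's startup time.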
The not-yet-default version excludes it (except for hello-world): Languages Benchmark Visualization. There are far fewer languages on there, it's still not clear what specific implementations or versions any of them are, and I haven't been able to click my way to a comprehensive information list, if one exists.
Not sure why there seems to be a 62-64ms island (Java, C, Rust, etc.) and a 205-232ms island (Python JIT, Julia, Racket) for the loops benchmark. The number for Julia is plausible: I quickly tried the custom benchmark module and BenchmarkTools and got ~158ms for the loops call either way, which is an expected difference for CPUs released within a few years of each other. I tried preallocating the array and switching up the integer types but saved 1ms at best. If anyone can spot the important difference between the Julia version (64-bit signed Int) and the Rust (also LLVM; 32-bit unsigned u32) or C version (gcc; 32-bit? signed int), feel free to share. Could it just be 10k-element vectors on the stack?
The fibonacci result of 0.00 for Julia (woohoo, 1st place?) is caused by a deliberate use of Val inputs to shift all the work to compile time, which is not included in the benchmark results due to function barriers (not a discarded warmup benchmark run, as the custom benchmark implementation suggests). It's as valid as running a more typical program in the setup of a benchmark that only times passing the result along, and I'm sure other languages can pull off the same thing with compile-time computation, just probably AOT.
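To illustrate the Val trick, here is a minimal sketch; it is not the benchmark's actual implementation. `fib_val` is a hypothetical name. Because the argument is encoded in the type, each `Val{N}` gets its own specialization, and the compiler can typically fold the whole recursion into a constant while compiling, leaving almost nothing to do at run time.

```julia
# Hypothetical sketch of a Val-based Fibonacci; not the benchmark's code.
fib_val(::Val{0}) = 0
fib_val(::Val{1}) = 1
# N is a type parameter, so it is known when this method is specialized.
fib_val(::Val{N}) where {N} = fib_val(Val(N - 1)) + fib_val(Val(N - 2))

# Ordinary run-time version for comparison.
fib_rt(n::Int) = n < 2 ? n : fib_rt(n - 1) + fib_rt(n - 2)

println(fib_val(Val(20)))  # 6765
println(fib_rt(20))        # 6765
```

The first call to `fib_val(Val(20))` pays for compiling every specialization from `Val{20}` down; later calls with the same `Val` have little left to execute, which is consistent with a reported time near zero.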
I don't really understand how there are not one, but four, languages above C on a simple loop (and one of them being Java?).
I'd chalk a 0.07ms discrepancy out of 63ms up to OS jitter; it's not a real-time environment, as acknowledged.
I suspect most of the difference is due to the 32 vs 64 bit integers. I changed the function to use the type of its input throughout.
function loops(u::T)::T where {T}
    a = zeros(T, 10^4)                     # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))                 # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)        # Outer loop over array indices
        @inbounds for j in T(1):T(10^4)    # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u                  # Simple sum
        end
        a[i] += r                          # Add a random value to each element in array
    end
    return @inbounds a[r]                  # Return the element at the random index
end
On an older Intel CPU, I get:
julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, ivybridge)
Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores)
Environment:
JULIA_NUM_THREADS = 4
julia> @benchmark loops(10)
BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
Range (min … max): 326.743 ms … 335.066 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 328.532 ms ┊ GC (median): 0.00%
Time (mean ± σ): 328.833 ms ± 1.961 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
327 ms Histogram: frequency by time 335 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
Range (min … max): 211.762 ms … 217.078 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 213.031 ms ┊ GC (median): 0.00%
Time (mean ± σ): 213.261 ms ± 1.170 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
212 ms Histogram: frequency by time 217 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
The Int32 benchmark is pretty much equal to the C benchmark.
[loops] $ gcc benchmark.c loops.c -lm -O3
[loops] $ ./a.out 2000 3000 10
...
..
240.228608,2.209823,238.869205,246.366234,9,54136
There seems to be some CPU dependence. I also ran things on a much newer AMD system, and got the same time for Int32 and Int64, both of which match the C version.
julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 24 × AMD Ryzen 9 9900X 12-Core Processor
WORD_SIZE: 64
LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 12 default, 0 interactive, 6 GC (on 24 virtual cores)
Environment:
JULIA_NUM_THREADS = 12
julia> @benchmark loops(10)
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.854 ms … 108.336 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.926 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.958 ms ± 103.468 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
108 ms Histogram: frequency by time 108 ms <
Memory estimate: 78.16 KiB, allocs estimate: 2.
julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
Range (min … max): 107.177 ms … 108.378 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 107.295 ms ┊ GC (median): 0.00%
Time (mean ± σ): 107.338 ms ± 193.199 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
107 ms Histogram: frequency by time 108 ms <
Memory estimate: 39.09 KiB, allocs estimate: 2.
[loops] $ gcc benchmark.c loops.c -O3 -lm
[loops] $ ./a.out 2000 2000 10
..
..
106.522007,0.079939,106.462386,106.836421,19,54333
Removing allocations as in the loops_noalloc function below makes little difference (results not shown).
using StaticArrays, LoopVectorization

function loops_noalloc(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)            # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))                 # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)        # Outer loop over array indices
        @inbounds for j in T(1):T(10^4)    # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u                  # Simple sum
        end
        a[i] += r                          # Add a random value to each element in array
    end
    return @inbounds a[r]                  # Return the element at the random index
end

function loops_fast(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)            # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))                 # Choose a random index between 1 and 10,000
    @turbo for i in T(1):T(10000)          # Outer loop over array indices
        for j in T(1):T(10000)             # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u                  # Simple sum
        end
        a[i] = a[i] + r                    # Add a random value to each element in array
    end
    return @inbounds a[r]                  # Return the element at the random index
end
However, you can make things much faster by using LoopVectorization (admittedly, it's questionable whether this is in the spirit of the original language comparison benchmark). On my newer AMD machine, I get:
julia> @benchmark loops_fast(Int64(10))
BenchmarkTools.Trial: 2986 samples with 1 evaluation per sample.
Range (min … max): 1.672 ms … 1.775 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.674 ms ┊ GC (median): 0.00%
Time (mean ± σ): 1.674 ms ± 4.688 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
1.67 ms Histogram: log(frequency) by time 1.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark loops_fast(Int32(10))
BenchmarkTools.Trial: 5926 samples with 1 evaluation per sample.
Range (min … max): 841.047 μs … 1.289 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 842.871 μs ┊ GC (median): 0.00%
Time (mean ± σ): 843.270 μs ± 8.455 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
841 μs Histogram: log(frequency) by time 855 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
I think I misread the Rust (I don't use it); another file specifies u32 for array elements. If I did know Rust, I'd have matched the numeric types and compared the LLVM IR.
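On the Julia side, comparing the LLVM IR for the two integer widths can be sketched as follows. `inner_sum` is a hypothetical reduction that mirrors only the inner loop of `loops`; it is not from the benchmark repository. `code_llvm` comes from the InteractiveUtils standard library.

```julia
using InteractiveUtils  # provides code_llvm

# Hypothetical helper mirroring the inner loop of `loops`; an assumption for illustration.
function inner_sum(u::T) where {T<:Integer}
    s = zero(T)
    for j in T(1):T(10^4)
        s += j % u
    end
    return s
end

# Capture the optimized IR for each integer width as a string.
ir32 = sprint(io -> code_llvm(io, inner_sum, (Int32,); debuginfo=:none))
ir64 = sprint(io -> code_llvm(io, inner_sum, (Int64,); debuginfo=:none))

println(ir32)
println(ir64)
```

Diffing the two dumps shows whether the remainder operation and any vectorization differ between `i32` and `i64` on a given CPU target. Matching this against Rust output would still require emitting IR from rustc separately.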
Happened to me too, but on Intel (i7-1065G7). The 1.5x difference for your older Intel CPU makes sense for a 2x change in numeric type size, but the 3.25x difference on the benchmark's M1 Max (also now noticing the M4 Max option with a 2.5x difference) surprises me.