Yet another language benchmark

juliohm · April 10, 2025, 11:37am

I like the visuals in this one. As usual, Julia compilation seems to be included in the results:

P.S.: We don’t have an HPC category on Discourse. Should one be created? The GPU and Julia at Scale categories could become subcategories.

Benny · April 10, 2025, 2:47pm

The not-yet-default version excludes compilation except for hello-world: Languages Benchmark Visualization. There are far fewer languages on there, it’s still not clear which specific implementations or versions any of them are, and I haven’t been able to click my way to a comprehensive information list, if one exists.

I’m not sure why there seems to be a 62-64ms island (Java, C, Rust, etc.) and a 205-232ms island (Python JIT, Julia, Racket) for the loops benchmark. The number for Julia is plausible: I quickly tried the custom benchmark module and BenchmarkTools and got ~158ms for the loops call either way, which is an expected difference for CPUs released within a few years of each other. I tried preallocating the array and switching up the integer types but saved 1ms at best. If anyone can spot the important difference between the Julia version (64-bit signed Int) and the Rust (also LLVM; 32-bit unsigned u32) or C version (gcc; 32-bit? signed int), feel free to share. Could it just be 10k-element vectors on the stack?
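For anyone who wants to repeat the preallocation experiment, here is a minimal sketch of what it can look like. The name `loops!` and its signature are hypothetical (they are not taken from the benchmark repository); the buffer is created once by the caller and reused, and the integer type follows the buffer's element type.

```julia
# Hypothetical preallocated variant; `loops!` is an illustrative name,
# not a function from the benchmark repository.
function loops!(a::Vector{T}, u::T) where {T<:Integer}
    fill!(a, zero(T))                  # reset the caller-owned buffer
    r = rand(T(1):T(length(a)))        # random index into the buffer
    @inbounds for i in eachindex(a)    # outer loop over array indices
        for j in T(1):T(10^4)          # inner loop: 10,000 iterations
            a[i] += j % u
        end
        a[i] += r
    end
    return a[r]
end

buf = zeros(Int32, 10^4)               # allocated once, outside any timing
result = loops!(buf, Int32(10))
```

With `u = 10` every element ends up as 45000 + r, so the return value should land between 45001 and 55000 whichever integer type is used. With BenchmarkTools, the buffer would typically be interpolated into the timed expression, for example `@benchmark loops!($buf, Int32(10))`.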

The fibonacci result of 0.00 for Julia (woohoo 1st place?) is caused by a deliberate use of Val inputs to shift all the work to compile-time, which is not included in the benchmark results due to function barriers (not a discarded warmup benchmark run as the custom benchmark implementation suggests). It’s as valid as running a more typical program in a setup to a benchmark of just passing the result, and I’m sure other languages can pull off the same thing with compile-time computation, just probably AOT.
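A rough illustration of the `Val` trick, for readers who have not seen it. The function names here are hypothetical and this is not the code from the benchmark repository. Whether the compiler reduces the whole call to a constant depends on inference and constant folding in the Julia version at hand, so treat it as a sketch of the idea.

```julia
# Illustrative only: `fib_val`, `fib_runtime` and `fib_barrier` are
# hypothetical names, not the benchmark repository's code.
fib_val(::Val{0}) = 0
fib_val(::Val{1}) = 1
fib_val(::Val{N}) where {N} = fib_val(Val(N - 1)) + fib_val(Val(N - 2))

# Ordinary runtime recursion, for comparison.
fib_runtime(n::Int) = n < 2 ? n : fib_runtime(n - 1) + fib_runtime(n - 2)

# Function barrier: `n` is a runtime value here, so `Val(n)` leads to a
# dynamic dispatch onto a method instance specialized for that n.
fib_barrier(n::Int) = fib_val(Val(n))

fib_barrier(20) == fib_runtime(20)   # both compute the same number
```

The first call for a given `n` pays for compiling one specialization per value from 0 to `n`; later calls with the same `n` reuse the compiled code, which is why timing only those later calls can show almost nothing.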

adienes · April 10, 2025, 2:54pm

I don’t really understand how there are not one but four languages above C on a simple loop (and one of them is Java?).

Benny · April 10, 2025, 3:09pm

I’d chalk a 0.07ms discrepancy out of 63ms up to OS jitter; it’s not a real-time environment, as acknowledged.

Paul_Schrimpf · April 10, 2025, 5:51pm

I suspect most of the difference is due to 32-bit vs 64-bit integers. I changed the function to use the type of its input throughout.

function loops(u::T)::T where {T}
    a = zeros(T, 10^4)          # Allocate an array of 10,000 zeros
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u         # Simple sum
        end
        a[i] += r                 # Add a random value to each element in array
    end
    return @inbounds a[r]                   # Return the element at the random index
end

On an older Intel CPU, I get

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, ivybridge)
Threads: 4 default, 0 interactive, 2 GC (on 4 virtual cores)
Environment:
  JULIA_NUM_THREADS = 4

julia> @benchmark loops(10)
BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
 Range (min … max):  326.743 ms … 335.066 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     328.532 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   328.833 ms ±   1.961 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▁ ▁▁  █  ▁▁█ ▁▁   ▁  ▁    ▁                               ▁  
  █▁█▁██▁▁█▁▁███▁██▁▁▁█▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  327 ms           Histogram: frequency by time          335 ms <

 Memory estimate: 78.16 KiB, allocs estimate: 2.

julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 24 samples with 1 evaluation per sample.
 Range (min … max):  211.762 ms … 217.078 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     213.031 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   213.261 ms ±   1.170 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃█   ▃     ▃              ▃                           
  ▇▇▁▇▁▇▁▇▁██▁▇▁█▇▇▁▇▇█▁▁▁▁▁▁▇▁▁▇▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  212 ms           Histogram: frequency by time          217 ms <

 Memory estimate: 39.09 KiB, allocs estimate: 2.

The Int32 benchmark is reasonably close to the C benchmark (a median of about 213 ms above versus about 240 ms in the C output below).

[loops] $ gcc benchmark.c loops.c -lm -O3 
[loops] $ ./a.out 2000 3000 10
...
..
240.228608,2.209823,238.869205,246.366234,9,54136
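The two memory estimates in the BenchmarkTools output above (78.16 KiB and 39.09 KiB) give a quick sanity check that the element type really changed with the argument. Here is a small sketch of the arithmetic. The few dozen bytes beyond the raw element data are presumably array bookkeeping; that is an assumption, not something measured here.

```julia
n = 10^4

bytes64 = n * sizeof(Int64)      # 80_000 bytes of element data
bytes32 = n * sizeof(Int32)      # 40_000 bytes of element data

kib64 = bytes64 / 1024           # 78.125 KiB, just under the reported 78.16 KiB
kib32 = bytes32 / 1024           # 39.0625 KiB, just under the reported 39.09 KiB

# The remainder keeps the type of its operands, so with an Int32 input
# the inner-loop arithmetic stays 32-bit as well.
rem_type = typeof(Int32(7) % Int32(10))
```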

There seems to be some CPU dependence. I also ran things on a much newer AMD system, and got the same time for Int32 and Int64, both of which match the C version.

julia> versioninfo()
Julia Version 1.11.4
Commit 8561cc3d68d (2025-03-10 11:36 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 9900X 12-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, generic)
Threads: 12 default, 0 interactive, 6 GC (on 24 virtual cores)
Environment:
  JULIA_NUM_THREADS = 12

julia> @benchmark loops(10)
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
 Range (min … max):  107.854 ms … 108.336 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     107.926 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.958 ms ± 103.468 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ █ ▄ █▄▄█▁    ▄ ▁
  █▁█▆█▆█████▆▆▁▆█▆█▁▆▁▆▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▆▁▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  108 ms           Histogram: frequency by time          108 ms <

 Memory estimate: 78.16 KiB, allocs estimate: 2.

julia> @benchmark loops(Int32(10))
BenchmarkTools.Trial: 47 samples with 1 evaluation per sample.
 Range (min … max):  107.177 ms … 108.378 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     107.295 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.338 ms ± 193.199 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▂ ▅
  ▅▄▅██████▄▄▄▁▄▁▁▄▄▁▁▁▁▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
  107 ms           Histogram: frequency by time          108 ms <

 Memory estimate: 39.09 KiB, allocs estimate: 2.

[loops] $ gcc benchmark.c loops.c -O3 -lm
[loops] $ ./a.out 2000 2000 10
..
..
106.522007,0.079939,106.462386,106.836421,19,54333

Removing allocations as in the loops_noalloc function below makes little difference (results not shown).

using StaticArrays, LoopVectorization

function loops_noalloc(u::T)::T where {T}
    a = @MVector zeros(T, 10^4)          # Fixed-size mutable vector of 10,000 zeros (StaticArrays MVector)
    r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
    @inbounds for i in T(1):T(10^4)     # Outer loop over array indices
        @inbounds for j in T(1):T(10^4) # Inner loop: 10,000 iterations per outer loop iteration
            a[i] += j % u         # Simple sum
        end
        a[i] += r                 # Add a random value to each element in array
    end
    return @inbounds a[r]                   # Return the element at the random index
end


function loops_fast(u::T)::T where {T}
  a = @MVector zeros(T, 10^4) # Fixed-size mutable vector of 10,000 zeros (StaticArrays MVector)
  r = rand(T(1):T(10^4))              # Choose a random index between 1 and 10,000
  @turbo for i in T(1):T(10000) # Outer loop over array indices
    for j in T(1):T(10000) # Inner loop: 10,000 iterations per outer loop iteration
      a[i] += j % u         # Simple sum
    end
    a[i] = a[i] + r                 # Add a random value to each element in array
  end
  return @inbounds a[r]                   # Return the element at the random index
end

However, you can make things much faster by using LoopVectorization (admittedly, it’s questionable whether this is in the spirit of the original language comparison benchmark). On my newer AMD machine, I get

julia> @benchmark loops_fast(Int64(10))
BenchmarkTools.Trial: 2986 samples with 1 evaluation per sample.
 Range (min … max):  1.672 ms … 1.775 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.674 ms             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.674 ms ± 4.688 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▇▄▇█▆▄▂▁▁                                                ▁
  ███████████▇▅▇▆▇▆▆▅▆▅▆▆▆▅▄▅▅▅▃▆▃▁▅▄▄▅▄▅▆▄▅▄▁▄▁▆▅▁▄▅▆▆▃▅▅▆ █
  1.67 ms     Histogram: log(frequency) by time      1.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark loops_fast(Int32(10))
BenchmarkTools.Trial: 5926 samples with 1 evaluation per sample.
 Range (min … max):  841.047 μs … 1.289 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     842.871 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   843.270 μs ± 8.455 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▄▅▂▂▂▇▇█▅▄▃▂▁                                              ▂
  ████████████████▇▇▇▆▅▇▆▅▆▆▆▅▆▅▆▆▅▆▅▆▅▇▅▆▇▇▆▆▆▄▅▁▄▅▄▃▃▅▃▄▁▄▅ █
  841 μs       Histogram: log(frequency) by time       855 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
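To put the Int64 versus Int32 gap side by side, the median times reported in this post can be turned into ratios. This is only arithmetic on the numbers already shown; the configuration labels are made up for the sketch.

```julia
# Median times copied from the BenchmarkTools output in this post (milliseconds).
# The labels (intel_plain, amd_plain, amd_turbo) are illustrative names.
medians_ms = (
    intel_plain = (int64 = 328.532, int32 = 213.031),
    amd_plain   = (int64 = 107.926, int32 = 107.295),
    amd_turbo   = (int64 = 1.674,   int32 = 0.842871),
)

# Int64 time divided by Int32 time for each configuration.
ratios = map(m -> m.int64 / m.int32, medians_ms)

for (name, r) in pairs(ratios)
    println(rpad(string(name), 12), round(r, digits = 2))
end
```

The ratios come out at roughly 1.54 on the older Intel machine, 1.01 on the AMD machine with the plain loop, and 1.99 on the AMD machine with `@turbo`. So on AMD the integer width only starts to matter once the loop is vectorized.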

Benny · April 10, 2025, 6:44pm

I think I misread the Rust code (I don’t use it); another file specifies u32 for array elements. If I did know Rust, I’d have matched the numeric types and compared the LLVM IR.

Happened to me too, but on Intel (i7-1065G7). The 1.5x difference for your older Intel CPU makes sense for a 2x change in numeric type size, but the 3.25x difference on the benchmark’s M1 Max (also now noticing the M4 Max option with a 2.5x difference) surprises me.

PetrKryslUCSD · June 14, 2025, 6:17pm

Isn’t that 650 ms vs 2200 ms?

PetrKryslUCSD · June 14, 2025, 6:55pm

Really?

Tortar · June 14, 2025, 10:08pm

I don’t really buy this benchmark. Looking at those results, it is clear that things are not measured correctly. I liked GitHub - attractivechaos/plb2: A programming language benchmark instead.

acxz · June 15, 2025, 12:13pm

We also have our own here: GitHub - JuliaLang/Microbenchmarks: Microbenchmarks comparing the Julia Programming language with other languages
Although it is a bit outdated and could use an update.
