[YouTube/GitHub] What is the FASTEST Computer Language? 45 Languages Tested

essenciary · July 12, 2021, 1:37pm

A pretty cool project which includes Julia:

blackeneth · July 13, 2021, 3:35pm

The Julia code he’s running is here

He spells out the rules for the code in episode 2. Single threaded and prefers result as a bit vector. Algorithm faithful to the C algorithm.

SIMD would be fair game, though.

louie4825 · July 21, 2021, 2:25am

Just a little update, we seem to have three solutions now for Julia, and it would be great if anyone else could come and take a look, maybe help improve the current solutions! Let’s see how close Julia can get to the current kings (probably C, surprisingly Zig)

Full disclosure: I am the “author” of solution_3, which uses @simd, bitwise operations, and other tricks similar to PrimeC/sieve_1of2.c, which I based that solution on.

Before solution_1 was updated, here is a small snippet from a benchmark run on my machine (Intel Core i5-9300H, 24GB RAM, Docker under Ubuntu 20.04 under WSL under Windows 10 Home):

After solution_1 was updated, it seems to perform a bit slower than solution_2 (I unfortunately do not have a full benchmark run of this yet):

Primes/PrimeJulia/solution_1 on  drag-race via ஃ v1.6.2 took 5s
❯ julia PrimeSieveJulia.jl
true
Passes: 5744, Time: 5.000568866729736, Avg: 0.0008705725742913887, Limit: 1000000, Count: 78498, Valid: true

dcbi;5744;5.000568866729736;1;algorithm=base,faithful=yes,bits=1

Another snippet of my benchmark runs (filtered for base-faithful-1bit solutions):

louie4825 · July 21, 2021, 2:27am

On a related note, in this inner loop in solution_3:

    @simd for index in _div2(factor * factor):factor:max_index
        unsafe_set_bit_at_index!(arr, index)
    end

It seems that @simd can actually make the code slower on older machines (such as an older Windows 10 PC or the Raspberry Pi). Could anyone explain why that is?

lungben · July 21, 2021, 6:10am

How significant is compile-time for the Julia results? Probably compile-time is included for Julia, but not for C, Rust, etc.
To make a fair comparison, the Julia code should be called once before the measurement starts. Alternatively (if this is not allowed), a sysimage could be created in advance.

Is it possible to speed up calculation further with LoopVectorization.jl?

louie4825 · July 21, 2021, 6:50am

I can’t speak for the other solutions, but I tried to take compile-time into account for the Julia results by calling the run_sieve! function once before the main timed code, but I might not have fully accounted for it.

I did try to create a custom sysimage and try package precompilation, but the sysimage took forever to create and it didn’t seem to affect the results much. Package precompilation also seemed to have little effect, maybe just making the results a bit more consistent? I’m not sure.

Is it possible to speed up calculation further with LoopVectorization.jl?

Maybe, but one of the rules is that there should be no external dependencies to actually calculate the sieve. Plus, bit arrays need to make sure that no two chunks are accessed at the same time. I think you can implement that, though, by checking that the step in the factor clearing loop is greater than one chunk.

Pull requests are really welcome in this regard.

louie4825 · July 21, 2021, 7:58am

Actually, could StaticArrays help the performance here, as well as LoopVectorization?

louie4825 · August 2, 2021, 3:26am

Actually, I’ve tried LoopVectorization, and it seems really finicky. I have to avoid StepRanges and calculate the index with each iteration. I also have to remove the functions and manually inline the code since I think there’s some type changes that mess with the function unsafe_set_bit_at_index!.

julia> include("primes_1of2.jl")

julia> using LoopVectorization

julia> function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
           factor = _map_to_factor(factor_index)
           factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
           start_index = _div2(factor * factor)
           iterations = cld(max_index - start_index, factor) - 1
           @turbo for i in 0:iterations
               index = start_index + (factor * i)
               @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
           end
           return arr
       end

Although, it doesn’t seem particularly fast:

using BenchmarkTools
const x = zeros(UInt, _get_num_uints(1_000_000))

julia> @benchmark clear_factors!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.200 μs …  48.680 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.270 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.360 μs ± 749.920 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▅▆▆▇▃▅▂                                                   ▂
  █████████▇▆▁▅▅▄▄▄▄▃▃▄▅▄▄▃▃▅▄▄▄▆▅▆▇▇█▇█▇▇▇██▇█▇▇▆▇▆▆▇▆▆▆▅▆▇▆ █
  1.2 μs       Histogram: log(frequency) by time      2.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  2.010 μs … 64.220 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.110 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.570 μs ±  1.544 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▃▃▂▂▂▂▂▂▂▂▂▂▁▁                                          ▁
  █████████████████████▇▇▇▆▇▇▆▇▆▆▆▆▆▅▅▆▅▅▅▄▅▅▃▄▄▄▄▂▃▃▃▃▃▃▂▂▄ █
  210 μs       Histogram: log(frequency) by time     7.66 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

I’m running out of clues as to how to speed the code up. It’s as close as it can get for the C implementation, but falls a bit behind the new “striped” Rust implementation:

Primes on  drag-race [?] took 6s
❯ julia PrimeJulia/solution_3/primes_1of2.jl
Settings: sieve_size = 1000000 | duration = 5
Number of trues: 78498
primes_1of2.jl: Passes: 9580 | Elapsed: 5.00019097328186 | Passes per second: 1915.926821833406 | Average pass duration: 0.0005219406026390251
louie-github_port_1of2;9580;5.00019097328186;1;algorithm=base,faithful=yes,bits=1

Primes on  drag-race [?] took 5s
❯ ./PrimeC/solution_2/sieve_1of2
danielspaangberg_1of2;10373;5.000004;1;algorithm=base,faithful=yes,bits=1

Primes/PrimeRust/solution_1 on  drag-race is 📦 v0.1.0 via 🦀 v1.54.0 took 28s
❯ cargo run --release
...

Computing primes to 1000000 on 1 thread for 5 seconds.
byte-storage    Passes: 7777, Threads: 1, Time: 5.0005636215, Average: 0.0006429939, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_byte-storage;7777;5.0005636215;1;algorithm=base,faithful=yes,bits=8

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage     Passes: 7597, Threads: 1, Time: 5.0007038116, Average: 0.0006582472, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage;7597;5.0007038116;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-rotate Passes: 8043, Threads: 1, Time: 5.0002326965, Average: 0.0006216875, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-rotate;8043;5.0002326965;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped Passes: 12818, Threads: 1, Time: 5.0003485680, Average: 0.0003901036, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped;12818;5.0003485680;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped-blocks Passes: 17255, Threads: 1, Time: 5.0003576279, Average: 0.0002897918, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks;17255;5.0003576279;1;algorithm=base,faithful=yes,bits=1
bit-storage-striped-blocks-small Passes: 16040, Threads: 1, Time: 5.0003647804, Average: 0.0003117434, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks-small;16040;5.0003647804;1;algorithm=base,faithful=yes,bits=1

...

Elrod · August 2, 2021, 4:57am

You shouldn’t have to. This isn’t especially well tested, but there are a couple. If it doesn’t work, that is a bug. Please file an issue.

Give me an example I can copy and paste to run (either here or as an issue at LoopVectorization), and I’ll look at it.
However, this line:
arr[(index + 63) >> _div_uint_size_shift] |=
would require both a gather and a scatter.
Without AVX512, the scatter is done by breaking up the vector into a series of scalar operations, which is slower than a scatter, and also slower than scalar operations without the need for extracting elements from a vector.

Given that the actual amount of computation is very small, my guess is that not being SIMD is fvaster, and that clear_factors! isn’t SIMD.
But I’d have to look at the code, and having something I can copy/paste would make that a lot easier for me.

louie4825 · August 2, 2021, 5:53am

I didn’t actually expect LoopVectorization to make the code any faster since it isn’t very conducive to SIMD, as you said. It has a large stride between array elements. Here’s the code I used to test it, similar to what I posted before:

using BenchmarkTools
using LoopVectorization

# Auxillary functions
begin
const _uint_bit_length = sizeof(UInt) * 8
const _div_uint_size_shift = Int(log2(_uint_bit_length))
@inline _mul2(i::Integer) = i << 1
@inline _div2(i::Integer) = i >> 1
@inline _map_to_index(i::Integer) = _div2(i - 1)
@inline _map_to_factor(i::Integer) = _mul2(i) + 1
@inline _mod_uint_size(i::Integer) = i & (_uint_bit_length - 1)
@inline _div_uint_size(i::Integer) = i >> _div_uint_size_shift
@inline _get_chunk_index(i::Integer) = _div_uint_size(i + (_uint_bit_length - 1))
@inline _get_bit_index_mask(i::Integer) = UInt(1) << _mod_uint_size(i - 1)
end

# Main code
function clear_factors_while!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    index = _div2(factor * factor)
    while index <= max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
        index += factor
    end
    return arr
end

function clear_factors_simd!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    @simd for index in _div2(factor * factor):factor:max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
    end
    return arr
end

function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
    start_index = _div2(factor * factor)
    iterations = cld(max_index - start_index, factor) - 1
    @turbo for i in 0:iterations
        index = start_index + (factor * i)
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end

println(
    clear_factors_while!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_simd!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_turbo!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
)

const x = zeros(UInt, cld(500_000, sizeof(UInt) * 8))
@benchmark clear_factors_while!(x, 202, 500_000)
@benchmark clear_factors_simd!(x, 202, 500_000)
@benchmark clear_factors_turbo!(x, 202, 500_000)

louie4825 · August 2, 2021, 5:55am

Changing the function I defined above to use a StepRange causes this to happen:

function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory pendencies")
    @turbo for index in _div2(factor * factor):factor:max_index
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end

julia> clear_factors_turbo!(x, 202, 500_000)
ERROR: MethodError: no method matching vmul_nsw(::Static.StaticInt{4})
Closest candidates are:
  vmul_nsw(::Static.StaticInt{M}, ::Static.StaticInt{N}) where {M, N} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:37
  vmul_nsw(::Static.StaticInt{M}, ::T) where {M, T<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:40
  vmul_nsw(::Static.StaticInt{M}, ::VectorizationBase.MM{W, X, I} where I<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, Static.StaticInt}) where {M, W, X} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/vector_width.jl:56
  ...
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
 [2] _turbo_!
   @ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
 [3] clear_factors_turbo!(arr::Vector{UInt64}, factor_index::Int64, max_index::Int64)
   @ Main ./REPL[13]:4
 [4] top-level scope
   @ REPL[17]:1

blackeneth · August 2, 2021, 3:56pm

The rules he spelled out in episode 2 specified single thread code only. Unless he’s changed it since then.

Elrod · August 3, 2021, 7:35am

That now works on the latest release, and I added it to the tests.

Here are my benchmark results on a laptop with AVX512:

julia> @benchmark clear_factors_while!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  504.907 ns …  1.173 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     517.402 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   515.285 ns ± 15.055 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆▆        ▂█▄
  ▄██▃▂▃▃▂▂▂▂███▂▃▄▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  505 ns          Histogram: frequency by time          568 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_simd!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  624.531 ns …  1.735 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     627.411 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   633.860 ns ± 22.444 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▇ ▃▄▂  ▁▁▅▆▃▁▃▂▁                                         ▁ ▂
  ██████████████████▇▆▆▅▆▄▄▃▆▄▅▄▆▃▃▄▅▅▄▄▃▃▁▄▅▄▄▃▅▄▃▁▃▃▁▄▆▃▃▄▇█ █
  625 ns        Histogram: log(frequency) by time       704 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  498.041 ns …  3.463 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     502.067 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   506.635 ns ± 31.779 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅██▆▂▃▄▂▁ ▂▅▆▅▂▁▂▂▁                                       ▁ ▂
  ██████████▇█████████▇▆▆▆▅▄▅▄▅▄▄▅▅▆▅▄▅▄▅▅▄▄▄▅▄▄▃▅▅▅▅▃▁▄▃▅▄▅██ █
  498 ns        Histogram: log(frequency) by time       563 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo_steprange!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  505.534 ns …  1.396 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     508.373 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   511.387 ns ± 20.638 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃██▄▂▄▃                                                      ▂
  ████████▆▇▇▆▇▄▆▄▅▅▅▆▅▅▅▄▄▄▅▄▄▄▄▅▄▄▄▄▄▅▄▅▅▄▄▁▃▃▄▅▅▇█▆▅▄▄▃▄▄▅▄ █
  506 ns        Histogram: log(frequency) by time       583 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Linux perf results for each:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_while!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.50e+09  100.0%  #  4.7 cycles per ns
└ task-clock               5.35e+08  100.0%  # 534.8 ms
┌ instructions             7.24e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      1.03e+09  100.0%  # 14.3% of instructions
└ branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
┌ L1-dcache-load-misses    6.40e+08  100.0%  # 61.9% of dcache loads
│ L1-dcache-loads          1.03e+09  100.0%
│ cache-misses             9.33e+02  100.0%  # 89.5% of cache references
└ cache-references         1.04e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_simd!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.28e+09  100.0%  #  4.7 cycles per ns
└ task-clock               4.88e+08  100.0%  # 487.9 ms
┌ instructions             7.34e+09  100.0%  #  3.2 insns per cycle
│ branch-instructions      1.06e+09  100.0%  # 14.4% of instructions
└ branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
┌ L1-dcache-load-misses    6.42e+08  100.0%  # 61.5% of dcache loads
│ L1-dcache-loads          1.04e+09  100.0%
│ cache-misses             1.02e+03  100.0%  # 82.7% of cache references
└ cache-references         1.23e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.30e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.03e+08  100.0%  # 502.5 ms
┌ instructions             2.02e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      7.20e+07  100.0%  #  3.6% of instructions
└ branch-misses            1.54e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.68e+08  100.0%  # 480.5% of dcache loads
│ L1-dcache-loads          1.39e+08  100.0%
│ cache-misses             9.03e+02  100.0%  # 82.6% of cache references
└ cache-references         1.09e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo_steprange!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.34e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.12e+08  100.0%  # 512.4 ms
┌ instructions             2.13e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      1.62e+08  100.0%  #  7.6% of instructions
└ branch-misses            5.57e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.65e+08  100.0%  # 429.3% of dcache loads
│ L1-dcache-loads          1.55e+08  100.0%
│ cache-misses             1.13e+03  100.0%  # 73.4% of cache references
└ cache-references         1.54e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

With @benchmark, @simd was the slowest, but when timing running it a million times it was the fastest.

As we expected, the @turbo versions executed many less instructions – just over 2 billion vs over 7 billion for the other two. However, some of the instructions executed (gather and scatter) are slow, and thus this didn’t translate to an actual advantage in performance.
The turbo version also naturally had many fewer branches, but that wasn’t really a factor either.
Also, lots of cache misses.

x86_64 cpus without AVX512 don’t have scatter instructions, so they wouldn’t have the advantage in fewer instructions executed.

LoopVectorization isn’t unrolling the steprange version by default. Unrolling more doesn’t really seem to help performance, e.g. unrolling 4x:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo_steprange_u4!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.29e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.02e+08  100.0%  # 501.6 ms
┌ instructions             1.73e+09  100.0%  #  0.8 insns per cycle
│ branch-instructions      6.80e+07  100.0%  #  3.9% of instructions
└ branch-misses            2.89e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.55e+08  100.0%  # 419.9% of dcache loads
│ L1-dcache-loads          1.56e+08  100.0%
│ cache-misses             9.92e+02  100.0%  # 82.7% of cache references
└ cache-references         1.20e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The steprange version avoids the integer multiplication. SIMD integer multiplication is slow, but that doesn’t matter here since we’re totally bottlenecked on the memory accesses.

jroon · June 23, 2023, 6:32am

Updated video from Plummer:

Julia not featured in the top 5

AMJ · June 23, 2023, 6:53am

It doesn’t make sense how zig and rust is so much faster than C. Backend for most of these languages are llvm, faster ones should at least be on par with each other.

jlapeyre · June 23, 2023, 10:03pm

“by implementing the same algorithm … we’re able to measure just how performant the code generated by each language really is”

No arguing there. I guess I better brush up on Java and C++.

The video also supplies some really interesting facts about Java, C++, etc.

Henrique_Becker · June 23, 2023, 10:22pm

Unfortunately, this is not a very scientific video, there is not an analysis of the compiler flags, code, and the assembly generated so we understand what has happened here. For Zig to be above C then probably some big optimization like SIMD may have happened automatically and that is just it, not only this such benefit may change from computer architecture to computer architecture but also knowing that this is the difference you could probably just nudge the code in the other languages to make the same optimization.

Benny · June 23, 2023, 10:39pm

Presumably there’s a resource somewhere outlining what the algorithm standards are. I haven’t been able to find it, there’s no plainly given links on the github page. I’m sure there is one because submissions are marked faithful vs deviation among other details. I’m also fairly certain that submissions with third-party dependencies are not faithful, judging from the description of one of the Python submissions.

From the reports page, the recently fastest submission is a C++ deviation that heavily constexpr the work to evade the runtime benchmark. I don’t know the algorithm or C++ enough to really understand how much, but there are passes being counted during runtime.

The fastest (and faithful) Julia submission was written by a GordonBGood, who also implemented it with a handful of other languages a couple years ago. The different Julia submissions look very different so I’m not sure if the (faithful) submissions are restricted to a standard algorithm enough to serve as useful language benchmarks, as opposed to algorithm benchmarks.

albheim · June 24, 2023, 5:35am

Looking at the discussion under the clip on yt it seems they used a strange metric which benefitted the zig implentation sine it used all cores instead of all threads. Someone mentioned they did a similar change to the rust and then it was better than zig.

I have a vague memory that he goes into a bit more detail into what he wants the implementation to be in one of his earlier videos on the topic.

Topic		Replies	Views
A humble request/challenge Optimization (Mathematical)	5	1219	August 8, 2017
The Julia Challenge Community	33	3463	November 20, 2018
Julia programs now shown on benchmarks game website Community announcement	144	14183	December 3, 2019
Benchmarks game Performance	20	3886	May 13, 2020
Performance of Julia as a general-purpose programming language Performance	30	7744	November 19, 2017

[YouTube/GitHub] What is the FASTEST Computer Language? 45 Languages Tested

Related topics