A pretty cool project which includes Julia:
The Julia code he's running is here.
He spells out the rules for the code in episode 2: single-threaded, result preferably as a bit vector, and an algorithm faithful to the C algorithm.
SIMD would be fair game, though.
Just a little update: we now seem to have three solutions for Julia, and it would be great if anyone else could come take a look and maybe help improve the current solutions! Let's see how close Julia can get to the current kings (probably C and, surprisingly, Zig).
Full disclosure: I am the "author" of solution_3, which uses @simd, bitwise operations, and other tricks similar to PrimeC/sieve_1of2.c, on which I based that solution.
Before solution_1 was updated, here is a small snippet from a benchmark run on my machine (Intel Core i5-9300H, 24GB RAM, Docker under Ubuntu 20.04 under WSL under Windows 10 Home):
After solution_1 was updated, it seems to perform a bit slower than solution_2 (I unfortunately do not have a full benchmark run of this yet):
Primes/PrimeJulia/solution_1 on drag-race via julia v1.6.2 took 5s
❯ julia PrimeSieveJulia.jl
true
Passes: 5744, Time: 5.000568866729736, Avg: 0.0008705725742913887, Limit: 1000000, Count: 78498, Valid: true
dcbi;5744;5.000568866729736;1;algorithm=base,faithful=yes,bits=1
Another snippet of my benchmark runs (filtered for base-faithful-1bit solutions):
On a related note, in this inner loop in solution_3:
@simd for index in _div2(factor * factor):factor:max_index
    unsafe_set_bit_at_index!(arr, index)
end
It seems that @simd can actually make the code slower on older machines (such as an older Windows 10 PC or the Raspberry Pi). Could anyone explain why that is?
How significant is compile time for the Julia results? Compile time is probably included for Julia, but not for C, Rust, etc.
To make a fair comparison, the Julia code should be called once before the measurement starts. Alternatively (if this is not allowed), a sysimage could be created in advance.
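For illustration, excluding compile time could look something like this (compute_primes here is just a stand-in workload, not the actual drag-race entry point):

# Hedged sketch: time only the second call, so JIT compilation is excluded.
compute_primes(limit) = count(isodd, 1:limit)   # dummy stand-in for the real sieve

compute_primes(1_000_000)                       # warm-up call pays the compilation cost
t0 = time()
compute_primes(1_000_000)                       # only this call is measured
println("elapsed: ", time() - t0, " s")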
Is it possible to speed up calculation further with LoopVectorization.jl?
I can't speak for the other solutions, but I tried to take compile time into account for the Julia results by calling the run_sieve! function once before the main timed code, though I might not have fully accounted for it.
I did try creating a custom sysimage and using package precompilation, but the sysimage took forever to create and didn't seem to affect the results much. Package precompilation also seemed to have little effect, maybe just making the results a bit more consistent? I'm not sure.
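For anyone curious, the sysimage route looks roughly like this with PackageCompiler.jl (the package list and file names below are placeholders, not my exact setup):

using PackageCompiler   # assumes PackageCompiler.jl is installed

create_sysimage(
    [:BenchmarkTools];                             # packages to bake into the image
    sysimage_path = "primes_sysimage.so",
    precompile_execution_file = "primes_1of2.jl",  # script traced for precompile statements
)
# Then run with:  julia --sysimage primes_sysimage.so primes_1of2.jl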
Is it possible to speed up calculation further with LoopVectorization.jl?
Maybe, but one of the rules is that there should be no external dependencies for actually calculating the sieve. Also, with bit arrays you need to make sure that no two chunks are accessed at the same time. I think you can implement that, though, by checking that the step in the factor-clearing loop is greater than one chunk, along the lines of the sketch below.
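Here is a self-contained sketch of that check (the names and the hard-coded 64-bit chunk arithmetic are illustrative, not taken from any existing solution):

const UINT_BITS = sizeof(UInt) * 8            # 64 on a 64-bit system

function clear_factors_checked!(arr::Vector{UInt}, factor::Integer, max_index::Integer)
    # Vectorizing is only safe when consecutive iterations can never touch the same chunk.
    factor < UINT_BITS && error("step smaller than one chunk; use a scalar loop instead")
    @simd for index in div(factor * factor, 2):factor:max_index
        @inbounds arr[(index + 63) >> 6] |= UInt(1) << ((index - 1) & 63)
    end
    return arr
end

clear_factors_checked!(zeros(UInt, cld(500_000, UINT_BITS)), 405, 500_000)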
Pull requests are really welcome in this regard.
Actually, could StaticArrays help the performance here, as well as LoopVectorization?
Actually, I've tried LoopVectorization, and it seems really finicky. I have to avoid StepRanges and calculate the index in each iteration. I also have to remove the functions and manually inline the code, since I think there are some type changes that mess with the unsafe_set_bit_at_index! function.
julia> include("primes_1of2.jl")
julia> using LoopVectorization
julia> function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
           factor = _map_to_factor(factor_index)
           factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
           start_index = _div2(factor * factor)
           iterations = cld(max_index - start_index, factor) - 1
           @turbo for i in 0:iterations
               index = start_index + (factor * i)
               @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
           end
           return arr
       end
It doesn't seem particularly fast, though:
using BenchmarkTools
const x = zeros(UInt, _get_num_uints(1_000_000))
julia> @benchmark clear_factors!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.200 μs … 48.680 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.270 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.360 μs ± 749.920 ns ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  1.2 μs         Histogram: log(frequency) by time          2.71 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark clear_factors_turbo!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  2.010 μs … 64.220 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.110 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.570 μs ± 1.544 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  2.01 μs        Histogram: log(frequency) by time          7.66 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
I'm running out of clues as to how to speed the code up. It's about as close as it can get to the C implementation, but falls a bit behind the new "striped" Rust implementation:
Primes on drag-race [?] took 6s
❯ julia PrimeJulia/solution_3/primes_1of2.jl
Settings: sieve_size = 1000000 | duration = 5
Number of trues: 78498
primes_1of2.jl: Passes: 9580 | Elapsed: 5.00019097328186 | Passes per second: 1915.926821833406 | Average pass duration: 0.0005219406026390251
louie-github_port_1of2;9580;5.00019097328186;1;algorithm=base,faithful=yes,bits=1
Primes on drag-race [?] took 5s
❯ ./PrimeC/solution_2/sieve_1of2
danielspaangberg_1of2;10373;5.000004;1;algorithm=base,faithful=yes,bits=1
Primes/PrimeRust/solution_1 on drag-race is v0.1.0 via rust v1.54.0 took 28s
❯ cargo run --release
...
Computing primes to 1000000 on 1 thread for 5 seconds.
byte-storage Passes: 7777, Threads: 1, Time: 5.0005636215, Average: 0.0006429939, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_byte-storage;7777;5.0005636215;1;algorithm=base,faithful=yes,bits=8
Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage Passes: 7597, Threads: 1, Time: 5.0007038116, Average: 0.0006582472, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage;7597;5.0007038116;1;algorithm=base,faithful=yes,bits=1
Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-rotate Passes: 8043, Threads: 1, Time: 5.0002326965, Average: 0.0006216875, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-rotate;8043;5.0002326965;1;algorithm=base,faithful=yes,bits=1
Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped Passes: 12818, Threads: 1, Time: 5.0003485680, Average: 0.0003901036, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped;12818;5.0003485680;1;algorithm=base,faithful=yes,bits=1
Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped-blocks Passes: 17255, Threads: 1, Time: 5.0003576279, Average: 0.0002897918, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks;17255;5.0003576279;1;algorithm=base,faithful=yes,bits=1
bit-storage-striped-blocks-small Passes: 16040, Threads: 1, Time: 5.0003647804, Average: 0.0003117434, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks-small;16040;5.0003647804;1;algorithm=base,faithful=yes,bits=1
...
You shouldn't have to. This isn't especially well tested, but there are a couple. If it doesn't work, that is a bug. Please file an issue.
Give me an example I can copy and paste to run (either here or as an issue at LoopVectorization), and I'll look at it.
However, this line:
arr[(index + 63) >> _div_uint_size_shift] |=
would require both a gather and a scatter.
Without AVX512, the scatter is emulated by breaking the vector up into a series of scalar operations, which is slower than a real scatter, and also slower than plain scalar operations that never need to extract elements from a vector.
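Conceptually, each SIMD step of that store has to do something like this (a plain-Julia sketch of the data movement, not what LoopVectorization actually generates):

# One SIMD "iteration" of  arr[chunk(i)] |= mask(i)  over a vector of indices:
arr     = zeros(UInt, 128)
indices = [201, 403, 605, 807]                  # one vector's worth of loop indices
chunks  = (indices .+ 63) .>> 6                 # which UInt64 each bit lands in
vals    = arr[chunks]                           # gather: load from scattered addresses
vals  .|= UInt(1) .<< ((indices .- 1) .& 63)    # vectorized OR with the bit masks
arr[chunks] = vals                              # scatter: store back to scattered addresses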
Given that the actual amount of computation is very small, my guess is that not being SIMD is faster, and that clear_factors! isn't SIMD.
But I'd have to look at the code, and having something I can copy/paste would make that a lot easier for me.
I didn't actually expect LoopVectorization to make the code any faster since, as you said, it isn't very conducive to SIMD. It has a large stride between array elements. Here's the code I used to test it, similar to what I posted before:
using BenchmarkTools
using LoopVectorization
# Auxiliary functions
begin
    const _uint_bit_length = sizeof(UInt) * 8
    const _div_uint_size_shift = Int(log2(_uint_bit_length))
    @inline _mul2(i::Integer) = i << 1
    @inline _div2(i::Integer) = i >> 1
    @inline _map_to_index(i::Integer) = _div2(i - 1)
    @inline _map_to_factor(i::Integer) = _mul2(i) + 1
    @inline _mod_uint_size(i::Integer) = i & (_uint_bit_length - 1)
    @inline _div_uint_size(i::Integer) = i >> _div_uint_size_shift
    @inline _get_chunk_index(i::Integer) = _div_uint_size(i + (_uint_bit_length - 1))
    @inline _get_bit_index_mask(i::Integer) = UInt(1) << _mod_uint_size(i - 1)
end
# Main code
function clear_factors_while!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    index = _div2(factor * factor)
    while index <= max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
        index += factor
    end
    return arr
end

function clear_factors_simd!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    @simd for index in _div2(factor * factor):factor:max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
    end
    return arr
end

function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
    start_index = _div2(factor * factor)
    iterations = cld(max_index - start_index, factor) - 1
    @turbo for i in 0:iterations
        index = start_index + (factor * i)
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end
println(
    clear_factors_while!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_simd!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_turbo!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000)
)
const x = zeros(UInt, cld(500_000, sizeof(UInt) * 8))
@benchmark clear_factors_while!(x, 202, 500_000)
@benchmark clear_factors_simd!(x, 202, 500_000)
@benchmark clear_factors_turbo!(x, 202, 500_000)
Changing the function I defined above to use a StepRange causes this to happen:
function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
    @turbo for index in _div2(factor * factor):factor:max_index
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end
julia> clear_factors_turbo!(x, 202, 500_000)
ERROR: MethodError: no method matching vmul_nsw(::Static.StaticInt{4})
Closest candidates are:
vmul_nsw(::Static.StaticInt{M}, ::Static.StaticInt{N}) where {M, N} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:37
vmul_nsw(::Static.StaticInt{M}, ::T) where {M, T<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:40
vmul_nsw(::Static.StaticInt{M}, ::VectorizationBase.MM{W, X, I} where I<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, Static.StaticInt}) where {M, W, X} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/vector_width.jl:56
...
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
[2] _turbo_!
@ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
[3] clear_factors_turbo!(arr::Vector{UInt64}, factor_index::Int64, max_index::Int64)
@ Main ./REPL[13]:4
[4] top-level scope
@ REPL[17]:1
The rules he spelled out in episode 2 specified single-threaded code only. Unless he's changed them since then.
That now works on the latest release, and I added it to the tests.
Here are my benchmark results on a laptop with AVX512:
julia> @benchmark clear_factors_while!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  504.907 ns … 1.173 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     517.402 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   515.285 ns ± 15.055 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  505 ns          Histogram: frequency by time           568 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark clear_factors_simd!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min … max):  624.531 ns … 1.735 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     627.411 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   633.860 ns ± 22.444 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  625 ns         Histogram: log(frequency) by time          704 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark clear_factors_turbo!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min … max):  498.041 ns … 3.463 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     502.067 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   506.635 ns ± 31.779 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  498 ns         Histogram: log(frequency) by time          563 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark clear_factors_turbo_steprange!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min … max):  505.534 ns … 1.396 μs   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     508.373 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   511.387 ns ± 20.638 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  [histogram omitted]
  506 ns         Histogram: log(frequency) by time          583 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Linux perf results for each:
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(clear_factors_while!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.50e+09  100.0%  #  4.7 cycles per ns
└ task-clock               5.35e+08  100.0%  # 534.8 ms
┌ instructions             7.24e+09  100.0%  #  2.9 insns per cycle
│ branch-instructions      1.03e+09  100.0%  # 14.3% of instructions
└ branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
┌ L1-dcache-load-misses    6.40e+08  100.0%  # 61.9% of dcache loads
│ L1-dcache-loads          1.03e+09  100.0%
│ cache-misses             9.33e+02  100.0%  # 89.5% of cache references
└ cache-references         1.04e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(clear_factors_simd!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.28e+09  100.0%  #  4.7 cycles per ns
└ task-clock               4.88e+08  100.0%  # 487.9 ms
┌ instructions             7.34e+09  100.0%  #  3.2 insns per cycle
│ branch-instructions      1.06e+09  100.0%  # 14.4% of instructions
└ branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
┌ L1-dcache-load-misses    6.42e+08  100.0%  # 61.5% of dcache loads
│ L1-dcache-loads          1.04e+09  100.0%
│ cache-misses             1.02e+03  100.0%  # 82.7% of cache references
└ cache-references         1.23e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(clear_factors_turbo!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.30e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.03e+08  100.0%  # 502.5 ms
┌ instructions             2.02e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      7.20e+07  100.0%  #  3.6% of instructions
└ branch-misses            1.54e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.68e+08  100.0%  # 480.5% of dcache loads
│ L1-dcache-loads          1.39e+08  100.0%
│ cache-misses             9.03e+02  100.0%  # 82.6% of cache references
└ cache-references         1.09e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(clear_factors_turbo_steprange!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.34e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.12e+08  100.0%  # 512.4 ms
┌ instructions             2.13e+09  100.0%  #  0.9 insns per cycle
│ branch-instructions      1.62e+08  100.0%  #  7.6% of instructions
└ branch-misses            5.57e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.65e+08  100.0%  # 429.3% of dcache loads
│ L1-dcache-loads          1.55e+08  100.0%
│ cache-misses             1.13e+03  100.0%  # 73.4% of cache references
└ cache-references         1.54e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
With @benchmark, @simd was the slowest, but when running it a million times under perf it was the fastest.
As we expected, the @turbo versions executed far fewer instructions: just over 2 billion vs. over 7 billion for the other two. However, some of the instructions executed (gather and scatter) are slow, so this didn't translate into an actual performance advantage.
The turbo version also naturally had far fewer branches, but that wasn't really a factor either.
Also, lots of cache misses.
x86_64 CPUs without AVX512 don't have scatter instructions, so they wouldn't even have the advantage of fewer instructions executed.
LoopVectorization isn't unrolling the steprange version by default. Unrolling more doesn't really seem to help performance either, e.g. unrolling 4x:
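For reference, the 4x-unrolled variant is essentially the steprange version with an explicit unroll hint, something along these lines (a sketch, not necessarily the exact definition used for the numbers below):

function clear_factors_turbo_steprange_u4!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    # unroll=4 asks LoopVectorization to unroll 4x on top of vectorizing.
    @turbo unroll=4 for index in _div2(factor * factor):factor:max_index
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end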
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(clear_factors_turbo_steprange_u4!, 1_000_000, x, 202, 500_000)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               2.29e+09  100.0%  #  4.6 cycles per ns
└ task-clock               5.02e+08  100.0%  # 501.6 ms
┌ instructions             1.73e+09  100.0%  #  0.8 insns per cycle
│ branch-instructions      6.80e+07  100.0%  #  3.9% of instructions
└ branch-misses            2.89e+03  100.0%  #  0.0% of branch instructions
┌ L1-dcache-load-misses    6.55e+08  100.0%  # 419.9% of dcache loads
│ L1-dcache-loads          1.56e+08  100.0%
│ cache-misses             9.92e+02  100.0%  # 82.7% of cache references
└ cache-references         1.20e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
The steprange version avoids the integer multiplication. SIMD integer multiplication is slow, but that doesn't matter here since we're totally bottlenecked on the memory accesses.
It doesn't make sense how Zig and Rust are so much faster than C. The backend for most of these languages is LLVM, so the faster ones should at least be on par with each other.
"by implementing the same algorithm … we're able to measure just how performant the code generated by each language really is"
No arguing there. I guess I better brush up on Java and C++.
The video also supplies some really interesting facts about Java, C++, etc.
Unfortunately, this is not a very scientific video; there is no analysis of the compiler flags, the code, or the generated assembly, so we can't tell what actually happened here. For Zig to come out above C, some big optimization like SIMD probably kicked in automatically, and that's all it is. Not only may such a benefit change from one computer architecture to another, but knowing that this is the difference, you could probably just nudge the code in the other languages to trigger the same optimization.
Presumably there's a resource somewhere outlining what the algorithm standards are. I haven't been able to find it; there are no obvious links on the GitHub page. I'm sure there is one, because submissions are marked faithful vs. deviation, among other details. I'm also fairly certain that submissions with third-party dependencies are not faithful, judging from the description of one of the Python submissions.
From the reports page, the fastest recent submission is a C++ deviation that heavily uses constexpr to evade the runtime benchmark. I don't know the algorithm or C++ well enough to really understand how much, but there are passes being counted during runtime.
The fastest (and faithful) Julia submission was written by GordonBGood, who also implemented it in a handful of other languages a couple of years ago. The Julia submissions look very different from one another, so I'm not sure whether the (faithful) submissions are restricted to a standard algorithm tightly enough to serve as useful language benchmarks rather than algorithm benchmarks.
Looking at the discussion under the clip on YouTube, it seems they used a strange metric that benefited the Zig implementation, since it used all cores instead of all threads. Someone mentioned they made a similar change to the Rust version and it was then better than Zig.
I have a vague memory that he goes into a bit more detail about what he wants the implementation to be in one of his earlier videos on the topic.