[YouTube/GitHub] What is the FASTEST Computer Language? 45 Languages Tested

A pretty cool project which includes Julia:

5 Likes

The Julia code heโ€™s running is here

He spells out the rules for the code in episode 2. Single threaded and prefers result as a bit vector. Algorithm faithful to the C algorithm.

SIMD would be fair game, though.

2 Likes

Just a little update, we seem to have three solutions now for Julia, and it would be great if anyone else could come and take a look, maybe help improve the current solutions! Letโ€™s see how close Julia can get to the current kings (probably C, surprisingly Zig)

Full disclosure: I am the โ€œauthorโ€ of solution_3, which uses @simd, bitwise operations, and other tricks similar to PrimeC/sieve_1of2.c, which I based that solution on.

Before solution_1 was updated, here is a small snippet from a benchmark run on my machine (Intel Core i5-9300H, 24GB RAM, Docker under Ubuntu 20.04 under WSL under Windows 10 Home):

After solution_1 was updated, it seems to perform a bit slower than solution_2 (I unfortunately do not have a full benchmark run of this yet):

Primes/PrimeJulia/solution_1 on ๎‚  drag-race via เฎƒ v1.6.2 took 5s
โฏ julia PrimeSieveJulia.jl
true
Passes: 5744, Time: 5.000568866729736, Avg: 0.0008705725742913887, Limit: 1000000, Count: 78498, Valid: true

dcbi;5744;5.000568866729736;1;algorithm=base,faithful=yes,bits=1

Another snippet of my benchmark runs (filtered for base-faithful-1bit solutions):

1 Like

On a related note, in this inner loop in solution_3:

    @simd for index in _div2(factor * factor):factor:max_index
        unsafe_set_bit_at_index!(arr, index)
    end

It seems that @simd can actually make the code slower on older machines (such as an older Windows 10 PC or the Raspberry Pi). Could anyone explain why that is?

How significant is compile-time for the Julia results? Probably compile-time is included for Julia, but not for C, Rust, etc.
To make a fair comparison, the Julia code should be called once before the measurement starts. Alternatively (if this is not allowed), a sysimage could be created in advance.

Is it possible to speed up calculation further with LoopVectorization.jl?

3 Likes

I canโ€™t speak for the other solutions, but I tried to take compile-time into account for the Julia results by calling the run_sieve! function once before the main timed code, but I might not have fully accounted for it.

I did try to create a custom sysimage and try package precompilation, but the sysimage took forever to create and it didnโ€™t seem to affect the results much. Package precompilation also seemed to have little effect, maybe just making the results a bit more consistent? Iโ€™m not sure.

Is it possible to speed up calculation further with LoopVectorization.jl?

Maybe, but one of the rules is that there should be no external dependencies to actually calculate the sieve. Plus, bit arrays need to make sure that no two chunks are accessed at the same time. I think you can implement that, though, by checking that the step in the factor clearing loop is greater than one chunk.

Pull requests are really welcome in this regard.

2 Likes

Actually, could StaticArrays help the performance here, as well as LoopVectorization?

1 Like

Actually, Iโ€™ve tried LoopVectorization, and it seems really finicky. I have to avoid StepRanges and calculate the index with each iteration. I also have to remove the functions and manually inline the code since I think thereโ€™s some type changes that mess with the function unsafe_set_bit_at_index!.

julia> include("primes_1of2.jl")

julia> using LoopVectorization

julia> function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
           factor = _map_to_factor(factor_index)
           factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
           start_index = _div2(factor * factor)
           iterations = cld(max_index - start_index, factor) - 1
           @turbo for i in 0:iterations
               index = start_index + (factor * i)
               @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
           end
           return arr
       end

Although, it doesnโ€™t seem particularly fast:

using BenchmarkTools
const x = zeros(UInt, _get_num_uints(1_000_000))

julia> @benchmark clear_factors!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min โ€ฆ max):  1.200 ฮผs โ€ฆ  48.680 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     1.270 ฮผs               โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   1.360 ฮผs ยฑ 749.920 ns  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

  โ–ˆโ–†โ–…โ–†โ–†โ–‡โ–ƒโ–…โ–‚                                                   โ–‚
  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–โ–…โ–…โ–„โ–„โ–„โ–„โ–ƒโ–ƒโ–„โ–…โ–„โ–„โ–ƒโ–ƒโ–…โ–„โ–„โ–„โ–†โ–…โ–†โ–‡โ–‡โ–ˆโ–‡โ–ˆโ–‡โ–‡โ–‡โ–ˆโ–ˆโ–‡โ–ˆโ–‡โ–‡โ–†โ–‡โ–†โ–†โ–‡โ–†โ–†โ–†โ–…โ–†โ–‡โ–† โ–ˆ
  1.2 ฮผs       Histogram: log(frequency) by time      2.71 ฮผs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo!(x, 102, 500_000)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min โ€ฆ max):  2.010 ฮผs โ€ฆ 64.220 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     2.110 ฮผs              โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   2.570 ฮผs ยฑ  1.544 ฮผs  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

  โ–ˆโ–†โ–„โ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–                                          โ–
  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–‡โ–‡โ–†โ–‡โ–‡โ–†โ–‡โ–†โ–†โ–†โ–†โ–†โ–…โ–…โ–†โ–…โ–…โ–…โ–„โ–…โ–…โ–ƒโ–„โ–„โ–„โ–„โ–‚โ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–ƒโ–‚โ–‚โ–„ โ–ˆ
  210 ฮผs       Histogram: log(frequency) by time     7.66 ฮผs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Iโ€™m running out of clues as to how to speed the code up. Itโ€™s as close as it can get for the C implementation, but falls a bit behind the new โ€œstripedโ€ Rust implementation:

Primes on ๎‚  drag-race [?] took 6s
โฏ julia PrimeJulia/solution_3/primes_1of2.jl
Settings: sieve_size = 1000000 | duration = 5
Number of trues: 78498
primes_1of2.jl: Passes: 9580 | Elapsed: 5.00019097328186 | Passes per second: 1915.926821833406 | Average pass duration: 0.0005219406026390251
louie-github_port_1of2;9580;5.00019097328186;1;algorithm=base,faithful=yes,bits=1

Primes on ๎‚  drag-race [?] took 5s
โฏ ./PrimeC/solution_2/sieve_1of2
danielspaangberg_1of2;10373;5.000004;1;algorithm=base,faithful=yes,bits=1
Primes/PrimeRust/solution_1 on ๎‚  drag-race is ๐Ÿ“ฆ v0.1.0 via ๐Ÿฆ€ v1.54.0 took 28s
โฏ cargo run --release
...

Computing primes to 1000000 on 1 thread for 5 seconds.
byte-storage    Passes: 7777, Threads: 1, Time: 5.0005636215, Average: 0.0006429939, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_byte-storage;7777;5.0005636215;1;algorithm=base,faithful=yes,bits=8

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage     Passes: 7597, Threads: 1, Time: 5.0007038116, Average: 0.0006582472, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage;7597;5.0007038116;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-rotate Passes: 8043, Threads: 1, Time: 5.0002326965, Average: 0.0006216875, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-rotate;8043;5.0002326965;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped Passes: 12818, Threads: 1, Time: 5.0003485680, Average: 0.0003901036, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped;12818;5.0003485680;1;algorithm=base,faithful=yes,bits=1

Computing primes to 1000000 on 1 thread for 5 seconds.
bit-storage-striped-blocks Passes: 17255, Threads: 1, Time: 5.0003576279, Average: 0.0002897918, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks;17255;5.0003576279;1;algorithm=base,faithful=yes,bits=1
bit-storage-striped-blocks-small Passes: 16040, Threads: 1, Time: 5.0003647804, Average: 0.0003117434, Limit: 1000000, Counts: 78498, Valid: Pass
mike-barber_bit-storage-striped-blocks-small;16040;5.0003647804;1;algorithm=base,faithful=yes,bits=1

...
1 Like

You shouldnโ€™t have to. This isnโ€™t especially well tested, but there are a couple. If it doesnโ€™t work, that is a bug. Please file an issue.

Give me an example I can copy and paste to run (either here or as an issue at LoopVectorization), and Iโ€™ll look at it.
However, this line:
arr[(index + 63) >> _div_uint_size_shift] |=
would require both a gather and a scatter.
Without AVX512, the scatter is done by breaking up the vector into a series of scalar operations, which is slower than a scatter, and also slower than scalar operations without the need for extracting elements from a vector.

Given that the actual amount of computation is very small, my guess is that not being SIMD is fvaster, and that clear_factors! isnโ€™t SIMD.
But Iโ€™d have to look at the code, and having something I can copy/paste would make that a lot easier for me.

8 Likes

I didnโ€™t actually expect LoopVectorization to make the code any faster since it isnโ€™t very conducive to SIMD, as you said. It has a large stride between array elements. Hereโ€™s the code I used to test it, similar to what I posted before:

using BenchmarkTools
using LoopVectorization

# Auxillary functions
begin
const _uint_bit_length = sizeof(UInt) * 8
const _div_uint_size_shift = Int(log2(_uint_bit_length))
@inline _mul2(i::Integer) = i << 1
@inline _div2(i::Integer) = i >> 1
@inline _map_to_index(i::Integer) = _div2(i - 1)
@inline _map_to_factor(i::Integer) = _mul2(i) + 1
@inline _mod_uint_size(i::Integer) = i & (_uint_bit_length - 1)
@inline _div_uint_size(i::Integer) = i >> _div_uint_size_shift
@inline _get_chunk_index(i::Integer) = _div_uint_size(i + (_uint_bit_length - 1))
@inline _get_bit_index_mask(i::Integer) = UInt(1) << _mod_uint_size(i - 1)
end

# Main code
function clear_factors_while!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    index = _div2(factor * factor)
    while index <= max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
        index += factor
    end
    return arr
end

function clear_factors_simd!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    @simd for index in _div2(factor * factor):factor:max_index
        @inbounds arr[_get_chunk_index(index)] |= _get_bit_index_mask(index)
    end
    return arr
end

function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory dependencies")
    start_index = _div2(factor * factor)
    iterations = cld(max_index - start_index, factor) - 1
    @turbo for i in 0:iterations
        index = start_index + (factor * i)
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end

println(
    clear_factors_while!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_simd!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
    clear_factors_turbo!(zeros(UInt, cld(500_000, _uint_bit_length)), 202, 500_000) ==
)

const x = zeros(UInt, cld(500_000, sizeof(UInt) * 8))
@benchmark clear_factors_while!(x, 202, 500_000)
@benchmark clear_factors_simd!(x, 202, 500_000)
@benchmark clear_factors_turbo!(x, 202, 500_000)

Changing the function I defined above to use a StepRange causes this to happen:

function clear_factors_turbo!(arr::Vector{UInt}, factor_index::Integer, max_index::Integer)
    factor = _map_to_factor(factor_index)
    factor < _uint_bit_length && error("Factor must be greater than UInt bit length to avoid memory pendencies")
    @turbo for index in _div2(factor * factor):factor:max_index
        @inbounds arr[(index + 63) >> _div_uint_size_shift] |= 1 << ((index - 1) & 63)
    end
    return arr
end

julia> clear_factors_turbo!(x, 202, 500_000)
ERROR: MethodError: no method matching vmul_nsw(::Static.StaticInt{4})
Closest candidates are:
  vmul_nsw(::Static.StaticInt{M}, ::Static.StaticInt{N}) where {M, N} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:37
  vmul_nsw(::Static.StaticInt{M}, ::T) where {M, T<:Union{Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/static.jl:40
  vmul_nsw(::Static.StaticInt{M}, ::VectorizationBase.MM{W, X, I} where I<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8, Static.StaticInt}) where {M, W, X} at /home/louie/.julia/packages/VectorizationBase/PFhof/src/vector_width.jl:56
  ...
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
 [2] _turbo_!
   @ ~/.julia/packages/LoopVectorization/XZpZE/src/reconstruct_loopset.jl:711 [inlined]
 [3] clear_factors_turbo!(arr::Vector{UInt64}, factor_index::Int64, max_index::Int64)
   @ Main ./REPL[13]:4
 [4] top-level scope
   @ REPL[17]:1

The rules he spelled out in episode 2 specified single thread code only. Unless heโ€™s changed it since then.

That now works on the latest release, and I added it to the tests.

Here are my benchmark results on a laptop with AVX512:

julia> @benchmark clear_factors_while!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min โ€ฆ max):  504.907 ns โ€ฆ  1.173 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     517.402 ns              โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   515.285 ns ยฑ 15.055 ns  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

   โ–†โ–†        โ–‚โ–ˆโ–„
  โ–„โ–ˆโ–ˆโ–ƒโ–‚โ–ƒโ–ƒโ–‚โ–‚โ–‚โ–‚โ–ˆโ–ˆโ–ˆโ–‚โ–ƒโ–„โ–„โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚โ–‚ โ–ƒ
  505 ns          Histogram: frequency by time          568 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_simd!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 175 evaluations.
 Range (min โ€ฆ max):  624.531 ns โ€ฆ  1.735 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     627.411 ns              โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   633.860 ns ยฑ 22.444 ns  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

  โ–„โ–ˆโ–‡ โ–ƒโ–„โ–‚  โ–โ–โ–…โ–†โ–ƒโ–โ–ƒโ–‚โ–                                         โ– โ–‚
  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–†โ–…โ–†โ–„โ–„โ–ƒโ–†โ–„โ–…โ–„โ–†โ–ƒโ–ƒโ–„โ–…โ–…โ–„โ–„โ–ƒโ–ƒโ–โ–„โ–…โ–„โ–„โ–ƒโ–…โ–„โ–ƒโ–โ–ƒโ–ƒโ–โ–„โ–†โ–ƒโ–ƒโ–„โ–‡โ–ˆ โ–ˆ
  625 ns        Histogram: log(frequency) by time       704 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 194 evaluations.
 Range (min โ€ฆ max):  498.041 ns โ€ฆ  3.463 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     502.067 ns              โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   506.635 ns ยฑ 31.779 ns  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

   โ–…โ–ˆโ–ˆโ–†โ–‚โ–ƒโ–„โ–‚โ– โ–‚โ–…โ–†โ–…โ–‚โ–โ–‚โ–‚โ–                                       โ– โ–‚
  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‡โ–†โ–†โ–†โ–…โ–„โ–…โ–„โ–…โ–„โ–„โ–…โ–…โ–†โ–…โ–„โ–…โ–„โ–…โ–…โ–„โ–„โ–„โ–…โ–„โ–„โ–ƒโ–…โ–…โ–…โ–…โ–ƒโ–โ–„โ–ƒโ–…โ–„โ–…โ–ˆโ–ˆ โ–ˆ
  498 ns        Histogram: log(frequency) by time       563 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark clear_factors_turbo_steprange!(x, 202, 500_000)
BenchmarkTools.Trial: 10000 samples with 193 evaluations.
 Range (min โ€ฆ max):  505.534 ns โ€ฆ  1.396 ฮผs  โ”Š GC (min โ€ฆ max): 0.00% โ€ฆ 0.00%
 Time  (median):     508.373 ns              โ”Š GC (median):    0.00%
 Time  (mean ยฑ ฯƒ):   511.387 ns ยฑ 20.638 ns  โ”Š GC (mean ยฑ ฯƒ):  0.00% ยฑ 0.00%

  โ–ƒโ–ˆโ–ˆโ–„โ–‚โ–„โ–ƒ                                                      โ–‚
  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–†โ–‡โ–‡โ–†โ–‡โ–„โ–†โ–„โ–…โ–…โ–…โ–†โ–…โ–…โ–…โ–„โ–„โ–„โ–…โ–„โ–„โ–„โ–„โ–…โ–„โ–„โ–„โ–„โ–„โ–…โ–„โ–…โ–…โ–„โ–„โ–โ–ƒโ–ƒโ–„โ–…โ–…โ–‡โ–ˆโ–†โ–…โ–„โ–„โ–ƒโ–„โ–„โ–…โ–„ โ–ˆ
  506 ns        Histogram: log(frequency) by time       583 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Linux perf results for each:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_while!, 1_000_000, x, 202, 500_000)
       end
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”Œ cpu-cycles               2.50e+09  100.0%  #  4.7 cycles per ns
โ”” task-clock               5.35e+08  100.0%  # 534.8 ms
โ”Œ instructions             7.24e+09  100.0%  #  2.9 insns per cycle
โ”‚ branch-instructions      1.03e+09  100.0%  # 14.3% of instructions
โ”” branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
โ”Œ L1-dcache-load-misses    6.40e+08  100.0%  # 61.9% of dcache loads
โ”‚ L1-dcache-loads          1.03e+09  100.0%
โ”‚ cache-misses             9.33e+02  100.0%  # 89.5% of cache references
โ”” cache-references         1.04e+03  100.0%
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_simd!, 1_000_000, x, 202, 500_000)
       end
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”Œ cpu-cycles               2.28e+09  100.0%  #  4.7 cycles per ns
โ”” task-clock               4.88e+08  100.0%  # 487.9 ms
โ”Œ instructions             7.34e+09  100.0%  #  3.2 insns per cycle
โ”‚ branch-instructions      1.06e+09  100.0%  # 14.4% of instructions
โ”” branch-misses            1.00e+06  100.0%  #  0.1% of branch instructions
โ”Œ L1-dcache-load-misses    6.42e+08  100.0%  # 61.5% of dcache loads
โ”‚ L1-dcache-loads          1.04e+09  100.0%
โ”‚ cache-misses             1.02e+03  100.0%  # 82.7% of cache references
โ”” cache-references         1.23e+03  100.0%
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo!, 1_000_000, x, 202, 500_000)
       end
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”Œ cpu-cycles               2.30e+09  100.0%  #  4.6 cycles per ns
โ”” task-clock               5.03e+08  100.0%  # 502.5 ms
โ”Œ instructions             2.02e+09  100.0%  #  0.9 insns per cycle
โ”‚ branch-instructions      7.20e+07  100.0%  #  3.6% of instructions
โ”” branch-misses            1.54e+03  100.0%  #  0.0% of branch instructions
โ”Œ L1-dcache-load-misses    6.68e+08  100.0%  # 480.5% of dcache loads
โ”‚ L1-dcache-loads          1.39e+08  100.0%
โ”‚ cache-misses             9.03e+02  100.0%  # 82.6% of cache references
โ”” cache-references         1.09e+03  100.0%
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo_steprange!, 1_000_000, x, 202, 500_000)
       end
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”Œ cpu-cycles               2.34e+09  100.0%  #  4.6 cycles per ns
โ”” task-clock               5.12e+08  100.0%  # 512.4 ms
โ”Œ instructions             2.13e+09  100.0%  #  0.9 insns per cycle
โ”‚ branch-instructions      1.62e+08  100.0%  #  7.6% of instructions
โ”” branch-misses            5.57e+03  100.0%  #  0.0% of branch instructions
โ”Œ L1-dcache-load-misses    6.65e+08  100.0%  # 429.3% of dcache loads
โ”‚ L1-dcache-loads          1.55e+08  100.0%
โ”‚ cache-misses             1.13e+03  100.0%  # 73.4% of cache references
โ”” cache-references         1.54e+03  100.0%
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

With @benchmark, @simd was the slowest, but when timing running it a million times it was the fastest.

As we expected, the @turbo versions executed many less instructions โ€“ just over 2 billion vs over 7 billion for the other two. However, some of the instructions executed (gather and scatter) are slow, and thus this didnโ€™t translate to an actual advantage in performance.
The turbo version also naturally had many fewer branches, but that wasnโ€™t really a factor either.
Also, lots of cache misses.

x86_64 cpus without AVX512 donโ€™t have scatter instructions, so they wouldnโ€™t have the advantage in fewer instructions executed.

LoopVectorization isnโ€™t unrolling the steprange version by default. Unrolling more doesnโ€™t really seem to help performance, e.g. unrolling 4x:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(clear_factors_turbo_steprange_u4!, 1_000_000, x, 202, 500_000)
       end
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”
โ”Œ cpu-cycles               2.29e+09  100.0%  #  4.6 cycles per ns
โ”” task-clock               5.02e+08  100.0%  # 501.6 ms
โ”Œ instructions             1.73e+09  100.0%  #  0.8 insns per cycle
โ”‚ branch-instructions      6.80e+07  100.0%  #  3.9% of instructions
โ”” branch-misses            2.89e+03  100.0%  #  0.0% of branch instructions
โ”Œ L1-dcache-load-misses    6.55e+08  100.0%  # 419.9% of dcache loads
โ”‚ L1-dcache-loads          1.56e+08  100.0%
โ”‚ cache-misses             9.92e+02  100.0%  # 82.7% of cache references
โ”” cache-references         1.20e+03  100.0%
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”

The steprange version avoids the integer multiplication. SIMD integer multiplication is slow, but that doesnโ€™t matter here since weโ€™re totally bottlenecked on the memory accesses.

3 Likes