# @inbounds: is the compiler now so smart that this is no longer necessary?

In tests of FinEtools, the code with ALL @inbounds removed is faster than the code with them.


Apparently, it depends. For large arrays, it appears to make no difference, but for smaller ones it does. Try the following:

``````
using LinearAlgebra
using BenchmarkTools

function norm2(x)
    s = 0.0
    for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm2bis(x)
    s = 0.0
    n = length(x)
    for i in 1:n
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm3(x)
    s = 0.0
    @inbounds for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end
function norm4(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

a = randn(10000)

@btime norm($a)
@btime norm2($a)
@btime norm2bis($a)
@btime norm3($a)
@btime norm4($a)
``````

I get

``````
  3.004 μs (0 allocations: 0 bytes)
 11.807 μs (0 allocations: 0 bytes)
 11.802 μs (0 allocations: 0 bytes)
 11.150 μs (0 allocations: 0 bytes)
710.554 ns (0 allocations: 0 bytes)
``````

so no performance penalty. OTOH for size 100, there's a ~30% penalty for the non-inbounds versions. Also the SIMD version is scandalously fast (possibly some fastmath magic there). That's on the official linux binaries of the beta with -O3 (non-O3 results are similar). Haven't played with the system image, though.
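To check the "fastmath magic" guess, one can compare a `@fastmath`-annotated loop against the `@simd` one; a minimal sketch (the `norm5` name is mine, not from the thread):

``````
# Like norm4, but using @fastmath instead of @simd. @fastmath licenses the same
# reassociation of the floating-point reduction that @simd does, so together
# with @inbounds it typically vectorizes similarly; without @inbounds the
# bounds checks still block vectorization.
function norm5(x)
    s = 0.0
    @inbounds @fastmath for i in eachindex(x)
        s += x[i]*x[i]
    end
    sqrt(s)
end

@btime norm5($a)
``````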


What a delightful phrase!

For @antoine-levitt's test, I get the following on v0.7 for n == 10000 (no -O3):

3.5 μs
11.2 μs
11.2 μs
8.4 μs
2.9 μs

On v0.6, I get

3.5 μs
8.4 μs
8.4 μs
8.7 μs
2.9 μs

So, I'm seeing a regression in v0.7 on norm2 and norm2bis. I see the same ratios with n == 100.

``````
julia> @btime norm($a)
2.700 μs (0 allocations: 0 bytes)
100.08555723165735

julia> @btime norm2($a)
8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm2bis($a)
8.907 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm3($a)
8.901 μs (0 allocations: 0 bytes)
100.08555723165712

julia> @btime norm4($a)
407.545 ns (0 allocations: 0 bytes)
100.08555723165735
``````

Re "scandalously fast", for comparison:

``````
julia> z = similar(a);

julia> @btime copyto!($z, $a);
1.376 μs (0 allocations: 0 bytes)
``````

possibly some fastmath magic there

Here is part of `@code_native norm4(a)` on a test computer:

``````
; Location: array.jl:705
L80: # move data from vector onto CPU registers 4, 5, 6, and 7
vmovupd	-192(%rsi), %zmm4
vmovupd	-128(%rsi), %zmm5
vmovupd	-64(%rsi), %zmm6
vmovupd	(%rsi), %zmm7
;}}
; Function macro expansion; {
; Location: float.jl:395 # square and add 4,5,6, and 7 to 0,1,2, and 3.
vfmadd231pd	%zmm4, %zmm4, %zmm0
vfmadd231pd	%zmm5, %zmm5, %zmm1
vfmadd231pd	%zmm6, %zmm6, %zmm2
vfmadd231pd	%zmm7, %zmm7, %zmm3
;}
``````

Basically, it unrolls the loop by a factor of 4*SIMD length. On each iteration of the loop, it loads 4 SIMD vectors, squares them, and adds the results to four separate accumulator vectors (one fused multiply-add per vector).
Once the loop ends, it sums those accumulators.
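In Julia pseudocode the pattern is roughly the following (a sketch of my own, with the leftover handling omitted; the real code does this on 8-wide registers, so 32 doubles per trip):

``````
# Four independent accumulators, updated with fused multiply-adds, combined
# only after the loop; remainder elements are ignored here for brevity.
function norm4_unrolled(x)
    s1 = s2 = s3 = s4 = 0.0
    i = firstindex(x)
    @inbounds while i + 3 <= lastindex(x)
        s1 = muladd(x[i],     x[i],     s1)
        s2 = muladd(x[i + 1], x[i + 1], s2)
        s3 = muladd(x[i + 2], x[i + 2], s3)
        s4 = muladd(x[i + 3], x[i + 3], s4)
        i += 4
    end
    sqrt((s1 + s2) + (s3 + s4))
end
``````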

Check out this post by Keno:
https://juliacomputing.com/blog/2017/09/27/auto-vectorization-in-julia.html

EDIT:
Using a different computer:

``````
julia> @btime norm($a)
4.833 μs (0 allocations: 0 bytes)
100.37241645291456

julia> @btime norm2($a)
7.246 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm2bis($a)
7.204 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm3($a)
7.201 μs (0 allocations: 0 bytes)
100.37241645291459

julia> @btime norm4($a)
766.965 ns (0 allocations: 0 bytes)
100.37241645291456
``````

FWIW, running `julia --math-mode=fast` yields `@simd`-like results when `@inbounds` is present, and does not improve things if not.

It does not seem easy to devise a policy for writing generic code that gets fast results while preserving reproducibility. Julia Base and the stdlib seem inconsistent in the annotations they use (sometimes inbounds, sometimes simd, sometimes both). A sensible default is to write `@inbounds` in critical loops, and to use `math-mode=fast` for fast performance. It's a shame because those annotations are pretty ugly (and in particular destroy the nice "for at beginning of block" visual cue we've all been getting used to). Or just don't have any annotations, and run with `check-bounds=no`.
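For what it's worth, the visual cue can be kept by wrapping the loop in an `@inbounds begin ... end` block instead of prefixing the loop itself; a style sketch (not from the thread):

``````
function norm3_block(x)
    s = 0.0
    @inbounds begin
        # the `for` still starts its own line
        for i in eachindex(x)
            s += x[i]*x[i]
        end
    end
    sqrt(s)
end
``````

(`@simd`, on the other hand, has to be attached directly to the `for` loop.)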

However, it's not clear to me whether check-bounds and math-mode apply to all the code, or only to non-precompiled user code.

Some official guidelines (e.g. in "performance tips") would be very helpful here. Barring that, the ecosystem just uses random conventions, meaning that in the default mode (and even with check-bounds and math-mode) you can be sure neither that your results will be the same from machine to machine nor that you are getting optimal performance.


Why isn't `copyto!` using SIMD?

Even if it is, it still has to write to memory, which is slower than reading the cache.

Here is a little experiment:

``````
module moinbounds13
using FinEtools
using LinearAlgebra: Transpose, mul!
using BenchmarkTools
# Loop version WITHOUT @inbounds. (The loop body was lost in the original post;
# it is reconstructed here from the comment: upper triangle of Ke += gradN*mult*gradN'.)
function add_mggt_ut_only_wo!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx::FInt = 0; nx::FInt = 0; px::FInt = 0
    for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] += gradN[mx, px] * mult * gradN[nx, px]
            end
        end
    end
    return true
end
# Same loops WITH @inbounds.
function add_mggt_ut_only!(Ke, gradN, mult)
    @assert size(Ke, 1) == size(Ke, 2)
    Kedim = size(Ke, 1)
    nne, mdim = size(gradN)
    mx::FInt = 0; nx::FInt = 0; px::FInt = 0
    @inbounds for nx = 1:Kedim
        for px = 1:mdim
            for mx = 1:nx # only the upper triangle
                Ke[mx, nx] += gradN[mx, px] * mult * gradN[nx, px]
            end
        end
    end
    return true
end
function test()
    gradN = rand(3, 2)
    Ke = fill(0.0, size(gradN, 1), size(gradN, 1))
    # Four timings, matching the four results in the next block (the exact
    # mul!-based variants benchmarked in the original post were not preserved).
    @btime add_mggt_ut_only!($Ke, $gradN, 1.0)
    @btime 1 * mul!($Ke, $gradN, $gradN')
    @btime mul!($Ke, $gradN, Transpose($gradN))
    @btime add_mggt_ut_only_wo!($Ke, $gradN, 1.0)
    true
end
end
using .moinbounds13
moinbounds13.test()
``````

On my Surface Pro 4 it gives:

``````
julia> include("test/test_inbounds.jl")
WARNING: replacing module moinbounds13.
28.141 ns (0 allocations: 0 bytes)
307.359 ns (2 allocations: 320 bytes)
283.688 ns (1 allocation: 160 bytes)
23.092 ns (0 allocations: 0 bytes)
true
``````

So, turning off `@inbounds` (`add_mggt_ut_only_wo!`) results in code that is 18% faster than the version with `@inbounds` on all three loops (`add_mggt_ut_only!`).

The performance of `mul!` is surprising. (Perhaps I'm not using it right?)

Sorry, the comparison in the above example is not quite fair: the `add_mggt_ut_only*!` functions only compute the upper triangle. Completing the lower triangle takes an additional 9 ns.
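(For reference, mirroring the computed upper triangle into the lower one is just a small extra loop; a sketch, with a helper name of my own:)

``````
# Copy the upper triangle of a square matrix into its lower triangle,
# making Ke fully symmetric after the *_ut_only! call.
function complete_lower!(Ke)
    n = size(Ke, 1)
    @inbounds for nx in 1:n, mx in (nx + 1):n
        Ke[mx, nx] = Ke[nx, mx]
    end
    return Ke
end
``````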

The performance of mul! is surprising. (Perhaps I'm not using it right?)

By doing `1*mul!`, you're allocating a whole new matrix.
You could do the following (aside from dropping the multiplication by 1; I'm assuming there are times when you really do want to scale it)...

``````
Ke .= 1.0 .* mul!(Ke, gradN, gradN') # or Transpose(gradN), but they're equivalent for real matrices
``````

Or, if you don't mind using the lower-level BLAS interface, you could try either of the following

``````
BLAS.syrk!('U', 'N', 1.0, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', 1.0, gradN, gradN, 0.0, Ke)
``````

However, there is massive overhead on calling BLAS. With large matrices, that overhead gets amortized.
Using StaticArrays (MMatrices if you want them mutable) will be much faster for small sizes.
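For the 3×2 case above, an SMatrix version might look like this (a sketch; the variable names are mine):

``````
using StaticArrays, BenchmarkTools

# With a fixed 3×2 size the product is fully unrolled at compile time,
# so there is no BLAS call and no dispatch overhead.
gradNs = @SMatrix rand(3, 2)
Kes = 2.3 * (gradNs * gradNs')   # a new 3×3 SMatrix

@btime 2.3 * ($gradNs * $gradNs')
``````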

Anyway, the reason your function is much faster without `@inbounds` is because you need `@inbounds` for simd instructions.
On a computer, here are the machine instructions when you aren't using `@inbounds`:

``````
; Function *; {
; Location: float.jl:399
vmulsd	(%rbx,%rsi,8), %xmm0, %xmm1
vmulsd	(%r12,%rdi,8), %xmm1, %xmm1
;}}
; Function +; {
; Location: float.jl:395
vaddsd	(%rdx,%rsi,8), %xmm1, %xmm1
;}
; Function setindex!; {
; Location: array.jl:745
vmovsd	%xmm1, (%rdx,%rsi,8)
;}
``````

"xmm" registers are 16 bytes (128 bits), but the "8" means it's only using 8 bytes, or a single double, at a time.
When you do add `@inbounds`, even though you didn't add `@simd`, LLVM will automatically vectorize it for you. It does include a chunk of code exactly like the above to pick up "leftovers", but it has a bunch of other code too, including:

``````
; Function *; {
; Location: float.jl:399
vmulpd	-192(%r14,%rbx,8), %zmm1, %zmm3
vmulpd	-128(%r14,%rbx,8), %zmm1, %zmm4
vmulpd	-64(%r14,%rbx,8), %zmm1, %zmm5
vmulpd	(%r14,%rbx,8), %zmm1, %zmm6
vmulpd	%zmm2, %zmm3, %zmm3
vmulpd	%zmm2, %zmm4, %zmm4
vmulpd	%zmm2, %zmm5, %zmm5
vmulpd	%zmm2, %zmm6, %zmm2
;}}
; Function +; {
; Location: float.jl:395
vaddpd	-192(%rcx,%rbx,8), %zmm3, %zmm3
vaddpd	-128(%rcx,%rbx,8), %zmm4, %zmm4
vaddpd	-64(%rcx,%rbx,8), %zmm5, %zmm5
vaddpd	(%rcx,%rbx,8), %zmm2, %zmm2
;}
; Function setindex!; {
; Location: array.jl:745
vmovupd	%zmm3, -192(%rcx,%rbx,8)
vmovupd	%zmm4, -128(%rcx,%rbx,8)
vmovupd	%zmm5, -64(%rcx,%rbx,8)
vmovupd	%zmm2, (%rcx,%rbx,8)
cmpq	%rbx, %r9
jne	L608
;}}
``````

The `zmm` registers are 512 bits, i.e. they hold 8 doubles each. This does the same thing as before, but with 8 doubles per register and 4 registers at a time, meaning you need at least 32 doubles. On computers with avx2, the registers are 256 bits == 4 doubles, for a total of at least 16 doubles to load, multiply, and add per iteration. (Because instructions can overlap, it's faster to do four at a time than to wait until one is done before starting the next. Hence 4 per go.)

Obviously, these matrices are way too small for the above, meaning the entire calculation is just the leftovers, which execute the same way in both versions, with and without `@inbounds`. While I see a few lines with

``````
     movabsq	$jl_bounds_error_ints, %rax
``````

only in the version without `@inbounds`, I guess the bounds checks don't take as long to deal with as all that SIMD stuff that doesn't actually get used.

One advantage of using the statically sized arrays is that the compiler knows exactly how long the calculations will be, meaning it'll only insert that SIMD code if it'll actually get run.

Why isn't copyto! using SIMD?
Even if it is, it still has to write to memory, which is slower than reading the cache.

Yeah, but adding `@simd` actually did help a bit.

``````
julia> function copyloop!(y, x)
           s = 0.0
           @inbounds @simd for i in eachindex(x)
               y[i] = x[i]
           end
       end

julia> @benchmark copyloop!($z, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.368 μs (0.00% GC)
  median time:      1.379 μs (0.00% GC)
  mean time:        1.398 μs (0.00% GC)
  maximum time:     3.948 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark copyto!($z, $a)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.545 μs (0.00% GC)
  median time:      1.842 μs (0.00% GC)
  mean time:        1.838 μs (0.00% GC)
  maximum time:     3.834 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10
``````

The top four curves (slower) represent matrix multiplication and use of `mul!`, both with `.*` and without. The bottom two curves (faster) are explicit loops, and indeed, as @Elrod pointed out, with larger matrices the `@inbounds` loops have a chance of being faster than the loops without.

If one were to expect the BLAS overhead to get amortized with larger matrix sizes, it is taking quite a bit to get there: even for 225 rows and columns of the resulting matrix the explicit loops are still twice as fast as BLAS.
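A sketch of the kind of size sweep behind such curves (my own arrangement; `add_mggt_ut_only!` is the loop version from earlier in the thread, and the sizes are illustrative):

``````
using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(1)
for n in (3, 15, 30, 60, 120, 225)
    gradN = rand(n, 3)
    Ke = zeros(n, n)
    t_blas = @belapsed $Ke .= 2.3 .* mul!($Ke, $gradN, $gradN')
    t_loop = @belapsed add_mggt_ut_only!($Ke, $gradN, 2.3)
    println(n, ": BLAS/loop time ratio = ", round(t_blas / t_loop, digits = 2))
end
``````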

Did you try `syrk!` and `gemm!`?
The following are all equivalent (so long as `typeof(a) == eltype(gradN) == eltype(Ke)`).

``````
Ke .= a .* mul!(Ke, gradN, gradN')
BLAS.syrk!('U', 'N', a, gradN, 0.0, Ke)
BLAS.gemm!('N', 'T', a, gradN, gradN, 0.0, Ke)
``````

The advantage here is that neither `syrk!` nor `gemm!` loops over the data a second time to multiply by `a`.

Test functions:

``````
julia> gradN = rand(3, 2);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> mult = 2.3;

julia> using LinearAlgebra, BenchmarkTools

# Loop version without @inbounds (the loop bodies were lost in the original
# post; they are reconstructed here from the comment: upper triangle of
# Ke += gradN*mult*gradN').
julia> function add_mggt_ut_only_wo!(Ke, gradN, mult)
           @assert size(Ke, 1) == size(Ke, 2)
           Kedim = size(Ke, 1)
           nne, mdim = size(gradN)
           mx = 0; nx = 0; px = 0
           for nx = 1:Kedim # Do: Ce  =  Ce + gradN*((Jac*w[j]))*gradN' ;
               for px = 1:mdim
                   for mx = 1:nx # only the upper triangle
                       Ke[mx, nx] += gradN[mx, px] * mult * gradN[nx, px]
                   end
               end
           end
           Ke
       end

# Same loops with @inbounds on the outer loop.
julia> function add_mggt_ut_only!(Ke, gradN, mult)
           @assert size(Ke, 1) == size(Ke, 2)
           Kedim = size(Ke, 1)
           nne, mdim = size(gradN)
           mx = 0; nx = 0; px = 0
           @inbounds for nx = 1:Kedim
               for px = 1:mdim
                   for mx = 1:nx # only the upper triangle
                       Ke[mx, nx] += gradN[mx, px] * mult * gradN[nx, px]
                   end
               end
           end
           Ke
       end
``````

Running the benchmarks with small matrices:

``````
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN')
334.450 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
0.70059   1.51745  0.629522
1.51745   3.63988  1.33395
0.629522  1.33395  0.56814

julia> fill!(Ke, 0); # Just to confirm the result is the same each time

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke)
299.895 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
0.70059  1.51745  0.629522
0.0      3.63988  1.33395
0.0      0.0      0.56814

julia> fill!(Ke, 0);

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke)
147.368 ns (0 allocations: 0 bytes)
3×3 Array{Float64,2}:
0.70059   1.51745  0.629522
1.51745   3.63988  1.33395
0.629522  1.33395  0.56814

julia> fill!(Ke, 0);

3×3 Array{Float64,2}:
0.70059  1.51745  0.629522
0.0      3.63988  1.33395
0.0      0.0      0.56814

21.483 ns (0 allocations: 0 bytes)

17.165 ns (0 allocations: 0 bytes)
``````

With 250x200, and OpenBLAS

``````
julia> BLAS.set_num_threads(1) # BLAS easily wins with multithreading

julia> gradN = rand(250, 200);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
402.399 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
344.340 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
528.487 μs (0 allocations: 0 bytes)

3.066 ms (0 allocations: 0 bytes)

4.401 ms (0 allocations: 0 bytes)
``````

and MKL:

``````
julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
204.780 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
151.862 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
256.089 μs (0 allocations: 0 bytes)
``````

With threading, gemm! is a lot faster than mul!. I thought mul! was just a wrapper for gemm!, so I'm not sure what is going on. MKL:

``````
julia> BLAS.set_num_threads(10)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
123.680 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
29.612 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
43.061 μs (0 allocations: 0 bytes)
``````

I would have guessed that `mul!` is now slower because the broadcast part isn't multithreaded, but...

``````
julia> @btime mul!($Ke, $gradN, $gradN');
80.795 μs (0 allocations: 0 bytes)
``````
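One way to see what `mul!` actually does here is to ask for the method being called (output elided; on 0.7 the dense case should end up in a BLAS `gemm!` wrapper):

``````
julia> using InteractiveUtils

julia> @which mul!(Ke, gradN, gradN')   # shows the LinearAlgebra method that handles this case

julia> @edit mul!(Ke, gradN, gradN')    # opens its source, to see which BLAS routine it forwards to
``````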

Also, even at 50 I'm seeing dramatically better performance, even with OpenBLAS (which is slower at small/modest matrix sizes than MKL):

``````
julia> gradN = rand(50, 35);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
7.763 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
5.970 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
6.448 μs (0 allocations: 0 bytes)

23.383 μs (0 allocations: 0 bytes)

32.695 μs (0 allocations: 0 bytes)
``````

Could something be wrong with your BLAS install? Maybe it's using generic fallback kernels instead of ones optimized for your processor?
If you're on Julia 0.6.2 or older with a fairly new processor, that's sort of likely. Julia 0.6.2's OpenBLAS didn't support Ryzen, for example.
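A quick way to check which BLAS got loaded (on 0.7; `BLAS.vendor()` still exists there):

``````
julia> using LinearAlgebra, Libdl

julia> BLAS.vendor()    # :openblas64 for the stock binaries
:openblas64

julia> filter(p -> occursin("openblas", lowercase(p)), dllist())   # path of the loaded library
``````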

This is my configuration:
``````
Julia Version 0.7.0-beta.279
Commit b0f531e5f0* (2018-07-12 15:33 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel® Core™ i7-6650U CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
``````

I have to admit that I haven't bothered building Julia for my particular setup. It is the generic Windows installation.

Did you disable multithreading inside the BLAS routines? I noticed that, at least in 0.6, there is a significant overhead if BLAS tries to multithread.

No, I haven't tried disabling multithreading.

My first example was with multithreading disabled, yet it was still much faster. When I re-enabled it, it got much faster still at larger sizes.
I'm on 0.7, and OpenBLAS now seems smarter about staying single threaded for smaller matrix sizes. E.g., as far as I could tell it ran single threaded here, regardless of what BLAS.set_num_threads was set to.

``````
julia> BLAS.set_num_threads(1)

julia> gradN = rand(60, 40);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
11.014 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.425 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.289 μs (0 allocations: 0 bytes)

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
10.994 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
8.420 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
9.277 μs (0 allocations: 0 bytes)

37.251 μs (0 allocations: 0 bytes)

52.913 μs (0 allocations: 0 bytes)
``````

Times were the same, and I never saw CPU usage exceed 100%.

Even with the generic installs, OpenBLAS is supposed to detect your processor and pick the appropriate kernel, so it shouldn't matter whether you built it from source or not.

BTW, @PetrKryslUCSD, if you're primarily concerned about small sizes and willing to use StaticArrays:

``````
julia> using StaticArrays

julia> gradN = @MMatrix rand(3, 2);

julia> Ke = @MMatrix fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
5.714 ns (0 allocations: 0 bytes)

8.696 ns (0 allocations: 0 bytes)

8.697 ns (0 allocations: 0 bytes)
``````

MMatrix also defaults to BLAS calls at larger sizes, but many of the other StaticArrays functions (e.g., `@MMatrix randn(m, n)`) aren't optimized for big matrices, so lots of things would randomly be painfully slow.


This includes `BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke)`, which eventually becomes the fastest method. (In my application `gradN` is not likely to be bigger than 64x3.)

Ah, okay. I think the skinniness of your matrix is part of the problem.
I would've called a 14x14 matrix "small", and that has more elements than the 64x3. (Although X * X' would be much larger for the latter.)

3 may also be an awkward width. E.g., on my computer, BLAS takes about as long for 64x3 as it does for 64x4, while the loop versions are of course about 33% faster for the former:

``````
julia> gradN = rand(64, 3);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
2.933 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.800 μs (0 allocations: 0 bytes)

3.731 μs (0 allocations: 0 bytes)

4.930 μs (0 allocations: 0 bytes)

julia> gradN = rand(64, 4);

julia> Ke = fill(0.0, size(gradN, 1), size(gradN, 1));

julia> @btime $Ke .= $mult .* mul!($Ke, $gradN, $gradN');
6.355 μs (0 allocations: 0 bytes)

julia> @btime BLAS.syrk!('U', 'N', $mult, $gradN, 0.0, $Ke);
3.100 μs (0 allocations: 0 bytes)

julia> @btime BLAS.gemm!('N', 'T', $mult, $gradN, $gradN, 0.0, $Ke);
2.981 μs (0 allocations: 0 bytes)