# Simple Mat-Vec multiply (understanding performance, without the bugs)

I wanted to show someone that straightforward Julia code is reasonably fast, so I used the following:

```julia
using BenchmarkTools, LinearAlgebra

function matmul(A, v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(v))
    @inbounds for j in 1:size(A,2)
        @simd for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

const testA = rand(500,500);
const testV = rand(500);

@benchmark matmul($testA, $testV)

@benchmark $testA * $testV

norm(matmul(testA,testV) - testA*testV)
```

and got within about a factor of 2 performance differences:

```
julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     43.095 μs (0.00% GC)
  median time:      44.030 μs (0.00% GC)
  mean time:        44.565 μs (0.00% GC)
  maximum time:     192.452 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial:
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     19.737 μs (0.00% GC)
  median time:      22.607 μs (0.00% GC)
  mean time:        44.927 μs (0.00% GC)
  maximum time:     10.635 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

I used all the tips I was aware of:

- constant global variables
- BenchmarkTools, with interpolation
- iterating down columns in the inner loop (column-major order)
- `@simd` and `@inbounds`
Is there something I should do to add to this before trying to use it as a demo?

1 Like

You could remove the outer simd macro, and it's always good to check the dimensions of the input arrays when turning off bounds checking.

2 Likes

Removing the outer @simd macro resulted in no change, but I assume it's not useful/doing anything in that case anyway.

I guess your point about the bounds is that Julia matrices etc. don't have to start at 1 and go to their size, right? You can start your indices at other values? I haven't looked into that very carefully.
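Right: with packages like OffsetArrays.jl, an array's indices can start anywhere, which is why bounds checks matter before `@inbounds`. As a hedged sketch (the name `matmul_axes` is mine), writing the loops against `axes` makes the function index-agnostic:

```julia
# Sketch: iterating with `axes` instead of `1:size(A, d)` keeps the loop correct
# (and `@inbounds` safe) even for arrays with non-standard index offsets.
function matmul_axes(A, v)
    axes(A, 2) == axes(v, 1) || throw(DimensionMismatch("axes(A, 2) must equal axes(v, 1)"))
    # B inherits A's row axis, whatever its offset
    B = fill!(similar(v, promote_type(eltype(A), eltype(v)), axes(A, 1)), 0)
    @inbounds for j in axes(A, 2)
        @simd for i in axes(A, 1)
            B[i] += A[i, j] * v[j]
        end
    end
    return B
end
```

For ordinary `Array`s this behaves identically to the `1:size(A, d)` version.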

Thatās all true, but a more obvious check in this case is that `size(A, 2) == length(v)`

3 Likes

Good point. Adding this throws the same error as native * does now:

```julia
if size(A,2) != length(v)
    throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
end
```

(edited original post as well)

`const` isnāt necessary when you use interpolation.

You could also look at the `@avx` macro from LoopVectorization.jl. And did you consider multithreading?

1 Like

The goal was to show performance of straightforward code, not the features of the language that make it easy to parallelize stuff.

How would @avx interact with @simd etc?

I think `@avx` uses simd internally.

2 Likes

These were for 500x500 matrices; when I do @threads it is way longer… I increased to 1500x1500 and threads still don't help. Definitely some big matrices… like 100000x100000 would benefit from threading, but when the multiply takes 100 ms or less, the overhead of threads is too high to benefit.
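For what it's worth, a thread-safe way to split this particular computation is over rows, so each thread owns a disjoint slice of the output (a hypothetical sketch; `matmul_threads` is a made-up name, and traversing a row at a time fights Julia's column-major layout, which may eat the gains at these sizes):

```julia
using Base.Threads

# Sketch: thread the row loop. Threading the column loop instead would have
# multiple threads updating the same B[i], a data race.
function matmul_threads(A, v)
    size(A, 2) == length(v) || throw(DimensionMismatch("size(A,2) must equal length(v)"))
    B = zeros(promote_type(eltype(A), eltype(v)), size(A, 1))
    @threads for i in 1:size(A, 1)
        acc = zero(eltype(B))
        @inbounds @simd for j in 1:size(A, 2)
            acc += A[i, j] * v[j]   # row-wise traversal: strided access into A
        end
        B[i] = acc
    end
    return B
end
```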

On the other hand, maybe other overhead is involved… here's an unthreaded but bigger problem:

```
julia> const testA = rand(15000,15000);
WARNING: redefining constant testA

julia> const testV = rand(15000);
WARNING: redefining constant testV

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     124.163 ms (0.00% GC)
  median time:      131.162 ms (0.00% GC)
  mean time:        131.578 ms (0.00% GC)
  maximum time:     141.866 ms (0.00% GC)
  --------------
  samples:          39
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial:
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     100.122 ms (0.00% GC)
  median time:      104.185 ms (0.00% GC)
  mean time:        104.557 ms (0.00% GC)
  maximum time:     110.910 ms (0.00% GC)
  --------------
  samples:          48
  evals/sample:     1
```

They run very similarly in time, showing that with the smaller 500x500 multiplies, the overhead of dispatch or something similar was probably part of the issue.

Itās definitely not the case that you need that large matrices. I ran multithreading today on an operation on small vectors (800 elements) with a runtime of 1ms. With 8 threads runtime sank to 120us.

I think thereās something else thatās tripping up the threads in this case.

1 Like

I donāt think `@avx` uses `@simd` at all internally - it does its own vectorization instead of passing the buck to LLVM. Thereās been a longstanding hope among Julia devs that LLVM could do the same thing if the Julia IR provides enough information, but that hasnāt come to fruition yet, so @Elrod took the matter into his own hands.

```julia
using LoopVectorization, BenchmarkTools

function jgemvavx(𝐀, 𝐱)
    𝐲 = similar(𝐱, size(𝐀, 1))
    @avx for i ∈ eachindex(𝐲)
        𝐲ᵢ = zero(eltype(𝐲))
        for j ∈ eachindex(𝐱)
            𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
        end
        𝐲[i] = 𝐲ᵢ
    end
    return 𝐲
end
```
```
julia> A = rand(500, 500); v = rand(500);

julia> @btime jgemvavx($A, $v);
  19.400 μs (1 allocation: 4.06 KiB)

julia> @btime $A * $v;
  59.999 μs (1 allocation: 4.06 KiB)
```

(pulled from the benchmarks)

2 Likes

OK,

```julia
function matmul(A, v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(v))
    @inbounds for j in 1:size(A,2)
        @avx for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end
```

gives me essentially equal time to using `*` for a 1500x1500 matrix:

```
julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.007 ms (0.00% GC)
  median time:      1.047 ms (0.00% GC)
  mean time:        1.061 ms (0.00% GC)
  maximum time:     1.765 ms (0.00% GC)
  --------------
  samples:          4690
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial:
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     902.513 μs (0.00% GC)
  median time:      969.267 μs (0.00% GC)
  mean time:        1.014 ms (0.00% GC)
  maximum time:     18.704 ms (0.00% GC)
  --------------
  samples:          4873
  evals/sample:     1
```

Nice!

```julia
function matmul(A, v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(v))
    @avx for j in 1:size(A,2), i in 1:size(A,1)
        B[i] += A[i,j] * v[j]
    end
    return B
end
```

it does essentially nothing to the time / is slightly longer:

```
julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     996.124 μs (0.00% GC)
  median time:      1.080 ms (0.00% GC)
  mean time:        1.094 ms (0.00% GC)
  maximum time:     1.742 ms (0.00% GC)
  --------------
  samples:          4545
  evals/sample:     1
```

The first thing I would note here is the benchmark:

Notice how performance plummets beyond around `60`x`60`?
This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.

What does "tiling" mean? It means operating block-wise to get data re-use. "Cache tiling" means re-use in the CPU's cache, and "register tiling" means re-use in the CPU's registers.
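As an illustration of the cache-tiling idea (this is my sketch, not LoopVectorization's output; `gemv_tiled!` is a made-up name and the `blk` default an arbitrary assumption), you can block the row range so the active slice of `y` stays resident in cache while the columns of `A` stream past:

```julia
# Sketch of cache tiling for y .= A*x: process the rows in blocks of `blk`,
# so y[i0:ihi] is re-used from cache across all N columns instead of being
# streamed through memory N times.
function gemv_tiled!(y, A, x; blk = 128)
    M, N = size(A)
    fill!(y, zero(eltype(y)))
    @inbounds for i0 in 1:blk:M
        ihi = min(i0 + blk - 1, M)   # this block of y stays hot
        for j in 1:N
            xj = x[j]
            @simd for i in i0:ihi
                y[i] += A[i, j] * xj
            end
        end
    end
    return y
end
```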

```julia
@inbounds for j in 1:size(A,2)
    @simd for i in 1:size(A,1)
        y[i] += A[i,j] * x[j]
    end
end
```

When register tiling here, note that as the loop above is written, we need to:

• load `x[j]` a total of `size(A,2)` times (the minimum possible).
• load `A[i,j]` a total of `size(A,1) * size(A,2)` times (the minimum possible).
• load and store `y[i]` a total of `size(A,1) * size(A,2)` times (an excess factor of `size(A,2)`).

Register tiling will attack that third point. To investigate what exactly it does:

```
julia> using LoopVectorization

julia> M = N = 500;

julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);

julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += A[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(ls)
([:i, :j], :j, :i, :i, 1, 8)
```

The vector `[:i, :j]` gives the order of the loops from outer-most to inner-most. That is, it placed the `i` loop as outer-most, while the `j` loop is inner-most.
The next two symbols name the loops that are unrolled, by the factors given by the two integers:
`j` is unrolled by 1x (i.e., it isn't), and `i` by 8x.
Given that `j` is now the inner-most loop, there's not much sense in unrolling it.
The last symbol refers to which loop is actually vectorized.
Note that this can change as a function of types:

```
julia> B = A';

julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += B[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
([:i, :j], :j, :i, :j, 1, 8)
```

Anyway, back to the original `([:i, :j], :j, :i, :i, 1, 8)` example:
We still load from `A` once per `(i, j)` combination. We were already at the minimum, so no savings there.
As `i` is now the outer-most loop, we load and store `y[i]` only `size(A,1)` times.

However, we now load from `x[j]` `cld(size(A,1),8) * size(A,2)` times. We get a reduction factor of 8 because each load of `x[j]` is re-used across the 8 iterations by which the `i` loop is unrolled. 8x unrolling means an 8x reduction in how many times we have to load from this vector!
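Hand-writing the unroll makes that re-use visible. The sketch below (a made-up `gemv_unroll4!`) unrolls `i` by 4 rather than 8 for brevity, and is only illustrative of the kind of code LoopVectorization generates:

```julia
# Illustrative hand-unrolled gemv kernel: one load of x[j] serves four rows of A,
# and each row's partial sum lives in its own accumulator (register) until the end.
function gemv_unroll4!(y, A, x)
    M, N = size(A)
    M % 4 == 0 || throw(ArgumentError("this sketch assumes size(A,1) divisible by 4"))
    @inbounds for i in 1:4:M
        a1 = a2 = a3 = a4 = zero(eltype(y))
        for j in 1:N
            xj = x[j]                 # loaded once per 4 rows, not once per row
            a1 += A[i,   j] * xj
            a2 += A[i+1, j] * xj
            a3 += A[i+2, j] * xj
            a4 += A[i+3, j] * xj
        end
        y[i] = a1; y[i+1] = a2; y[i+2] = a3; y[i+3] = a4
    end
    return y
end
```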

We can see some spectacular performance improvements from this:

```
julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end
       end
gemvavx1! (generic function with 1 method)

julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
gemvavx2! (generic function with 1 method)

julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
               y[i] += A[i,j] * x[j]
           end; end
       end
gemv_simd! (generic function with 1 method)

julia> M = N = 56;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     862.015 ns (0.00% GC)
  median time:      867.697 ns (0.00% GC)
  mean time:        867.474 ns (0.00% GC)
  maximum time:     1.537 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     66

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.937 ns (0.00% GC)
  median time:      93.351 ns (0.00% GC)
  mean time:        93.472 ns (0.00% GC)
  maximum time:     147.397 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.973 ns (0.00% GC)
  median time:      77.107 ns (0.00% GC)
  mean time:        77.198 ns (0.00% GC)
  maximum time:     130.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     969

julia> y1 ≈ y2 ≈ y3
true
```

But ā only at these tiny sizes.
Your CPU is way faster at computing than it is at moving memory around.
And `A * x` requires on the order of `length(A)` operations, and on the same order of memory. So in the end, itāll perform more or less the same as `sum`:

```
julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)

julia> @btime sum($A)
  192.643 ns (0 allocations: 0 bytes)
1578.41421948359

julia> @btime sumavx($A)
  67.960 ns (0 allocations: 0 bytes)
1578.41421948359
```

`A * x` and `sum(A)` require accessing roughly the same amount of memory, and once these arrays stop fitting in the CPU's caches, memory bandwidth is all that determines performance. Make `A` 500x500, and `gemv_simd!`, which was so slow before, is now just as fast as everything else:

```
julia> M = N = 500;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.602 μs (0.00% GC)
  median time:      85.719 μs (0.00% GC)
  mean time:        85.867 μs (0.00% GC)
  maximum time:     168.723 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.080 μs (0.00% GC)
  median time:      84.833 μs (0.00% GC)
  mean time:        84.985 μs (0.00% GC)
  maximum time:     257.037 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx2!($y3, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     83.944 μs (0.00% GC)
  median time:      84.750 μs (0.00% GC)
  mean time:        84.895 μs (0.00% GC)
  maximum time:     258.025 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y1 ≈ y2 ≈ y3
true

julia> @btime sumavx($A)
  83.417 μs (0 allocations: 0 bytes)
125042.8745667071

julia> @btime sumsimd($A)
  83.430 μs (0 allocations: 0 bytes)
125042.87456670718
```
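A back-of-envelope check is consistent with those numbers: streaming `A` once dominates the memory traffic, and at an assumed sustained bandwidth (the 25 GB/s below is a made-up but plausible single-core figure) the arithmetic lands in the measured ballpark:

```julia
# Rough bandwidth math for the 500x500 Float64 gemv above (assumed numbers).
bytes_A = 500 * 500 * 8      # the matrix alone: 2_000_000 bytes
bw      = 25e9               # assumed sustained memory bandwidth, bytes/s
t_us    = bytes_A / bw * 1e6 # time just to stream A once, in μs: 80.0
```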

So the best you can do is add multithreading, breaking up the `i` loop across threads (you can't parallelize the `j` loop, or you'd have different threads reading and writing `y` simultaneously):

```julia
function gemv_threads!(y, A, x)
    M = length(y)
    chunks = Threads.nthreads()  # one chunk per thread (a simple choice)
    chunks > 1 || return gemvavx2!(y, A, x)
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))
    Threads.@threads for c in 1:chunks
        b = chunk_size*(c-1) + 1
        e = min(M, chunk_size*c)
        r = b:e
        gemvavx2!(view(y, r), view(A, r, :), x)
    end
    y
end
```

With this, for 500x500, I get

```
julia> @benchmark gemv_threads!($y3, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  2.09 KiB
  allocs estimate:  30
  --------------
  minimum time:     10.343 μs (0.00% GC)
  median time:      35.046 μs (0.00% GC)
  mean time:        35.778 μs (0.00% GC)
  maximum time:     164.298 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mul!($y2, $A, $x)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.815 μs (0.00% GC)
  median time:      15.596 μs (0.00% GC)
  mean time:        15.964 μs (0.00% GC)
  maximum time:     82.304 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y2 ≈ y3
true
```

Peak performance is similar to OpenBLAS here, but average performance is much worse.

As @stillyslalom said, it does its own modeling to make decisions and then generates code with lots of `llvmcall`.

13 Likes

For large matrix sizes (to the point of being memory bandwidth-limited), you can get multithreading pretty much for free via Tullio.jl, which uses threads + `@avx` to generate its loop kernels:

```julia
tmatmul(A, v) = @tullio b[i] := A[i,j] * v[j]
```
```
julia> @btime $A * $b
  498.300 μs (1 allocation: 11.88 KiB)

julia> @btime tmatmul($A, $b)
  469.400 μs (257 allocations: 30.81 KiB)
```

At that point, there's not much to be gained by further algorithmic twiddling - you're solely limited by the number and bandwidth of your memory channels.

2 Likes

I said simd, not `@simd`.

Tullio is brilliant: not only do you get threads and avx for free, you get a concise notation that makes your code super clear and way less bug-prone. I'm immediately adopting it for whenever I need to write array-twiddling code!

Iāll test the speed and post a comparison in a few hours. Thanks for this!

Ok, just to see how well it works… here's the Tullio version:

```julia
function matmultull(A, v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = similar(v, size(A,1))
    @tullio B[i] = A[i,j] * v[j]
    return B
end
```

ZERO loops for that undergrad to get wrong.

Howās the speed? EXACTLY the same as with my hand written undergradish loop code + @avx (and basically the same as built in *)

```
julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     990.039 μs (0.00% GC)
  median time:      1.084 ms (0.00% GC)
  mean time:        1.102 ms (0.00% GC)
  maximum time:     1.707 ms (0.00% GC)
  --------------
  samples:          4508
  evals/sample:     1

julia> @benchmark matmultull($testA,$testV)
BenchmarkTools.Trial:
  memory estimate:  18.55 KiB
  allocs estimate:  99
  --------------
  minimum time:     993.795 μs (0.00% GC)
  median time:      1.054 ms (0.00% GC)
  mean time:        1.089 ms (0.00% GC)
  maximum time:     2.290 ms (0.00% GC)
  --------------
  samples:          4562
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial:
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     831.787 μs (0.00% GC)
  median time:      924.277 μs (0.00% GC)
  mean time:        1.027 ms (0.00% GC)
  maximum time:     10.739 ms (0.00% GC)
  --------------
  samples:          4811
  evals/sample:     1
```

Thanks again!

2 Likes