Simple Mat-Vec multiply (understanding performance, without the bugs)

I wanted to show someone that straight-ahead Julia code is reasonably fast, so I used the following:

using BenchmarkTools, LinearAlgebra

function matmul(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))   # output has one entry per row of A
    @inbounds for j in 1:size(A,2)
        @simd for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

const testA = rand(500,500);
const testV = rand(500);

@benchmark matmul($testA,$testV)

@benchmark $testA * $testV

norm(matmul(testA,testV) - testA*testV)

and got within about a factor of 2 of the built-in performance:


julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     43.095 μs (0.00% GC)
  median time:      44.030 μs (0.00% GC)
  mean time:        44.565 μs (0.00% GC)
  maximum time:     192.452 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     19.737 μs (0.00% GC)
  median time:      22.607 μs (0.00% GC)
  mean time:        44.927 μs (0.00% GC)
  maximum time:     10.635 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I used all the tips I was aware of:

  • constant global variables
  • benchmarking with BenchmarkTools, with interpolation
  • iterating over columns in the outer loop (column-major access)
  • @simd and @inbounds

Is there something I should add to this before trying to use it as a demo?


You could remove the outer simd macro, and it's always good to check the dimensions of the input arrays when turning off bounds checking.


Removing the outer @simd macro resulted in no change, but I assume it's not useful/doing anything in that case anyway.

I guess your point about the bounds is that Julia matrices etc. don't have to start at 1 and go to size, right? You can start your indices at other values? I haven't looked into that very carefully.

That's all true, but a more obvious check in this case is that size(A, 2) == length(v).
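
Julia arrays can indeed have custom axes (e.g. via OffsetArrays), so if you hard-code 1:size(A,1) under @inbounds, a defensive pattern is to reject offset arrays up front. A sketch (not from the thread; matmul_checked is just an illustrative name):

function matmul_checked(A, v)
    Base.require_one_based_indexing(A, v)   # throws for offset-indexed arrays
    size(A,2) == length(v) || throw(DimensionMismatch("size(A,2) != length(v)"))
    B = zeros(size(A,1))
    @inbounds for j in 1:size(A,2)
        @simd for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end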


Good point. Adding this throws the same error as native * does now:

    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end

(edited original post as well)

const isn't necessary when you use interpolation.
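
For example, this benchmarks identically without const, because $ splices the globals into the benchmark as if they were locals:

julia> A = rand(500, 500); v = rand(500);   # plain, non-const globals

julia> @btime matmul($A, $v);   # interpolation avoids the untyped-global penalty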

You could also look at the @avx macro from LoopVectorization.jl. And did you consider multithreading?


The goal was to show performance of straightforward code, not the features of the language that make it easy to parallelize stuff.

How would @avx interact with @simd etc?

Just mentioning multithreading since it's likely that your BLAS calls are threaded.
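
On recent Julia versions you can check what OpenBLAS is doing (BLAS.get_num_threads isn't available on older releases):

julia> using LinearAlgebra

julia> BLAS.get_num_threads()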

I think @avx uses simd internally.


These were for 500x500 matrices; when I do @threads it is way longer… I increased to 1500x1500 and threads still don't help. Definitely some big matrices, like 100000 x 100000, would benefit from threading, but when a multiply takes 100ms or less, the overhead of threads is too high to benefit.
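
(If @threads makes things slower, it's also worth confirming that Threads.nthreads() > 1; with a single thread, Threads.@threads only adds overhead.)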

On the other hand, maybe other overhead is involved… here's an unthreaded but bigger problem:

julia> const testA = rand(15000,15000);
WARNING: redefining constant testA

julia> const testV = rand(15000);
WARNING: redefining constant testV

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     124.163 ms (0.00% GC)
  median time:      131.162 ms (0.00% GC)
  mean time:        131.578 ms (0.00% GC)
  maximum time:     141.866 ms (0.00% GC)
  --------------
  samples:          39
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     100.122 ms (0.00% GC)
  median time:      104.185 ms (0.00% GC)
  mean time:        104.557 ms (0.00% GC)
  maximum time:     110.910 ms (0.00% GC)
  --------------
  samples:          48
  evals/sample:     1

They run very similarly in time, showing that with the smaller 500x500 multiplies, the overhead of dispatch or something was probably part of the issue.

It's definitely not the case that you need matrices that large. I ran multithreading today on an operation on small vectors (800 elements) with a runtime of 1 ms. With 8 threads, the runtime sank to 120 μs.

I think there's something else tripping up the threads in this case.


Did you use @threads for your example case?

I don't think @avx uses @simd at all internally – it does its own vectorization instead of passing the buck to LLVM. There's been a longstanding hope among Julia devs that LLVM could do the same thing if the Julia IR provides enough information, but that hasn't come to fruition yet, so @Elrod took the matter into his own hands.

using LoopVectorization, BenchmarkTools

function jgemvavx(š€, š±)
    š² = copy(š±)
    @avx for i āˆˆ eachindex(š²)
        š²įµ¢ = zero(eltype(š²))
        for j āˆˆ eachindex(š±)
            š²įµ¢ += š€[i,j] * š±[j]
        end
        š²[i] = š²įµ¢
    end
end
julia> A = rand(500, 500); b = rand(500);

julia> @btime jgemvavx($A, $b);
  19.400 μs (1 allocation: 4.06 KiB)

julia> @btime $A * $b;
  59.999 μs (1 allocation: 4.06 KiB)

(pulled from the benchmarks)


OK,


function matmul(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))
    @inbounds for j in 1:size(A,2)
        @avx for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

gives me essentially equal time to using * for a 1500x1500 matrix:


julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.007 ms (0.00% GC)
  median time:      1.047 ms (0.00% GC)
  mean time:        1.061 ms (0.00% GC)
  maximum time:     1.765 ms (0.00% GC)
  --------------
  samples:          4690
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     902.513 μs (0.00% GC)
  median time:      969.267 μs (0.00% GC)
  mean time:        1.014 ms (0.00% GC)
  maximum time:     18.704 ms (0.00% GC)
  --------------
  samples:          4873
  evals/sample:     1

Nice!

When I switch to trying to thread it on 4 threads…


function matmul(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))
    # NOTE: threading over j is a data race: every task reads and writes all of B,
    # so any speedup would come at the cost of (nondeterministically) wrong answers.
    Threads.@threads for j in 1:size(A,2)
        @avx for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

it does absolutely nothing to the time, or makes it slightly longer:

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     996.124 μs (0.00% GC)
  median time:      1.080 ms (0.00% GC)
  mean time:        1.094 ms (0.00% GC)
  maximum time:     1.742 ms (0.00% GC)
  --------------
  samples:          4545
  evals/sample:     1


The first thing I would note here is the benchmark:

[LoopVectorization's mat-vec benchmark plot, not reproduced here]

Notice how performance plummets beyond around 60x60?
This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.

What does "tiling" mean? It means operating block-wise to get data re-use. "Cache-tiling" means re-use in the CPU's cache, and "register-tiling" means re-use in the CPU's registers.
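
For intuition, here's a minimal sketch of what cache tiling the mat-vec could look like (my illustration, not what LoopVectorization generates; block would need tuning to the cache size):

function matmul_tiled(A, v; block = 256)
    B = zeros(size(A,1))
    @inbounds for ib in 1:block:size(A,1)      # process one block of rows at a time
        ie = min(ib + block - 1, size(A,1))
        for j in 1:size(A,2)                   # sweep every column for this block
            @simd for i in ib:ie
                B[i] += A[i,j] * v[j]          # B[ib:ie] stays resident in cache
            end
        end
    end
    return B
end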

@inbounds for j in 1:size(A,2)
    @simd for i in 1:size(A,1)
        y[i] += A[i,j] * x[j]
    end
end

When register-tiling here, note that as the loop above is written, we need to:

  • load x[j] a total of size(A,2) times – minimum possible.
  • load A[i,j] a total of size(A,1) * size(A,2) times – minimum possible.
  • load and store y[i] a total of size(A,1) * size(A,2) times – an excess factor of size(A,2).

Register tiling will attack that third point. To investigate what exactly it does:

julia> using LoopVectorization

julia> M = N = 500;

julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);

julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += A[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(ls)
 ([:i, :j], :j, :i, :i, 1, 8)

The vector [:i, :j] gives the order of the loops from outer-most to inner-most. That is, it placed the i loop as outer-most, while the j loop is inner-most.
The next two symbols give which loops are unrolled by factors corresponding to the integers:
j is unrolled by 1x (i.e., it isn't), and i by 8x.
Given that j is now the innermost loop, there's not much sense in unrolling it.
The last symbol refers to which loop is actually vectorized.
Note that this can change as a function of types:

julia> B = A';

julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += B[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
 ([:i, :j], :j, :i, :j, 1, 8)

Anyway, back to the original ([:i, :j], :j, :i, :i, 1, 8) example:
We still load from A once per (i,j) combination. We were already at the minimum, so no savings there.
As i is now the outer-most loop, we load and store y[i] only size(A,1) times in total.

However, we now load from x[j] cld(size(A,1),8) * size(A,2) times, rather than the old minimum of size(A,2). Unrolling is what keeps that affordable: each load of x[j] is re-used across the 8 unrolled iterations of the i loop, so 8x unrolling means an 8x reduction in how many times we have to load from this vector!
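
To make that concrete, here is a scalar sketch of what the unrolling effectively does (my illustration, not LoopVectorization's generated code; it unrolls i by 4 instead of 8 for brevity and assumes size(A,1) is a multiple of 4, whereas the real code uses SIMD registers and handles remainders):

function gemv_unrolled!(y, A, x)
    M, N = size(A)
    @inbounds for i in 1:4:M
        y1 = y2 = y3 = y4 = zero(eltype(y))   # four accumulators held in registers
        for j in 1:N
            xj = x[j]                         # one load of x[j], re-used 4 times
            y1 += A[i,   j] * xj
            y2 += A[i+1, j] * xj
            y3 += A[i+2, j] * xj
            y4 += A[i+3, j] * xj
        end
        y[i] = y1; y[i+1] = y2; y[i+2] = y3; y[i+3] = y4
    end
    return y
end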


We can see some spectacular performance improvements from this:

julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
              y[i] += A[i,j] * x[j]
           end
       end
 gemvavx1! (generic function with 1 method)

julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
 gemvavx2! (generic function with 1 method)

julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
              y[i] += A[i,j] * x[j]
           end; end
       end
 gemv_simd! (generic function with 1 method)

julia> M = N = 56;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     862.015 ns (0.00% GC)
  median time:      867.697 ns (0.00% GC)
  mean time:        867.474 ns (0.00% GC)
  maximum time:     1.537 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     66

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.937 ns (0.00% GC)
  median time:      93.351 ns (0.00% GC)
  mean time:        93.472 ns (0.00% GC)
  maximum time:     147.397 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> @benchmark gemvavx2!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.973 ns (0.00% GC)
  median time:      77.107 ns (0.00% GC)
  mean time:        77.198 ns (0.00% GC)
  maximum time:     130.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     969

julia> y1 ≈ y2 ≈ y3
 true

But – only at these tiny sizes.
Your CPU is way faster at computing than it is at moving memory around.
And A * x requires on the order of length(A) operations, and touches on the same order of memory. So in the end, it'll perform more or less the same as sum:

julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
 sumavx (generic function with 1 method)

julia> @btime sum($A)
  192.643 ns (0 allocations: 0 bytes)
 1578.41421948359

julia> @btime sumavx($A)
  67.960 ns (0 allocations: 0 bytes)
 1578.41421948359

Since A * x and sum(A) access roughly the same amount of memory, once these arrays stop fitting in our CPU's caches, memory bandwidth is all that determines performance. (A 500x500 Float64 matrix is 500 * 500 * 8 bytes ≈ 1.9 MiB, larger than many CPUs' per-core L2 cache.) Make A 500x500, and gemv_simd!, which was so slow before, is now just as fast as everything else:

julia> M = N = 500;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.602 μs (0.00% GC)
  median time:      85.719 μs (0.00% GC)
  mean time:        85.867 μs (0.00% GC)
  maximum time:     168.723 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.080 μs (0.00% GC)
  median time:      84.833 μs (0.00% GC)
  mean time:        84.985 μs (0.00% GC)
  maximum time:     257.037 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx2!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     83.944 μs (0.00% GC)
  median time:      84.750 μs (0.00% GC)
  mean time:        84.895 μs (0.00% GC)
  maximum time:     258.025 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y1 ≈ y2 ≈ y3
 true

julia> @btime sumavx($A)
  83.417 μs (0 allocations: 0 bytes)
 125042.8745667071

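(sumsimd isn't defined anywhere in the thread; presumably it's the @simd analog of sumavx, something like:)

julia> function sumsimd(A)
           s = zero(eltype(A))
           @inbounds @simd for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
 sumsimd (generic function with 1 method)
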
julia> @btime sumsimd($A)
  83.430 μs (0 allocations: 0 bytes)
 125042.87456670718

So the best you can do is add multithreading and break up the i range (you can't parallelize the j loop, or you'd have different threads reading and writing y simultaneously):

function gemv_threads!(y, A, x)
    M = length(y)
    # at most one chunk per 128 rows, and never more chunks than threads
    chunks = min(M >> 7, Base.Threads.nthreads())
    chunks > 1 || return gemvavx2!(y, A, x)   # too small to be worth spawning tasks
    # round the chunk length up so each chunk starts at a SIMD-friendly offset
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))
    Base.Threads.@sync for c in 1:chunks
         Threads.@spawn begin
              b = chunk_size*(c-1) + 1
              e = min(M, chunk_size*c)
              r = b:e
              # each task owns a disjoint block of rows, so there are no races on y
              gemvavx2!(view(y, r), view(A, r, :), x)
         end
    end
    y
end
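
(This assumes Julia itself was started with multiple threads, e.g. JULIA_NUM_THREADS=4 julia, or julia -t 4 on Julia 1.5 and later; Threads.nthreads() reports how many you actually got.)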

With this, for 500x500, I get

julia> @benchmark gemv_threads!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  2.09 KiB
  allocs estimate:  30
  --------------
  minimum time:     10.343 μs (0.00% GC)
  median time:      35.046 μs (0.00% GC)
  mean time:        35.778 μs (0.00% GC)
  maximum time:     164.298 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> using LinearAlgebra; BLAS.set_num_threads(Base.Threads.nthreads());

julia> @benchmark mul!($y2, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.815 μs (0.00% GC)
  median time:      15.596 μs (0.00% GC)
  mean time:        15.964 μs (0.00% GC)
  maximum time:     82.304 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y2 ≈ y3
 true

Peak performance is similar with OpenBLAS here, but average performance is much worse.

As @stillyslalom said, it does its own modeling to make decisions and then generates code with lots of llvmcall.


For large matrix sizes (to the point of being memory bandwidth-limited), you can get multithreading pretty much for free via Tullio.jl, which uses threads + @avx to generate its loop kernels:

using Tullio, LoopVectorization   # Tullio's kernels use @avx when LoopVectorization is loaded

tmatmul(A, v) = @tullio b[i] := A[i,j] * v[j]
julia> @btime $A * $b
  498.300 μs (1 allocation: 11.88 KiB)

julia> @btime tmatmul($A, $b)
  469.400 μs (257 allocations: 30.81 KiB)

At that point, thereā€™s not much to be gained by further algorithmic twiddling - youā€™re solely limited by the number & bandwidth of your memory channels.
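
Back-of-the-envelope, assuming those timings are for the 1500x1500 case: streaming A once moves 1500 * 1500 * 8 bytes = 18 MB, and 18 MB / 470 μs ≈ 38 GB/s, which is about what a desktop memory subsystem can actually sustain. That's why the two timings converge.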


I said simd, not @simd.

Tullio is brilliant: not only do you get threads and avx for free, you get a concise notation that makes your code super clear and way less bug-prone. I'm immediately adopting it for whenever I need to write array-twiddling code!

I'll test the speed and post a comparison in a few hours. Thanks for this!

OK, just to see how well it works… here's the Tullio version:


function matmultull(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = similar(v, size(A,1))
    @tullio B[i] = A[i,j] * v[j]
    return B
end
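
(If I'm reading the Tullio docs right, the := form would even allocate the output and check the shared index ranges for you, shrinking the whole body to the one-liner shown earlier: matmultull(A, v) = @tullio B[i] := A[i,j] * v[j].)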

ZERO loops for that undergrad to get wrong.

How's the speed? EXACTLY the same as with my hand-written undergrad-ish loop code + @avx (and basically the same as the built-in *):

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     990.039 μs (0.00% GC)
  median time:      1.084 ms (0.00% GC)
  mean time:        1.102 ms (0.00% GC)
  maximum time:     1.707 ms (0.00% GC)
  --------------
  samples:          4508
  evals/sample:     1

julia> @benchmark matmultull($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  18.55 KiB
  allocs estimate:  99
  --------------
  minimum time:     993.795 μs (0.00% GC)
  median time:      1.054 ms (0.00% GC)
  mean time:        1.089 ms (0.00% GC)
  maximum time:     2.290 ms (0.00% GC)
  --------------
  samples:          4562
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     831.787 μs (0.00% GC)
  median time:      924.277 Ī¼s (0.00% GC)
  mean time:        1.027 ms (0.00% GC)
  maximum time:     10.739 ms (0.00% GC)
  --------------
  samples:          4811
  evals/sample:     1

Thanks again!
