Simple Mat-Vec multiply (understanding performance, without the bugs)

dlakelan · August 11, 2020, 9:22pm

I wanted to show someone that straightforward Julia code is reasonably fast. So I used the following:

using BenchmarkTools,LinearAlgebra

function matmul(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))  # one output entry per row of A
    @inbounds for j in 1:size(A,2)
        @simd for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

const testA = rand(500,500);
const testV = rand(500);

@benchmark matmul($testA,$testV)

@benchmark $testA * $testV

norm(matmul(testA,testV) - testA*testV)

and got performance within about a factor of 2 of the built-in *:


julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     43.095 μs (0.00% GC)
  median time:      44.030 μs (0.00% GC)
  mean time:        44.565 μs (0.00% GC)
  maximum time:     192.452 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  4.06 KiB
  allocs estimate:  1
  --------------
  minimum time:     19.737 μs (0.00% GC)
  median time:      22.607 μs (0.00% GC)
  mean time:        44.927 μs (0.00% GC)
  maximum time:     10.635 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

I used all the tips I was aware of:

constant global variables
using BenchmarkTools, with interpolation
iterate over columns first
@simd, @inbounds

Is there something I should do to add to this before trying to use it as a demo?
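
For a demo, the "iterate over columns first" tip can be shown directly by timing both loop orders. This is a minimal sketch using only Base and LinearAlgebra; the names matvec_cols and matvec_rows are made up for illustration, and the timings depend on the machine.

```julia
using LinearAlgebra

# Inner loop runs down a column of A (first index varies fastest): contiguous access.
function matvec_cols(A, x)
    size(A, 2) == length(x) ||
        throw(DimensionMismatch("A has $(size(A, 2)) columns, x has length $(length(x))"))
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 1))
    for j in axes(A, 2), i in axes(A, 1)   # j is the outer loop, i the inner loop
        y[i] += A[i, j] * x[j]
    end
    return y
end

# Inner loop runs along a row of A: consecutive accesses are size(A, 1) elements apart.
function matvec_rows(A, x)
    size(A, 2) == length(x) ||
        throw(DimensionMismatch("A has $(size(A, 2)) columns, x has length $(length(x))"))
    y = zeros(promote_type(eltype(A), eltype(x)), size(A, 1))
    for i in axes(A, 1), j in axes(A, 2)   # i is the outer loop, j the inner loop
        y[i] += A[i, j] * x[j]
    end
    return y
end

A = rand(2000, 2000); x = rand(2000)
matvec_cols(A, x); matvec_rows(A, x)       # run once so compilation is not timed
t_cols = @elapsed matvec_cols(A, x)
t_rows = @elapsed matvec_rows(A, x)
println("column order: ", t_cols, " s, row order: ", t_rows, " s")
```

Both functions compute the same product; only the memory access pattern differs.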

kristoffer.carlsson · August 11, 2020, 9:31pm

You could remove the outer simd macro and it’s always good to check the dimensions of the input arrays when turning off boundschecking.

dlakelan · August 11, 2020, 9:34pm

Removing the outer @simd macro resulted in no change, but I assume it’s not useful/doing anything in that case anyway.

I guess your point about the bounds is that Julia arrays don’t have to start at index 1 and run to size, right? Indices can start at other values? I haven’t looked into that very carefully.
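
On the offset-index question, one way to guard a function that combines @inbounds with 1:size(A,1) loops is sketched below. It assumes Base.require_one_based_indexing, an unexported Base helper that throws for arrays whose indices do not start at 1; matvec_guarded is a hypothetical name.

```julia
using LinearAlgebra

function matvec_guarded(A::AbstractMatrix, v::AbstractVector)
    # Reject offset arrays up front, since the loops below assume indices 1:size.
    Base.require_one_based_indexing(A, v)
    size(A, 2) == length(v) ||
        throw(DimensionMismatch("second dimension of A, $(size(A, 2)), does not match length of v, $(length(v))"))
    B = zeros(promote_type(eltype(A), eltype(v)), size(A, 1))
    # With both checks passed, every index used below is in bounds.
    @inbounds for j in 1:size(A, 2)
        @simd for i in 1:size(A, 1)
            B[i] += A[i, j] * v[j]
        end
    end
    return B
end
```

The checks run once per call, so their cost is negligible next to the loops.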

rdeits · August 11, 2020, 9:37pm

That’s all true, but a more obvious check in this case is that size(A, 2) == length(v)

dlakelan · August 11, 2020, 9:50pm

Good point. Adding this throws the same error as native * does now:

    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end

(edited original post as well)
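
A pitfall worth checking in messages like this one: inside a string, `$name(args)` interpolates only `name`, so a function call needs `$(...)`. A small sketch with made-up variable names:

```julia
A = rand(3, 4); v = rand(5)

# `$size(A,2)` interpolates only the function `size`; "(A,2)" stays literal text.
unparenthesized = "second dimension of A, $size(A,2), does not match length of v, $length(v)"

# `$(...)` interpolates the value of the whole expression.
parenthesized = "second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"

println(unparenthesized)
println(parenthesized)
```

The first message prints the function names instead of the numbers; the second prints 4 and 5.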

DNF · August 11, 2020, 10:07pm

const isn’t necessary when you use interpolation.

You could also look at the @avx macro from LoopVectorization.jl. And did you consider multithreading?

dlakelan · August 11, 2020, 10:12pm

The goal was to show performance of straightforward code, not the features of the language that make it easy to parallelize stuff.

How would @avx interact with @simd etc?

DNF · August 11, 2020, 10:17pm

Just mentioning multithreading since it’s likely that your BLAS calls are threaded.

I think @avx uses simd internally.

dlakelan · August 11, 2020, 10:31pm

These were for 500x500 matrices, when I do @threads it is way longer… I increased to 1500x1500 and still threads don’t help. Definitely some big matrices… like 100000 x 100000 would benefit from threading, but when it takes 100ms or less to do the multiply the overhead of threads is too high to benefit.

On the other hand, maybe other overhead is involved… here’s unthreaded but bigger problem:

julia> const testA = rand(15000,15000);
WARNING: redefining constant testA

julia> const testV = rand(15000);
WARNING: redefining constant testV

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     124.163 ms (0.00% GC)
  median time:      131.162 ms (0.00% GC)
  mean time:        131.578 ms (0.00% GC)
  maximum time:     141.866 ms (0.00% GC)
  --------------
  samples:          39
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  117.27 KiB
  allocs estimate:  2
  --------------
  minimum time:     100.122 ms (0.00% GC)
  median time:      104.185 ms (0.00% GC)
  mean time:        104.557 ms (0.00% GC)
  maximum time:     110.910 ms (0.00% GC)
  --------------
  samples:          48
  evals/sample:     1

They run in very similar time, which suggests that for the smaller 500x500 multiplies, fixed overhead (dispatch or something similar) was probably part of the gap.

DNF · August 11, 2020, 10:45pm

It’s definitely not the case that you need that large matrices. I ran multithreading today on an operation on small vectors (800 elements) with a runtime of 1ms. With 8 threads runtime sank to 120us.

I think there’s something else that’s tripping up the threads in this case.

dlakelan · August 11, 2020, 11:09pm

did you use @threads for your example case?

stillyslalom · August 11, 2020, 11:12pm

I don’t think @avx uses @simd at all internally - it does its own vectorization instead of passing the buck to LLVM. There’s been a longstanding hope among Julia devs that LLVM could do the same thing if the Julia IR provides enough information, but that hasn’t come to fruition yet, so @Elrod took the matter into his own hands.

using LoopVectorization, BenchmarkTools

function jgemvavx(𝐀, 𝐱)
    𝐲 = similar(𝐱, size(𝐀, 1))  # one output entry per row of 𝐀
    @avx for i ∈ eachindex(𝐲)
        𝐲ᵢ = zero(eltype(𝐲))
        for j ∈ eachindex(𝐱)
            𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
        end
        𝐲[i] = 𝐲ᵢ
    end
    return 𝐲
end

julia> A = rand(500, 500); b = rand(500);

julia> @btime matmul($A, $b);
  19.400 μs (1 allocation: 4.06 KiB)

julia> @btime $A * $b;
  59.999 μs (1 allocation: 4.06 KiB)

(pulled from the benchmarks)

dlakelan · August 11, 2020, 11:23pm

OK,


function matmul(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))  # one output entry per row of A
    @inbounds for j in 1:size(A,2)
        @avx for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

gives me essentially equal time to using * for 1500x1500 matrix


julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     1.007 ms (0.00% GC)
  median time:      1.047 ms (0.00% GC)
  mean time:        1.061 ms (0.00% GC)
  maximum time:     1.765 ms (0.00% GC)
  --------------
  samples:          4690
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     902.513 μs (0.00% GC)
  median time:      969.267 μs (0.00% GC)
  mean time:        1.014 ms (0.00% GC)
  maximum time:     18.704 ms (0.00% GC)
  --------------
  samples:          4873
  evals/sample:     1

Nice!

When I switch to trying to thread it on 4 threads…


function matmul(A,v)
    if size(A,2) != length(v)
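        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = zeros(size(A,1))  # one output entry per row of A
    # NOTE: threading over j lets several threads update the same B[i] at once (a data race)
    Threads.@threads for j in 1:size(A,2)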
        @avx for i in 1:size(A,1)
            B[i] += A[i,j] * v[j]
        end
    end
    return B
end

the time is essentially unchanged, if anything slightly longer:

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     996.124 μs (0.00% GC)
  median time:      1.080 ms (0.00% GC)
  mean time:        1.094 ms (0.00% GC)
  maximum time:     1.742 ms (0.00% GC)
  --------------
  samples:          4545
  evals/sample:     1

Elrod · August 12, 2020, 2:08am

The first thing I would note here is the benchmark:

Notice how performance plummets beyond around 60x60?
This is because LoopVectorization does register tiling, but does not yet do cache tiling. I plan for this to change in the future, but for now, that means we must do the cache-tiling ourselves.

What does “tiling” mean? It means operating block-wise to get data re-use. “Cache-tiling” means re-use in the CPU’s cache, and “register-tiling” means re-use in the CPU-registers.

@inbounds for j in 1:size(A,2)
        @simd for i in 1:size(A,1)
            y[i] += A[i,j] * x[j]
        end
    end

When register-tiling here, note that as the loop above is written, we need to:

load x[j] a total of size(A,2) times – minimum possible.
load A[i,j] a total of size(A,1) * size(A,2) times – minimum possible.
load and store y[i] a total of size(A,1) * size(A,2) times – an excess factor of size(A,2).
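
These counts can be checked by instrumenting the loop nest. A small sketch, assuming each `y[i] += ...` costs one load and one store of y[i] and that x[j] is hoisted out of the inner loop; count_traffic is a made-up helper, not part of any package.

```julia
# Count memory operations for the j-outer, i-inner loop order as written.
function count_traffic(M, N)
    x_loads = 0; A_loads = 0; y_loads = 0; y_stores = 0
    for j in 1:N
        x_loads += 1              # x[j] is loaded once per outer iteration
        for i in 1:M
            A_loads += 1          # every A[i,j] is touched exactly once
            y_loads += 1          # y[i] += ... reads y[i] ...
            y_stores += 1         # ... and writes it back
        end
    end
    return (x_loads = x_loads, A_loads = A_loads, y_loads = y_loads, y_stores = y_stores)
end

counts = count_traffic(500, 500)
println(counts)
println("excess factor for y: ", counts.y_loads ÷ 500)   # relative to the minimum of M loads
```

For M = N = 500 the y traffic is 250000 loads and stores against a minimum of 500, which is the excess factor of size(A,2) noted above.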

Register tiling will attack that third point. To investigate what exactly it does:

julia> using LoopVectorization

julia> M = N = 500;

julia> y = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N);

julia> ls = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += A[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#640"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:A,Symbol("##vptr##_A")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(ls)
 ([:i, :j], :j, :i, :i, 1, 8)

The vector [:i, :j] gives the order of the loops from outer-most to inner-most. That is, it placed the i loop as outer-most, while the j loop is inner-most.
The next two symbols give which loops are unrolled by factors corresponding to the integers:
j is unrolled by 1x (i.e., it isn’t), and i by 8x.
Given that j is now the innermost loop, not much sense in unrolling it.
The last symbol refers to which loop is actually vectorized.
Note that this can change as a function of types:

julia> B = A';

julia> lst = LoopVectorization.@avx_debug for j in 1:size(A,2), i in 1:size(A,1)
           y[i] += B[i,j] * x[j]
       end;
OPS = Tuple{:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x02, 0x02),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x03, 0x03),:numericconstant,Symbol("##reductzero#656"),LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x04),:LoopVectorization,:vfmadd_fast,LoopVectorization.OperationStruct(0x0000000000000021, 0x0000000000000001, 0x0000000000000000, 0x0000000000010204, LoopVectorization.compute, 0x00, 0x04),:LoopVectorization,:reduced_add,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000001, 0x0000000000000000, 0x0000000000000503, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:setindex!,LoopVectorization.OperationStruct(0x0000000000000002, 0x0000000000000000, 0x0000000000000000, 0x0000000000000006, LoopVectorization.memstore, 0x03, 0x05)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:B,Symbol("##vptr##_B")}(0x0000000000000101, 0x0000000000000201, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000),LoopVectorization.ArrayRefStruct{:y,Symbol("##vptr##_y")}(0x0000000000000001, 0x0000000000000002, 0x0000000000000000)}
AM = Tuple{0,Tuple{},Tuple{},Tuple{},Tuple{},Tuple{(4, LoopVectorization.IntOrFloat)},Tuple{}}
LPSYM = Tuple{:j,:i}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1},VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.RowMajorStridedPointer{Float64,1}(Ptr{Float64} @0x00005605ec236e00, (400,)), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c620, ()), VectorizationBase.PackedStridedPointer{Float64,0}(Ptr{Float64} @0x00007f624341c240, ()))

julia> LoopVectorization.choose_order(lst) # j is now vectorized instead of `i`
 ([:i, :j], :j, :i, :j, 1, 8)

Anyway, back to the original ([:i, :j], :j, :i, :i, 1, 8) example,
We still load from A for each combination. We were already at the minimum, so no savings there.
As i is now the outer-most loop, we now load and store from y[i] a total of size(A,1) times.

However, we now need to load from x[j] more often: cld(size(A,1),8)*size(A,2) times instead of size(A,2) times. That is still a factor of 8 fewer than size(A,1)*size(A,2), because each load from x[j] is re-used across the 8 unrolled iterations of the i loop. 8x unrolling means we get an 8x reduction in how many times we have to load from this vector!
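
The re-use can be written out by hand. The following is a scalar sketch, not what @avx actually generates: it unrolls the i loop by 4 instead of 8 and uses plain Float accumulators instead of SIMD vectors. The name gemv_unroll4! is made up for illustration.

```julia
using LinearAlgebra

function gemv_unroll4!(y, A, x)
    Base.require_one_based_indexing(y, A, x)
    M, N = size(A)
    (length(y) == M && length(x) == N) ||
        throw(DimensionMismatch("sizes of y, A and x do not agree"))
    i = 1
    # Main loop: four rows at a time, each with its own accumulator.
    @inbounds while i + 3 <= M
        s1 = s2 = s3 = s4 = zero(eltype(y))
        for j in 1:N
            xj = x[j]                  # one load of x[j], re-used for four rows
            s1 += A[i,     j] * xj
            s2 += A[i + 1, j] * xj
            s3 += A[i + 2, j] * xj
            s4 += A[i + 3, j] * xj
        end
        # y is stored once per row instead of once per (i, j) pair.
        y[i] = s1; y[i + 1] = s2; y[i + 2] = s3; y[i + 3] = s4
        i += 4
    end
    # Remainder loop for the last M % 4 rows.
    @inbounds while i <= M
        s = zero(eltype(y))
        for j in 1:N
            s += A[i, j] * x[j]
        end
        y[i] = s
        i += 1
    end
    return y
end
```

The accumulators play the role of the registers: y is written size(A,1) times in total, and in the main loop each x[j] load serves four rows.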

We can see some spectacular performance improvements from this:

julia> function gemvavx1!(y, A, x)
           @avx for j in axes(A,2), i in axes(A,1)
              y[i] += A[i,j] * x[j]
           end
       end
 gemvavx1! (generic function with 1 method)

julia> function gemvavx2!(y, A, x)
           @avx for i in axes(A,1)
               yᵢ = zero(eltype(y))
               for j ∈ axes(A,2)
                   yᵢ += A[i,j] * x[j]
               end
               y[i] = yᵢ
           end
       end
 gemvavx2! (generic function with 1 method)

julia> function gemv_simd!(y, A, x)
           @inbounds for j in axes(A,2); @simd ivdep for i in axes(A,1)
              y[i] += A[i,j] * x[j]
           end; end
       end
 gemv_simd! (generic function with 1 method)

julia> M = N = 56;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     862.015 ns (0.00% GC)
  median time:      867.697 ns (0.00% GC)
  mean time:        867.474 ns (0.00% GC)
  maximum time:     1.537 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     66

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     92.937 ns (0.00% GC)
  median time:      93.351 ns (0.00% GC)
  mean time:        93.472 ns (0.00% GC)
  maximum time:     147.397 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     954

julia> @benchmark gemvavx2!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     76.973 ns (0.00% GC)
  median time:      77.107 ns (0.00% GC)
  mean time:        77.198 ns (0.00% GC)
  maximum time:     130.935 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     969

julia> y1 ≈ y2 ≈ y3
 true

But – only at these tiny sizes.
Your CPU is way faster at computing than it is at moving memory around.
And A * x requires on the order of length(A) operations, and on the same order of memory. So in the end, it’ll perform more or less the same as sum:

julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
 sumavx (generic function with 1 method)

julia> @btime sum($A)
  192.643 ns (0 allocations: 0 bytes)
 1578.41421948359

julia> @btime sumavx($A)
  67.960 ns (0 allocations: 0 bytes)
 1578.41421948359

This is because A * x and sum(A) require accessing roughly the same amount of memory.
And once these arrays stop fitting in our CPU’s caches, memory bandwidth is all that determines performance. Make A 500x500, and gemv_simd!, which was so slow before, is now just as fast as everything else:

julia> M = N = 500;

julia> y1 = Vector{Float64}(undef, M); A = rand(M,N); x = rand(N); y2 = similar(y1); y3 = similar(y1);

julia> @benchmark gemv_simd!(fill!($y1, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.602 μs (0.00% GC)
  median time:      85.719 μs (0.00% GC)
  mean time:        85.867 μs (0.00% GC)
  maximum time:     168.723 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx1!(fill!($y2, 0), $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     82.080 μs (0.00% GC)
  median time:      84.833 μs (0.00% GC)
  mean time:        84.985 μs (0.00% GC)
  maximum time:     257.037 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gemvavx2!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     83.944 μs (0.00% GC)
  median time:      84.750 μs (0.00% GC)
  mean time:        84.895 μs (0.00% GC)
  maximum time:     258.025 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y1 ≈ y2 ≈ y3
 true

julia> @btime sumavx($A)
  83.417 μs (0 allocations: 0 bytes)
 125042.8745667071

julia> @btime sumsimd($A)
  83.430 μs (0 allocations: 0 bytes)
 125042.87456670718
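
A back-of-the-envelope check of the memory-bound claim, using the 500x500 timings above (about 83.4 μs for every variant). The helper names are made up, and the result is only the effective bandwidth implied by these particular timings, not a property of any specific CPU.

```julia
# Bytes occupied by an M x N matrix of element type T.
bytes_A(M, N; T = Float64) = M * N * sizeof(T)

# Effective bandwidth in GB/s if `bytes` are streamed in `seconds`.
bandwidth_GBps(bytes, seconds) = bytes / seconds / 1e9

M = N = 500
b = bytes_A(M, N)                    # 2_000_000 bytes; x and y are negligible next to A
bw = bandwidth_GBps(b, 83.4e-6)      # roughly 24 GB/s
flops = 2 * M * N                    # one multiply and one add per element of A
intensity = flops / b                # 0.25 flop per byte of A

println("A occupies ", b, " bytes")
println("effective bandwidth ≈ ", round(bw, digits = 1), " GB/s")
println("arithmetic intensity = ", intensity, " flop/byte")
```

With only a quarter of a floating-point operation per byte of A, the loop body has very little work to hide memory traffic behind, which is consistent with all three kernels and the plain sum landing on the same time.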

So the best you can do is add multithreading and break up the i loop across threads (you can’t parallelize over j, or you’d be reading and writing to the same entries of y from different threads simultaneously).

function gemv_threads!(y, A, x)
    M = length(y)
    chunks = min(M >> 7, Base.Threads.nthreads())
    chunks > 1 || return gemvavx2!(y, A, x)
    chunk_size = LoopVectorization.VectorizationBase.align(cld(M, chunks), eltype(y))
    Base.Threads.@sync for c in 1:chunks
         Threads.@spawn begin
              b = chunk_size*(c-1) + 1
              e = min(M, chunk_size*c)
              r = b:e
              gemvavx2!(view(y, r), view(A, r, :), x)
         end
    end
    y
end

With this, for 500x500, I get

julia> @benchmark gemv_threads!($y3, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  2.09 KiB
  allocs estimate:  30
  --------------
  minimum time:     10.343 μs (0.00% GC)
  median time:      35.046 μs (0.00% GC)
  mean time:        35.778 μs (0.00% GC)
  maximum time:     164.298 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> using LinearAlgebra; BLAS.set_num_threads(Base.Threads.nthreads());

julia> @benchmark mul!($y2, $A, $x)
 BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.815 μs (0.00% GC)
  median time:      15.596 μs (0.00% GC)
  mean time:        15.964 μs (0.00% GC)
  maximum time:     82.304 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> y2 ≈ y3
 true

Peak performance is similar with OpenBLAS here, but average performance is much worse.

As @stillyslalom said, @avx does its own modeling to make decisions and then generates code with lots of llvmcall.
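
For readers without LoopVectorization installed, the row-chunking idea behind gemv_threads! can be sketched with Base only. This is an illustration under assumptions, not a tuned implementation: gemv_serial! and gemv_chunked! are made-up names, the kernel is a plain loop instead of @avx, and chunk sizes are not aligned to the SIMD width.

```julia
using LinearAlgebra

# Serial kernel: column-major traversal, writes only into its own y.
function gemv_serial!(y, A, x)
    fill!(y, zero(eltype(y)))
    for j in axes(A, 2), i in axes(A, 1)
        y[i] += A[i, j] * x[j]
    end
    return y
end

# Split the rows of A (and the matching entries of y) into chunks, one task per chunk.
# Each task owns a disjoint slice of y, so no two tasks write to the same entry.
function gemv_chunked!(y, A, x; nchunks = Threads.nthreads())
    M = length(y)
    (size(A, 1) == M && size(A, 2) == length(x)) ||
        throw(DimensionMismatch("sizes of y, A and x do not agree"))
    nchunks = max(1, min(nchunks, M))
    chunk_size = cld(M, nchunks)
    @sync for c in 1:nchunks
        r = (chunk_size * (c - 1) + 1):min(M, chunk_size * c)
        isempty(r) && continue
        Threads.@spawn gemv_serial!(view(y, r), view(A, r, :), x)
    end
    return y
end
```

Usage is the same as for mul!: preallocate y and call gemv_chunked!(y, A, x). Whether it beats the serial kernel depends on the matrix size and on how many threads Julia was started with.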

stillyslalom · August 12, 2020, 3:10am

For large matrix sizes (to the point of being memory bandwidth-limited), you can get multithreading pretty much for free via Tullio.jl, which uses threads + @avx to generate its loop kernels:

tmatmul(A, v) = @tullio b[i] := A[i,j]*v[j]

julia> @btime $A * $b
  498.300 μs (1 allocation: 11.88 KiB)

julia> @btime tmatmul($A, $b)
  469.400 μs (257 allocations: 30.81 KiB)

At that point, there’s not much to be gained by further algorithmic twiddling - you’re solely limited by the number & bandwidth of your memory channels.

DNF · August 12, 2020, 4:35am

I said simd, not @simd.

dlakelan · August 12, 2020, 11:22am

Tullio is brilliant, not only do you get threads and avx for free, you get a concise notation that makes your code super clear and way less bug prone. I’m immediately adopting it for whenever I need to write array twiddling!

I’ll test the speed and post a comparison in a few hours. Thanks for this!

Ok, just to see how well it works… here’s the Tullio version:


function matmultull(A,v)
    if size(A,2) != length(v)
        throw(DimensionMismatch("second dimension of A, $(size(A,2)), does not match length of v, $(length(v))"))
    end
    B = similar(v, size(A,1))  # one output entry per row of A; overwritten below
    @tullio B[i] = A[i,j] *v[j]
    return B
end

ZERO loops for that undergrad to get wrong.

How’s the speed? EXACTLY the same as with my hand written undergradish loop code + @avx (and basically the same as built in *)

julia> @benchmark matmul($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  14.80 KiB
  allocs estimate:  24
  --------------
  minimum time:     990.039 μs (0.00% GC)
  median time:      1.084 ms (0.00% GC)
  mean time:        1.102 ms (0.00% GC)
  maximum time:     1.707 ms (0.00% GC)
  --------------
  samples:          4508
  evals/sample:     1

julia> @benchmark matmultull($testA,$testV)
BenchmarkTools.Trial: 
  memory estimate:  18.55 KiB
  allocs estimate:  99
  --------------
  minimum time:     993.795 μs (0.00% GC)
  median time:      1.054 ms (0.00% GC)
  mean time:        1.089 ms (0.00% GC)
  maximum time:     2.290 ms (0.00% GC)
  --------------
  samples:          4562
  evals/sample:     1

julia> @benchmark $testA * $testV
BenchmarkTools.Trial: 
  memory estimate:  11.88 KiB
  allocs estimate:  1
  --------------
  minimum time:     831.787 μs (0.00% GC)
  median time:      924.277 μs (0.00% GC)
  mean time:        1.027 ms (0.00% GC)
  maximum time:     10.739 ms (0.00% GC)
  --------------
  samples:          4811
  evals/sample:     1

Thanks again!
