LoopVectorization should still be faster than StaticArrays at most sizes below 100, if you can avoid memcpy and allocations. Unfortunately, that is currently easier said than done. AmulB below calls memcpy twice, even though both calls are unnecessary. The first copies an SArray to another place on the stack; it should simply forward the existing pointer instead. The second copies stack memory from one place to another; it should just replace all stores through the source pointer with stores through the destination pointer. So it's conceivable that AmulB will match AmulB!'s performance below, but those changes aren't going to happen before Julia 1.11.
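To make that first copy concrete, here is a toy example (separate from the benchmark below, and only an analogy for what LoopVectorization does internally): an SMatrix is immutable and doesn't expose a pointer, so code that wants to read it through a raw pointer first has to copy it into mutable, addressable memory such as an MMatrix.

using StaticArrays

# Toy illustration only: reading an SMatrix through a raw pointer requires
# copying it into mutable, addressable memory first (here an MMatrix).
# That copy is the kind of memcpy discussed above.
function sum_via_pointer(A::SMatrix{M,N,Float64}) where {M,N}
    Am = MMatrix(A)                 # copies the SMatrix into an MMatrix
    GC.@preserve Am begin
        p = pointer(Am)
        s = 0.0
        for i in 1:M*N
            s += unsafe_load(p, i)  # read element i through the raw pointer
        end
        s
    end
end

Atoy = @SMatrix rand(4, 4)
sum_via_pointer(Atoy) ≈ sum(Atoy)   # true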
julia> using StaticArrays, LoopVectorization, BenchmarkTools
julia> @inline function AmulB!(C, A, B)
           @turbo for n ∈ indices((C,B),2), m ∈ indices((C,A),1)
               Cmn = zero(eltype(C))
               for k ∈ indices((A,B),(2,1))
                   Cmn += A[m,k] * B[k,n]
               end
               C[m,n] = Cmn
           end
           return C
       end
AmulB! (generic function with 1 method)
julia> @inline function AmulB(A::SMatrix{M,K,T}, B::SMatrix{K,N,S}) where {M,K,N,S,T}
           SMatrix(AmulB!(MMatrix{M,N,promote_type(T,S)}(undef), A, B))
       end
AmulB (generic function with 1 method)
julia> M=K=N=7; A = @SMatrix(rand(M,K)); B = @SMatrix(rand(K,N));
julia> AmulB(A,B) ≈ A*B
true
julia> @benchmark AmulB($(Ref(A))[], $(Ref(B))[])
BenchmarkTools.Trial: 10000 samples with 986 evaluations.
Range (min … max): 51.460 ns … 234.389 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 53.489 ns ┊ GC (median): 0.00%
Time (mean ± σ): 54.375 ns ± 8.143 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▇█▇▇▆▄▁
▂▂▃▃▅████████▆▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂ ▃
51.5 ns Histogram: frequency by time 64.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark $(Ref(A))[] * $(Ref(B))[]
BenchmarkTools.Trial: 10000 samples with 985 evaluations.
Range (min … max): 46.831 ns … 62.294 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.868 ns ┊ GC (median): 0.00%
Time (mean ± σ): 47.871 ns ± 0.941 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▃ ▄▂▂▆█▂
▃████▅▄▄████████▆▅▄▄▄▃▃▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▂▂▁▁▁▂▁▂▂▂ ▃
46.8 ns Histogram: frequency by time 52.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> Am = MMatrix(A); Bm = MMatrix(B); Cm = MMatrix{M,N,Base.promote_eltype(Am,Bm)}(undef);
julia> @benchmark AmulB!($Cm, $Am, $Bm)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
Range (min … max): 29.005 ns … 605.039 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 29.538 ns ┊ GC (median): 0.00%
Time (mean ± σ): 29.671 ns ± 5.795 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁ ▃▆▂█▂▄▁
▁▁▁▁▁▁▁▁▃▄██▇█▇█▇███████▆█▅▇▅▆▆▃▄▃▄▄▃▃▂▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
29 ns Histogram: frequency by time 30.6 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD EPYC 7513 32-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 64 default, 0 interactive, 32 GC (on 64 virtual cores)
Environment:
  LD_RUN_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  LD_LIBRARY_PATH = /usr/local/lib/x86_64-unknown-linux-gnu/:/usr/local/lib/
  JULIA_PATH = @.
  JULIA_NUM_THREADS = 64
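If you want to see the two memcpy calls for yourself, you can dump the LLVM IR for AmulB after running the session above and look for llvm.memcpy in the output (a quick sanity check, not part of the benchmarks):

# Quick check: after defining AmulB, A, and B as in the session above, the two
# copies discussed at the top should show up as `call void @llvm.memcpy...`
# lines in the generated IR.
using InteractiveUtils   # exports @code_llvm (loaded automatically in the REPL)
@code_llvm debuginfo=:none AmulB(A, B)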