Fast small integer matrix multiplication

Fabrice_Rosay · September 9, 2022, 11:12am

Hi,
I need to multiply a lot of times a vector B of integers of size 64 by a 24x64 integer matrix B.
Currently the fastest way I found is simply A*B, which is very fast if the matrix are converted to Float32 (which I’d rather not as the numbers inside the result are indices to another matrix).
I tried Octavian, writing directly the loop with LoopVectorization and it is way slower (2 times), and it seems that setting the type to Int also incurs a penalty.
I was wondering as sizes are fixed if it would be possible to directly unroll the multiplication using avx instructions.
Any idea if it would be faster, or where to start ?

Current benchmark is:

@btime $A*$B
 115.134 ns (1 allocation: 160 bytes)

@btime mul!($C,$A,$B)
  94.759 ns (0 allocations: 0 bytes)

with Ocatvian

@btime matmul($A,$B)   
  270.896 ns (1 allocation: 160 bytes)

 @btime matmul!($C,$A,$B)
  221.195 ns (0 allocations: 0 bytes)

And if I set the type to say Int32
@btime $A*$B
  678.194 ns (1 allocation: 160 bytes)  6  times slower 

and with Octavian:
@btime matmul($A,$B)
  254.516 ns (1 allocation: 160 bytes)  almost the same time, 
though slightly slower```

stevengj · September 9, 2022, 11:18am

Note that Float32 is exact for integers up to maxintfloat(Float32) == 16777216, so as long as your indices (or intermediate calculations in the matrix–vector multiplication) never get larger than this you won’t have any roundoff errors to worry about.

Fabrice_Rosay · September 9, 2022, 11:34am

It seems to work, but conversion incurs a penalty of something like 15-20 ns. I would rather note use this, if possible( it is inside an alphabeta search so any time saved is precious).

stevengj · September 9, 2022, 11:46am

Yes, an unrolled and optimized kernel for your specific sizes should in principle be faster than any generic code, and you can certainly do this in Julia. The caveat is that optimizing such a beast may requires a lot of time and specialized knowledge.

LoopVectorization.jl and SIMDTypes.jl or SIMD.jl are possible starting points.

Oscar_Smith · September 9, 2022, 11:52am

it doesn’t require anything special. loopvextorization +staticarrays should generate the right code for you.

baggepinnen · September 9, 2022, 11:57am

Is it feasible to stack all B vectors horizontally, i.e.,

allB = [B1 B2 B3 ...]

and perform a single matrix-matrix multiplication

A*allB

?

Fabrice_Rosay · September 9, 2022, 2:13pm

It is indeed possible but requires to slightly change the algorithm. Also not every leaf needs to be evaluated so it needs thorough testing to know if the gain obtained by parallelizing is greater than having to evaluate all leaf.

baggepinnen · September 9, 2022, 2:26pm

Matrix-matrix multiplication is an extremely well-optimized operation, so if a problem can be expressed as one (even if that means that you’ll multiply a few vectors extra), you might very well be better off doing that than doing several matrix-vector multiplications.

Oscar_Smith · September 9, 2022, 2:35pm

Note that you only need batches of roughly 8 at a time to get most of the speedup.

Fabrice_Rosay · September 9, 2022, 2:38pm

Oscar Smith was right. Simple multiplication using loopvectorization is the fastest way. I did try it before but I had forgotten the @turbo macro …, what an idiot. Also to be fair, using StaticArrays is a huge winner. This is already a 6 times improvement over my previous attempt: whole evaluation in 72ns instead of 420ns. Great.
Thank you, long live Ataxx

mikmoore · September 9, 2022, 2:45pm

LoopVectorization.jl is probably your best bet. But StaticArrays.jl can probably get pretty close. Your matrix is too big to be a good candidate for a SMatrix from StaticArrays.jl, but could work as a Vector{SVector}.

StaticArrays can do unrolled native-precision linear algebra so you don’t need to write it yourself. Although to make it work correctly here, we have to reorder the operands in a slightly weird way.

using StaticArrays
m,n = 24,64
A = rand(Int16,m,n)
b = rand(Int16,n)
As = [SVector{m,Int16}(a) for a in eachcol(A)] # copy to Vector{SVector}
bs = SVector{n,Int16}(b) # copy to SVector

using BenchmarkTools
@btime $A * $b
# 267.925 ns (1 allocation: 96 bytes)
@btime $bs' * $As # bizarre phrasing but equivalent to the above
# 50.963 ns (0 allocations: 0 bytes)

Fabrice_Rosay · September 9, 2022, 3:03pm

On my computer for Int32 LoopVectorization is slightly faster, but for Int16 pure static arrays seems faster unless another mistake, by a large margin. Unfortunately I need at least Int32.

Topic		Replies	Views
Why is the matrix multiplication with integer matrices much slower than with float ones? Performance	6	407	June 27, 2023
Matrix Multiplication of a Large Number of Small Matrices Performance linearalgebra , tullio , column-major	12	1302	February 8, 2023
Int numerical calculation speed slower than Float? Performance	12	1761	February 17, 2020
Julia matrix-multiplication performance Performance linearalgebra	20	8666	October 30, 2022
Vector Multiplication on Submatrices Performance speed-optimization	9	535	October 14, 2022

Fast small integer matrix multiplication

Related topics