Fast small integer matrix multiplication

Hi,
I need to multiply a 24x64 integer matrix A by a length-64 integer vector B, many times over.
Currently the fastest way I found is simply A*B, which is very fast if the arrays are converted to Float32 (which I’d rather not do, as the numbers in the result are indices into another matrix).
I tried Octavian and writing the loop directly with LoopVectorization, and both are way slower (about 2 times); it also seems that setting the type to Int incurs a penalty.
Since the sizes are fixed, I was wondering if it would be possible to directly unroll the multiplication using AVX instructions.
Any idea if that would be faster, or where to start?

Current benchmark is:

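For reference, a setup along these lines matches the shapes described above (the random data, and the Float32 element type following the conversion mentioned, are assumptions for illustration):

using LinearAlgebra, BenchmarkTools
A = rand(Float32, 24, 64)  # the 24x64 matrix, converted to Float32
B = rand(Float32, 64)      # the length-64 vector
C = zeros(Float32, 24)     # preallocated output for mul!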
@btime $A*$B
 115.134 ns (1 allocation: 160 bytes)

@btime mul!($C,$A,$B)
  94.759 ns (0 allocations: 0 bytes)

with Octavian

@btime matmul($A,$B)   
  270.896 ns (1 allocation: 160 bytes)

 @btime matmul!($C,$A,$B)
  221.195 ns (0 allocations: 0 bytes)

And if I set the type to, say, Int32:

@btime $A*$B
  678.194 ns (1 allocation: 160 bytes)  # ~6 times slower

and with Octavian:

@btime matmul($A,$B)
  254.516 ns (1 allocation: 160 bytes)  # almost the same time, though slightly slower

Note that Float32 is exact for integers up to maxintfloat(Float32) == 16777216, so as long as your indices (or intermediate calculations in the matrix–vector multiplication) never get larger than this you won’t have any roundoff errors to worry about.
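A quick REPL check of that bound (standard Julia behavior):

julia> maxintfloat(Float32)
1.6777216f7

julia> Float32(16_777_216) == 16_777_216
true

julia> Float32(16_777_217) == 16_777_217  # first integer a Float32 cannot represent
false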

It seems to work, but the conversion incurs a penalty of something like 15-20 ns. I would rather not use this, if possible (it is inside an alpha-beta search, so any time saved is precious).

Yes, an unrolled and optimized kernel for your specific sizes should in principle be faster than any generic code, and you can certainly do this in Julia. The caveat is that optimizing such a beast may require a lot of time and specialized knowledge.

LoopVectorization.jl and SIMDTypes.jl or SIMD.jl are possible starting points.
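To give a flavor of the SIMD.jl route, here is a minimal sketch of a hand-unrolled kernel for the fixed 24x64 case, assuming Int32 elements and column-major storage (the function name and loop structure are illustrative, not a tuned implementation):

using SIMD

function matvec24x64!(C::Vector{Int32}, A::Matrix{Int32}, B::Vector{Int32})
    Av = vec(A)  # flat view of the column-major matrix
    # hold the 24-element result in three 8-lane accumulators
    acc1 = Vec{8,Int32}(0); acc2 = Vec{8,Int32}(0); acc3 = Vec{8,Int32}(0)
    @inbounds for j in 1:64
        b = B[j]
        base = (j - 1) * 24
        acc1 += vload(Vec{8,Int32}, Av, base + 1) * b   # rows 1:8 of column j
        acc2 += vload(Vec{8,Int32}, Av, base + 9) * b   # rows 9:16
        acc3 += vload(Vec{8,Int32}, Av, base + 17) * b  # rows 17:24
    end
    vstore(acc1, C, 1); vstore(acc2, C, 9); vstore(acc3, C, 17)
    return C
end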


It doesn’t require anything special. LoopVectorization + StaticArrays should generate the right code for you.

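For concreteness, the LoopVectorization half of that suggestion is sketched below: a plain @turbo matrix-vector kernel over ordinary arrays, following the pattern from the LoopVectorization docs (the function name turbomul! is made up):

using LoopVectorization

function turbomul!(C, A, B)
    @turbo for i in axes(A, 1)
        acc = zero(eltype(C))
        for j in axes(A, 2)
            acc += A[i, j] * B[j]  # dot product of row i with B
        end
        C[i] = acc
    end
    return C
end

With the fixed 24x64 sizes, @turbo chooses the unrolling and vector width itself.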

Is it feasible to stack all B vectors horizontally, i.e.,

allB = [B1 B2 B3 ...]

and perform a single matrix-matrix multiplication

A*allB

?

It is indeed possible, but it requires slightly changing the algorithm. Also, not every leaf needs to be evaluated, so it needs thorough testing to know whether the gain obtained by batching the multiplications outweighs having to evaluate all leaves.

Matrix-matrix multiplication is an extremely well-optimized operation, so if a problem can be expressed as one (even if that means that you’ll multiply a few vectors extra), you might very well be better off doing that than doing several matrix-vector multiplications.

Note that you only need batches of roughly 8 at a time to get most of the speedup.
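As a minimal sketch of that batching idea (Bs here is a hypothetical Vector collecting the B vectors gathered at one point of the search):

allB = reduce(hcat, Bs)  # 64×k matrix, one column per B vector
allC = A * allB          # 24×k matrix: column i equals A * Bs[i]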

Oscar Smith was right: simple multiplication using LoopVectorization is the fastest way. I did try it before, but I had forgotten the @turbo macro, what an idiot. Also, to be fair, using StaticArrays is a huge win. This is already a 6x improvement over my previous attempt: the whole evaluation runs in 72 ns instead of 420 ns. Great.
Thank you, long live Ataxx

LoopVectorization.jl is probably your best bet, but StaticArrays.jl can probably get pretty close. Your matrix is too big to be a good candidate for an SMatrix from StaticArrays.jl, but it could work as a Vector{SVector}.

StaticArrays can do unrolled native-precision linear algebra, so you don’t need to write it yourself, although to make it work correctly here we have to reorder the operands in a slightly weird way.

using StaticArrays
m,n = 24,64
A = rand(Int16,m,n)
b = rand(Int16,n)
As = [SVector{m,Int16}(a) for a in eachcol(A)] # copy to Vector{SVector}
bs = SVector{n,Int16}(b) # copy to SVector

using BenchmarkTools
@btime $A * $b
# 267.925 ns (1 allocation: 96 bytes)
@btime $bs' * $As # bizarre phrasing, but computes the same A*b as a linear combination of A's columns
# 50.963 ns (0 allocations: 0 bytes)

On my computer, LoopVectorization is slightly faster for Int32, but for Int16 pure StaticArrays seems faster by a large margin (unless I’ve made another mistake). Unfortunately I need at least Int32.