BLAS dot doing weird things depending on size

I’m working on a hand-written matrix multiplication.

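using LinearAlgebra  # provides dot

# Naive multiplication: C[i,j] is the dot product of row i of A and column j of B.
function slowMult!(A::AbstractMatrix{T},B::AbstractMatrix{T}) where T<:Number
    C = Matrix{T}(undef,size(A,1),size(B,2))
    for i in 1:size(A,1)
        for j in 1:size(B,2)
            # @views avoids copying the slices; A[i,:] is a strided row view
            @inbounds @views C[i,j] = dot(A[i,:], B[:,j])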
        end
    end
    return C
end

This is obviously not going to be the fastest approach, but what’s really weird is that calling this function on two 512×512 matrices is about 2x slower than calling it on two 514×514 matrices. The timing code is

using BenchmarkTools  # provides @btime

T,N = Int,512
A = rand(T,N,N)
B = rand(T,N,N)
slowMult!(rand(T,1,1),rand(T,1,1))  # warm-up call to trigger compilation
C = @btime slowMult!($A,$B)

Could be some alignment issue.

Incidentally, the first thing I would fix is memory access order:

https://docs.julialang.org/en/v1/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1
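
To make the memory-order point concrete: Julia arrays are column-major, so A[i,:] in the original walks along a row with a stride of size(A,1) elements. Below is a minimal sketch of a loop order where the innermost loop runs down a column instead. The name colMult is made up for illustration, and I have restricted it to plain Matrix arguments so that @inbounds is safe; I have not benchmarked it against the original.

```julia
# Hypothetical helper (not from the original post): same result as A*B,
# but the innermost loop walks down columns of C and A, which are
# contiguous in Julia's column-major layout.
function colMult(A::Matrix{T}, B::Matrix{T}) where T<:Number
    size(A, 2) == size(B, 1) || throw(DimensionMismatch("inner dimensions must match"))
    C = zeros(T, size(A, 1), size(B, 2))
    @inbounds for j in 1:size(B, 2)
        for k in 1:size(A, 2)
            b = B[k, j]                 # reused for the whole inner loop
            for i in 1:size(A, 1)
                C[i, j] += A[i, k] * b  # unit-stride access in both C and A
            end
        end
    end
    return C
end
```

It can be timed the same way as the original, e.g. with @btime colMult($A,$B).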

Not sure this is the case here, but looping over multiple arrays whose lengths are a power of two can have bad effects on the cache:

https://stackoverflow.com/questions/7905760/matrix-multiplication-small-difference-in-matrix-size-large-difference-in-timi
http://scribblethink.org/Computer/cachekiller.html
https://stackoverflow.com/questions/11868087/avoiding-powers-of-2-for-cache-friendliness
https://stackoverflow.com/questions/8547778/why-are-elementwise-additions-much-faster-in-separate-loops-than-in-a-combined-l
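
A rough way to see why a power-of-two size could hurt here: the elements of a row A[i,:] are N*8 bytes apart for an N×N matrix of Int. The sketch below counts how many distinct cache sets one row touches. The cache geometry (64-byte lines, 64 sets, as in a common 32 KiB 8-way L1 data cache) is an assumption, not something measured on the original poster's machine, and sets_touched is a made-up name.

```julia
# Hypothetical helper: number of distinct cache sets touched when reading
# one row of an N×N column-major matrix. Default parameters are assumed
# values for a typical L1 data cache, not measurements.
function sets_touched(N; elsize=8, linesize=64, nsets=64)
    stride = N * elsize                  # bytes between consecutive row elements
    sets = Set{Int}()
    for k in 0:N-1
        addr = k * stride                # byte offset of the k-th element in the row
        push!(sets, (addr ÷ linesize) % nsets)
    end
    return length(sets)
end

sets_touched(512)  # 1: every element of the row maps to the same set
sets_touched(514)  # 64: the row is spread over all sets
```

Under these assumptions a 512-element row competes for the few lines of a single set, while a 514-element row spreads across the whole cache, which would be consistent with the timing difference, though it does not prove that this is the cause.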

Thanks so much! Now that I know about it, it should be easy enough to fix.