BLAS dot doing weird things depending on size

I’m working on some hand-written matrix multiplication.

using LinearAlgebra

function slowMult!(A::AbstractMatrix{T},B::AbstractMatrix{T}) where T<:Number
    C = Matrix{T}(undef,size(A,1),size(B,2))
    for i in 1:size(A,1)
        for j in 1:size(B,2)
            @inbounds @views C[i,j] = dot(A[i,:], B[:,j])  # row i of A with column j of B
        end
    end
    return C
end

This is obviously not going to be the fastest approach, but what’s really weird is that calling this function on two 512×512 matrices is about 2x slower than on two 514×514 matrices. The timing code is

using BenchmarkTools

# T is the element type and N the matrix size (512 or 514 above)
A = rand(T,N,N)
B = rand(T,N,N)
C = @btime slowMult!($A,$B)

Could be some alignment issue.

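Incidentally, the first thing I would fix is memory access order: Julia arrays are stored column-major, so A[i,:] is a strided read across a row. See "Access arrays in memory order, along columns" in the Julia performance tips.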
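Here is a rough sketch of what that fix could look like, assuming it is acceptable to make one transposed copy of A up front. The name colMult is made up for illustration and is not from the original post. After At = permutedims(A), row i of A is column i of At, so both arguments to dot are contiguous in memory.

```julia
using LinearAlgebra

# Hypothetical variant of slowMult!: copy the transpose of A once so that every
# dot product reads two contiguous columns instead of one strided row.
function colMult(A::AbstractMatrix{T}, B::AbstractMatrix{T}) where T<:Number
    At = permutedims(A)                       # At[:, i] == A[i, :]
    C = Matrix{T}(undef, size(A, 1), size(B, 2))
    for j in 1:size(B, 2)                     # outer loop over columns of C
        for i in 1:size(A, 1)                 # inner loop walks down column j
            @views C[i, j] = dot(At[:, i], B[:, j])
        end
    end
    return C
end
```

It can be timed the same way as the original, e.g. @btime colMult($A,$B). Whether this also removes the 512 vs 514 gap is something to measure rather than assume.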

Not sure this is the case here, but looping over multiple arrays whose lengths are a power of two can have bad effects on the cache: addresses that are a power-of-two stride apart tend to map to the same cache sets and evict one another.
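To make that concrete, here is a back-of-the-envelope sketch. The cache geometry is an assumption (64-byte lines and 64 sets, which varies by CPU), as are 8-byte elements and a matrix that starts at address 0. It counts how many distinct cache sets one row A[i,:] of an N×N column-major matrix touches.

```julia
# Assumed cache geometry (hypothetical; check your own CPU).
const LINE_BYTES = 64
const NUM_SETS = 64

# Number of distinct cache sets hit when reading one row of an N×N
# column-major matrix with elements of `elsize` bytes.
function sets_touched(N; elsize = 8)
    stride = N * elsize                        # bytes between A[i,k] and A[i,k+1]
    sets = Set((k * stride ÷ LINE_BYTES) % NUM_SETS for k in 0:N-1)
    return length(sets)
end

sets_touched(512), sets_touched(514)           # (1, 64) under these assumptions
```

Under these assumptions every element of a 512-wide row lands in the same set, while a 514-wide row spreads over all of them. That would fit the slowdown described above, but it is an illustration of the hypothesis, not a measurement.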


Thanks so much! Now that I know about it, it should be easy enough to fix.