I think that for floats, BLAS is used, while for integers it is native Julia code. The latter could probably be made faster, but it is not a common use case so it is waiting for someone to do it.
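If you want to see the gap on your own machine, here is a minimal benchmark sketch (using BenchmarkTools; the sizes are arbitrary and the ratio will vary with CPU and BLAS build):

```julia
using BenchmarkTools, LinearAlgebra

n = 512
Af, Bf = rand(n, n), rand(n, n)              # Float64 matrices: * goes through BLAS (gemm)
Ai, Bi = rand(1:10, n, n), rand(1:10, n, n)  # Int matrices: * hits the generic Julia fallback

@btime $Af * $Bf;   # BLAS path
@btime $Ai * $Bi;   # generic fallback, typically several times slower
```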
Do you think it would be worth it for mixed-type matmul to convert the arguments before multiplying? I think that should speed things up a lot (with some memory downsides). The other big thing the fallback needs is better cache-aware looping.
No, I would not convert. First, integers in Julia have specific overflow semantics that differ from floats, so I am not sure what is intended and what isn’t.
Second (and more importantly), you really have to go out of your way to get a matrix with a non-concrete element type when writing idiomatic code, so I am not sure it is a common use case. I would leave it up to the user to promote if that is needed.
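For completeness, if someone does want the BLAS speed for an integer problem, the promotion is easy to do explicitly. A sketch (my own illustration, not anything the fallback does for you):

```julia
A = rand(1:10, 256, 256)   # Matrix{Int}
B = rand(1:10, 256, 256)

# Explicit promotion: pay for two Float64 copies, get the BLAS kernel in exchange.
C = float(A) * float(B)

# Round back if an integer result is wanted. Roughly speaking this is exact only
# while the values involved stay below 2^53 in magnitude; beyond that the float
# path and the wrap-around Int fallback can legitimately disagree, which is the
# overflow-semantics point above.
Cint = round.(Int, C)
```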
So Int is about 3.5x slower than Float for me, while it is 5.3x slower for you.
With integers, it uses the vpmullq instruction for integer multiplication. But this instruction appears to be slow, with a reciprocal throughput of around 1.5-3, while the vpaddq instruction is around 0.33 or 0.5.
The floating-point versions use fused multiply-add instructions, which combine the multiplication and the addition and have a reciprocal throughput of about 0.5.
You can think of “reciprocal throughput” as the number of clock cycles per completed instruction when a core is executing many of them simultaneously. Any single instruction generally takes quite a few more cycles to complete (e.g., 4 for the fma instructions), but because a core can work on many at once, the rate at which they are completed can be much higher.
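Just as a back-of-envelope check (my arithmetic, not a measurement): if the integer inner loop is limited by vpmullq and the float loop by fma, the quoted throughput numbers already roughly predict the slowdown:

```julia
# Reciprocal throughputs quoted above (cycles per instruction, sustained)
rt_vpmullq = (1.5, 3.0)   # integer multiply
rt_fma     = 0.5          # float fused multiply-add

# Crude throughput-limited estimate, ignoring vpaddq, loads, etc.
slowdown = rt_vpmullq ./ rt_fma
# => (3.0, 6.0), which brackets the observed 3.5x and 5.3x reasonably well
```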
Realize that implementing a highly optimized matrix–matrix multiplication is nontrivial. Optimized BLAS libraries typically involve tens of thousands of lines of code and painstaking performance tuning. While there is no theoretical reason why this cannot be replicated in Julia, it is a huge undertaking.
Thanks, interesting. And is the reason for this difference in the first place perhaps just that there’s more demand for vectorised floating point, which justifies spending a lot of silicon on it?