May I summarize what we learnt here?
- The original post had Python doing `a*b`, which is not a matrix multiplication, so the comparison was not correct. That was changed to `a@b`, and the reported performance difference is much smaller (~30%), not "twice".
- The remaining difference in performance is somewhat system dependent. If I copy/paste the original codes now, I get the same time for both on my machine (~950 ms). Still, the benchmarks oscillate a bit, because they call BLAS with multi-threading in the background, and other programs can compete for processor usage. There may also be an issue concerning the number of threads launched by the BLAS routine.
- That holds even considering that in Python the line `tmp2[...] = t@tmp1` does not allocate a new array, while the line `tmp2[...] = t*tmp1` does allocate a new array in Julia.
- Avoiding that specific allocation in Julia requires a more verbose syntax, `mul!(@view(tmp2[...]), t, tmp1)`. It might improve performance slightly (maybe 10%), but the timings vary for the same reasons as above.
- Avoiding other allocations can readily make the Julia code run 2x faster than the original one. That can probably be done in Python as well.
- More advanced modifications and a 32-bit representation of the matrices can make the code 50x faster than the original one (Elrod's batch version), but that is advanced indeed.
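To make the first point concrete, here is a minimal NumPy sketch (the small matrices are made up for illustration) showing that `*` is elementwise while `@` is the matrix product:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = a * b  # multiplies entry by entry (Hadamard product)
matmul = a @ b       # true matrix product (rows times columns)

print(elementwise)  # [[ 5. 12.]
                    #  [21. 32.]]
print(matmul)       # [[19. 22.]
                    #  [43. 50.]]
```

In Julia, by contrast, `a*b` already *is* the matrix product (elementwise is `a .* b`), which is how the two benchmarks ended up computing different things.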
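On the allocation point, a minimal NumPy sketch of the two assignment styles (array names follow the discussion, but shapes are made up here): `tmp2[...] = t @ tmp1` still builds a temporary for the product before copying it into the slice, whereas passing the slice as `out=` writes the result directly into the preallocated buffer, which is the closest Python analogue of Julia's `mul!(@view(tmp2[...]), t, tmp1)`:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.random((4, 4))
tmp1 = rng.random((4, 4))
tmp2 = np.zeros((2, 4, 4))  # preallocated output stack (illustrative shape)

# Allocating form: t @ tmp1 creates a temporary array,
# whose contents are then copied into the existing slice.
tmp2[0] = t @ tmp1

# Non-allocating form: BLAS writes the product straight into the buffer.
np.matmul(t, tmp1, out=tmp2[1])
```

Both slices end up with the same values; only the intermediate allocation differs, which matters mostly when the line sits inside a hot loop.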
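As a rough illustration of why the 32-bit representation helps (a sketch, not Elrod's actual batch version): `float32` halves the memory per element, so the same matrices move half as many bytes through cache and memory, and SIMD units process twice as many elements per instruction:

```python
import numpy as np

rng = np.random.default_rng(1)
a64 = rng.random((256, 256))          # default float64
a32 = a64.astype(np.float32)          # same values, half the bytes

print(a64.nbytes)  # 524288
print(a32.nbytes)  # 262144
```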
Finally, I congratulate all for the very pleasant and civilized conversation! 