May I summarize what we learnt here?
- The original post had the Python code doing `a*b`, which is not a matrix multiplication, so the comparison was not correct. That was changed to `a@b`, and the difference in performance reported is much smaller (~30%), not "twice".
- The remaining difference in performance is somewhat system dependent. If I copy/paste the original codes now I get, on my machine, the same time for both of them (`~950 ms`). Yet the benchmarks oscillate a bit, because they call BLAS with multi-threading in the background, and other programs can compete for processor time. There may also be an issue concerning the number of threads launched by the BLAS routine.
- Part of the difference comes from the fact that in Python the line `tmp2[...] = t@tmp1` is not allocating a new array, while the corresponding line in Julia, `tmp2[...] = t*tmp1`, is allocating a new array.
- Avoiding that specific allocation in Julia requires a more verbose syntax, `mul!(@view(tmp2[...]), t, tmp1)`. It might improve the performance slightly (10% maybe), but the timings vary for the same reasons above.
- Avoiding the other allocations can readily make the Julia code run 2x faster than the original one. That can probably be done in Python as well.
- More advanced modifications and a 32-bit representation of the matrices can make the code 50x faster than the original one (Elrod's `batch` version), but that is advanced indeed.
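To make the first point concrete: in NumPy, `*` is elementwise (Hadamard) multiplication while `@` is the matrix product, which is what Julia's `a*b` computes for matrices. A minimal sketch with made-up 2x2 inputs:

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

elementwise = a * b  # Hadamard product: multiplies entry by entry
matmul = a @ b       # true matrix product, the analogue of Julia's a*b

print(elementwise)  # [[ 5. 12.], [21. 32.]]
print(matmul)       # [[19. 22.], [43. 50.]]
```

So the original benchmark was timing two different operations, and the elementwise one is much cheaper (O(n^2) instead of O(n^3) for n x n matrices), which is where the spurious "twice as fast" came from.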
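On the thread-count point: one way to make such benchmarks less noisy is to pin the BLAS thread pool before NumPy loads its BLAS. A sketch, assuming an OpenBLAS or OpenMP-threaded backend (the relevant variable depends on which BLAS NumPy was built against; the `threadpoolctl` package can do this more robustly at runtime):

```python
import os

# These must be set before NumPy (and hence its BLAS) is first imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # assumption: OpenBLAS backend
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-threaded backends (e.g. MKL)

import numpy as np

# BLAS now runs single-threaded, removing one source of timing oscillation.
c = np.ones((4, 4)) @ np.ones((4, 4))
print(c[0, 0])  # 4.0
```

The same pinning should be done on the Julia side (e.g. its own BLAS thread setting) so both benchmarks use the same number of threads.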
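On the allocation point: in NumPy, `tmp2[...] = t @ tmp1` writes the result into the existing `tmp2` buffer rather than binding `tmp2` to a freshly allocated array (the matmul itself still produces a temporary, which the `out=` keyword of `np.matmul` can avoid, paralleling Julia's `mul!`). A sketch with small random matrices standing in for the benchmark's arrays:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.random((3, 3))
tmp1 = rng.random((3, 3))
tmp2 = np.empty((3, 3))

buf = tmp2            # keep a reference to the original buffer
tmp2[...] = t @ tmp1  # fills the existing array in place
print(tmp2 is buf)    # True: tmp2 was not rebound to a new array

# The temporary created by t @ tmp1 can also be avoided entirely:
np.matmul(t, tmp1, out=tmp2)  # writes the product directly into tmp2
```

This is the Python-side analogue of the more verbose `mul!(@view(tmp2[...]), t, tmp1)` discussed above for Julia.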
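On the last point: part of the advanced speedup comes from switching to a 32-bit representation, which halves memory traffic and doubles the number of elements per SIMD register (the full 50x figure also involves the other restructurings from the thread). A sketch of what the representation change looks like in NumPy:

```python
import numpy as np

a64 = np.zeros((200, 200))          # float64, NumPy's default
a32 = a64.astype(np.float32)        # 32-bit representation

print(a64.nbytes, a32.nbytes)       # the float32 array takes half the bytes
```

Whether the reduced precision is acceptable depends, of course, on the numerical requirements of the computation.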
Finally, I congratulate all for the very pleasant and civilized conversation!