Why is this simple function twice as slow as its Python version

May I summarize what we learnt here?

  1. The original post had Python doing `a*b`, which is an elementwise product, not a matrix multiplication, so the comparison was not apples to apples. After changing it to `a@b`, the reported performance difference is much smaller (about 30%), not “twice”.
  2. The remaining difference in performance is somewhat system dependent. If I copy/paste the original code now I get, on my machine, the same time for both versions (~950 ms). Yet the benchmarks oscillate a bit, because they call BLAS with multi-threading in the background, and other programs may compete for processor time. There may also be an issue with the number of threads launched by the BLAS routine.
  3. That is despite the fact that in Python the line `tmp2[...] = t@tmp1` does not allocate a new array, while in Julia the line `tmp2[...] = t*tmp1` allocates a new array for the product before copying it into `tmp2`.
  4. Solving that specific allocation in Julia requires a more verbose syntax: `mul!(@view(tmp2[...]), t, tmp1)`. It may improve performance slightly (maybe 10%), but the timings vary for the same reasons above.
  5. Avoiding other allocations readily makes the Julia code run 2x faster than the original one. That can probably be done in Python as well.
  6. More advanced modifications and a 32-bit representation of the matrices can make the code 50x faster than the original one (Elrod’s batch version), but that is advanced indeed.
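To illustrate point 1, here is a minimal NumPy sketch of the difference between the two operators (the array names and values are made up for illustration, not taken from the original benchmark):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([[5.0, 6.0],
              [7.0, 8.0]])

ew = a * b   # elementwise (Hadamard) product: multiplies matching entries
mm = a @ b   # true matrix multiplication: rows of a dotted with columns of b

# ew == [[ 5, 12], [21, 32]]
# mm == [[19, 22], [43, 50]]
```

The `*` version never touches BLAS, which is why the original comparison said little about matrix-multiplication performance.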
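The `mul!(@view(tmp2[...]), t, tmp1)` pattern from point 4 has a NumPy analogue via the `out=` argument of `np.matmul`, which writes the product directly into a view instead of allocating a result and copying. A minimal sketch, with hypothetical shapes since the slice in the original thread is elided:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.random((4, 4))
tmp1 = rng.random((4, 4))
tmp2 = np.zeros((4, 4, 3))

# Write t @ tmp1 directly into a slice of tmp2, analogous to
# Julia's mul!(@view(tmp2[...]), t, tmp1): no intermediate result array
# is assigned to a name; NumPy fills the view in place.
np.matmul(t, tmp1, out=tmp2[:, :, 0])
```

The other slices of `tmp2` are untouched, so this is a drop-in way to fill a preallocated buffer slice by slice.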

Finally, I congratulate all for the very pleasant and civilized conversation! :slight_smile:
