Slow matrix multiplication in Julia compared to Python numpy

Unless I’m mistaken, x*y is elementwise multiplication for NumPy arrays in Python, not matrix multiplication, so the two languages weren’t being compared on the same operation. Also, timing things in global scope in Julia is not a good way to get an accurate estimate of realistic runtimes. The right way is to use BenchmarkTools, interpolating the global arguments with $:

julia> using BenchmarkTools

julia> a = rand(1000, 1000);

julia> @benchmark $a .* $a # elementwise
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.545 ms (0.00% GC)
  median time:      3.824 ms (0.00% GC)
  mean time:        4.986 ms (26.48% GC)
  maximum time:     78.045 ms (95.63% GC)
  --------------
  samples:          999
  evals/sample:     1

julia> @benchmark $a * $a # matmul
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     51.575 ms (0.00% GC)
  median time:      57.945 ms (0.00% GC)
  mean time:        60.352 ms (3.93% GC)
  maximum time:     142.708 ms (58.50% GC)
  --------------
  samples:          83
  evals/sample:     1
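Equivalently, one can wrap the work in functions so Julia compiles code specialized for the argument types; here is a minimal sketch of the same measurements (the names elementwise and matmul are just illustrative):

using BenchmarkTools

# Putting the work in a function avoids the untyped-global penalty;
# $-interpolation in @btime does the same for the benchmark itself.
elementwise(x) = x .* x
matmul(x) = x * x

a = rand(1000, 1000)
@btime elementwise($a);
@btime matmul($a);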

On the same system, this is what I see for the same operations in Python:

>>> import numpy
>>> import cProfile
>>> a = numpy.random.random((1000, 1000))
>>> cProfile.run("a * a")
         2 function calls in 0.010 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    0.010    0.010 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run("a.dot(a)")
         3 function calls in 0.055 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.055    0.055 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.054    0.054    0.054    0.054 {method 'dot' of 'numpy.ndarray' objects}
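Note that cProfile runs the statement once and adds profiling overhead, so for a closer analogue of BenchmarkTools’ repeated sampling one could use timeit; a sketch, assuming the same array a:

import timeit

import numpy

a = numpy.random.random((1000, 1000))

# Run each statement 10 times per trial, 5 trials, and report the
# best per-run time, roughly mirroring BenchmarkTools' minimum.
for stmt in ("a * a", "a.dot(a)"):
    best = min(timeit.repeat(stmt, globals=globals(), number=10, repeat=5)) / 10
    print(stmt, best)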

So matrix multiplication seems to take about the same time in Julia and NumPy (both typically dispatch to an optimized BLAS), whereas Julia is significantly faster (about 4x) at elementwise multiplication. I’m not sure why; perhaps the operation is cheap enough that the overhead of calling out to C is significant in Python (and doesn’t exist in Julia). Edit: no, I tried a 10,000 x 10,000 matrix and the relative performance didn’t improve much: 3x instead of 4x.
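To probe the overhead hypothesis a bit further, one could also take allocation of the temporary result out of the picture by writing into a preallocated buffer; a sketch I haven’t run here:

using BenchmarkTools

a = rand(1000, 1000)
out = similar(a)   # preallocated destination

# Fused in-place broadcast: no temporary array, no GC pressure,
# so any remaining gap is compute or call overhead.
@benchmark $out .= $a .* $a

The NumPy counterpart would be numpy.multiply(a, a, out=out).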
