Unless I’m mistaken, x*y is elementwise multiplication for NumPy arrays in Python. Timing things in global scope in Julia is not a good way to get an accurate estimate of realistic runtimes. The right way is to use BenchmarkTools, interpolating the arrays with $ so you measure the operation rather than global-variable access:
julia> using BenchmarkTools

julia> a = rand(1000, 1000);

julia> @benchmark $a .* $a   # elementwise
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.545 ms (0.00% GC)
  median time:      3.824 ms (0.00% GC)
  mean time:        4.986 ms (26.48% GC)
  maximum time:     78.045 ms (95.63% GC)
  --------------
  samples:          999
  evals/sample:     1

julia> @benchmark $a * $a    # matmul
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     51.575 ms (0.00% GC)
  median time:      57.945 ms (0.00% GC)
  mean time:        60.352 ms (3.93% GC)
  maximum time:     142.708 ms (58.50% GC)
  --------------
  samples:          83
  evals/sample:     1
On the same system, this is what I see for the same operations in Python:
>>> import numpy
>>> import cProfile
>>> a = numpy.random.random((1000, 1000))
>>> cProfile.run("a * a")
         2 function calls in 0.010 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    0.010    0.010 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

>>> cProfile.run("a.dot(a)")
         3 function calls in 0.055 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.055    0.055 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.054    0.054    0.054    0.054 {method 'dot' of 'numpy.ndarray' objects}
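(Side note: cProfile adds some tracing overhead of its own, so these numbers are slightly pessimistic. For a microbenchmark like this, Python’s timeit module is closer in spirit to BenchmarkTools. A minimal sketch, with the same arrays as above; the repeat/number counts are arbitrary choices of mine:

import timeit
import numpy

a = numpy.random.random((1000, 1000))

# time each operation several times and take the best run, roughly
# analogous to reading the minimum time off a BenchmarkTools.Trial
elementwise = min(timeit.repeat(lambda: a * a, repeat=5, number=10)) / 10
matmul = min(timeit.repeat(lambda: a.dot(a), repeat=5, number=10)) / 10
print("elementwise: %.3f ms" % (elementwise * 1e3))
print("matmul:      %.3f ms" % (matmul * 1e3))

The conclusions below don’t change either way; cProfile’s overhead is small relative to these millisecond-scale operations.)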
So matrix multiplication seems to be about the same speed in Julia and NumPy (presumably both end up in the same kind of optimized BLAS routine), whereas Julia is significantly faster (4x) at elementwise multiplication. I’m not sure why; perhaps the operation is cheap enough that the overhead of calling out to C is significant in Python (and doesn’t exist in Julia). Edit: no, I tried a 10,000^2 matrix and the relative performance didn’t improve much: 3x instead of 4x. A fixed per-call overhead should be negligible at that size, since the O(n^2) work dominates, so the gap must come from the elementwise loops themselves.
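If you want to poke at the overhead hypothesis yourself on the NumPy side, here is one way to scan sizes (a rough sketch; the sizes and repeat counts are arbitrary, and the largest case needs roughly 800 MB per array):

import timeit
import numpy

# if a fixed per-call overhead explained the gap, the per-element cost
# should drop as n grows; per the edit above, the Julia/NumPy ratio only
# went from ~4x to ~3x at 10,000^2, so it mostly doesn't
for n in (1000, 3000, 10000):
    a = numpy.random.random((n, n))
    best = min(timeit.repeat(lambda: a * a, repeat=3, number=3)) / 3
    print("n=%5d  %.3f ms  (%.3f ns/element)" % (n, best * 1e3, best / n**2 * 1e9))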