Slow matrix multiplication in Julia compared to Python numpy

question

#1

Hello.
I am starting to use Julia. I like its syntax, its simplicity, and its claimed performance.
I used Matlab/Octave a lot at university, and now at work we use Octave for licensing reasons. However, Octave is really slow, and that is how I came across Julia.
I have tried a simple matrix multiplication (A*B, not the elementwise one), but it is really slow.
Octave is faster, and so is Python with NumPy.
I am using Julia version 0.6.2 on an Ubuntu notebook for the test, but the same problem occurs on Windows 10.
The Python code below takes 0.006 seconds:

import numpy
import cProfile

n = 1000
x = numpy.random.random((n, n))
y = numpy.random.random((n, n))
cProfile.run("x*y")

Julia code takes 0.1 seconds.

n = 1000;
a = rand(n,n);
@time a*a;

I don’t understand why it is so slow.
Thank you


#2

How many threads is NumPy using?
How many threads is Julia (or the BLAS linked to Julia) using?


#3

Did you run it twice?

@time a*a
@time a*a

The first run compiles.
This shouldn’t make much of a difference here, because it’s just calling an external (already compiled) BLAS routine, and the wrapper should already be compiled in your Julia system image.

So, which BLAS is each of them linked to? If you have an Intel processor, you can build Julia with MKL.

Finally, x*y in Python is elementwise, so of course it is much faster. Try dot or matmul.
Edit:
https://docs.scipy.org/doc/numpy-1.14.2/user/quickstart.html#basic-operations
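To make the distinction concrete, here is a minimal NumPy check with a small 2×2 array (values chosen arbitrarily for illustration):

```python
import numpy as np

a = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Elementwise product: what `a * a` does for ndarrays
elementwise = a * a   # [[1, 4], [9, 16]]

# True matrix product: use @, dot, or matmul
matmul = a @ a        # [[7, 10], [15, 22]]

print(elementwise)
print(matmul)
```

The two results are completely different, so any timing comparison must be sure it is running the same operation on both sides.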


#4

Unless I’m mistaken, x*y is elementwise multiplication in Python. Timing things in global scope in Julia is not a good way to get an accurate estimate of realistic runtimes. The right way to do it is this:

julia> using BenchmarkTools

julia> a = rand(1000, 1000);

julia> @benchmark $a .* $a # elementwise
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.545 ms (0.00% GC)
  median time:      3.824 ms (0.00% GC)
  mean time:        4.986 ms (26.48% GC)
  maximum time:     78.045 ms (95.63% GC)
  --------------
  samples:          999
  evals/sample:     1

julia> @benchmark $a * $a # matmul
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     51.575 ms (0.00% GC)
  median time:      57.945 ms (0.00% GC)
  mean time:        60.352 ms (3.93% GC)
  maximum time:     142.708 ms (58.50% GC)
  --------------
  samples:          83
  evals/sample:     1

On the same system, this is what I see for the same operations in Python:

>>> import numpy
>>> import cProfile
>>> a = numpy.random.random((1000, 1000))
>>> cProfile.run("a * a")
         2 function calls in 0.010 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.010    0.010    0.010    0.010 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run("a.dot(a)")
         3 function calls in 0.055 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.055    0.055 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.054    0.054    0.054    0.054 {method 'dot' of 'numpy.ndarray' objects}

So matrix multiplication seems to be the same in Julia and NumPy whereas Julia is significantly faster (4x) at elementwise multiplication. I’m not sure why, perhaps the operation is cheap enough that the overhead of calling out to C is significant in Python (and doesn’t exist in Julia). Edit: no, I tried a 10,000^2 matrix and the relative performance didn’t improve much—3x instead of 4x.


#5

I don’t know how to check that, i.e. how to read the number of threads.

I tried ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ()) in Julia and it returns 4.
In NumPy I don’t know how to read it.
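For NumPy there is no single official API for this; a common approach is to inspect which BLAS it was built against and the threading environment variables that OpenBLAS/MKL honor (a sketch; the exact output depends on your NumPy build):

```python
import os
import numpy as np

# Shows which BLAS/LAPACK NumPy was built against (OpenBLAS, MKL, ...)
np.show_config()

# OpenBLAS and MKL read their thread counts from environment variables,
# which must be set before the library is first loaded:
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

If none of these variables are set, the BLAS typically defaults to using all available cores.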


#6

I have tried with a 10000 x 10000 matrix.
On my PC:
NumPy: 0.673 seconds
Julia: too long; I am still waiting.


#7

Exactly what did you try? Did you get past the confusion between matrix and elementwise multiplication?


#8

On my machine, both Python and Julia are using eight threads. I find that both elementwise multiplication and matrix multiplication take the same amount of time in the two languages.

In [1]: import numpy

In [2]: import cProfile

In [3]: n = 10000;

In [4]: x=numpy.random.random((n,n))
   ...: y=numpy.random.random((n,n))
   ...: 

In [5]: cProfile.run("x*y")
         2 function calls in 0.391 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.391    0.391    0.391    0.391 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}



In [6]: cProfile.run("x.dot(y)")
         3 function calls in 22.306 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.056    0.056   22.306   22.306 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1   22.250   22.250   22.250   22.250 {method 'dot' of 'numpy.ndarray' objects}
julia> n  = 10000;

julia> x = rand(n,n);

julia> y = rand(n,n);

julia> x .* y;

julia> @time x .* y;
  0.365343 seconds (30 allocations: 762.941 MiB, 10.40% gc time)

julia> @time x * y;
 23.990639 seconds (6 allocations: 762.940 MiB, 0.12% gc time)

EDIT: I committed crimes by not doing a proper benchmark and by timing in the top-level. But, in this case it doesn’t make much difference.


#9

Are you still comparing elementwise multiplication in Python with matrix multiplication in Julia? Because a 10x increase in n is expected to be a 100x slowdown in elementwise multiply, which matches the time you’re reporting for NumPy—about 0.6 seconds. Matrix multiply is super-linear in the size of the matrix, so you would expect a much bigger slowdown for that, which is exactly what you’re seeing in Julia.
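The operation counts behind that argument can be checked with plain arithmetic: elementwise multiply does n² multiplications, while a naive matrix multiply does roughly n³ multiply-adds.

```python
def elementwise_ops(n):
    # one multiplication per entry of the n x n result
    return n * n

def matmul_ops(n):
    # ~n multiply-adds per entry of the n x n result (naive algorithm)
    return n ** 3

# Going from n = 1000 to n = 10000:
print(elementwise_ops(10_000) / elementwise_ops(1_000))  # 100.0
print(matmul_ops(10_000) / matmul_ops(1_000))            # 1000.0
```

So a 10x larger n means 100x more work for the elementwise product but 1000x more for the matrix product, which is why the two benchmarks diverge so sharply at n = 10000.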


#10

That is indeed the elementwise product in Python.
Matrix multiplication in Python should be a @ a or np.matmul(a, a).
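For ndarrays, the three spellings of the matrix product are interchangeable, which a quick check confirms (small example with arbitrary values):

```python
import numpy as np

a = np.arange(4.0).reshape(2, 2)  # [[0, 1], [2, 3]]

# For ndarrays, these three spellings are the same matrix product:
r1 = a @ a
r2 = np.matmul(a, a)
r3 = a.dot(a)

print(np.array_equal(r1, r2) and np.array_equal(r2, r3))  # True
print(r1)  # [[ 2.  3.] [ 6. 11.]]
```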


#11

Are you sure a*a is elementwise?
From the documentation:
https://docs.julialang.org/en/release-0.6/stdlib/linalg/


#12

I want to understand why it is so slow. At the moment I don’t have a specific application.


#13

No, a*a is matrix multiplication in Julia, but it’s elementwise multiplication in Python (for ndarrays), which is what you’re timing: cProfile.run("x*y"). That is why Python is faster and scales linearly with the number of matrix elements.


#14

Python elementwise.

Julia matrix multiplication.


#15

Are you sure?
I tried:

import numpy as np
x=np.matrix('1 2 3');
y=np.matrix('1;2;3');
x*y

And the result is 14.
So I think it is matrix multiplication.


#16

That’s a different type than numpy.random.random gives you and they do not behave the same way when you operate on them with *:

>>> type(numpy.random.random((3,3)))
<type 'numpy.ndarray'>

>>> a = numpy.ones((3, 3))

>>> type(a)
<type 'numpy.ndarray'>

>>> a * a
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

>>> a.dot(a)
array([[ 3.,  3.,  3.],
       [ 3.,  3.,  3.],
       [ 3.,  3.,  3.]])

>>> m = numpy.matrix(a)

>>> type(m)
<class 'numpy.matrixlib.defmatrix.matrix'>

>>> m * m
matrix([[ 3.,  3.,  3.],
        [ 3.,  3.,  3.],
        [ 3.,  3.,  3.]])

>>> m.dot(m)
matrix([[ 3.,  3.,  3.],
        [ 3.,  3.,  3.],
        [ 3.,  3.,  3.]])

If you find it immensely confusing that * does completely different and incompatible things in Python/NumPy for subtly different array types, I can sympathize.


#17

Ok. I understand now :smiley:
Thank you a lot.
So I’ll wait for version 1 of Julia while using 0.6 :stuck_out_tongue:

Good work!


#18

The fastest Python on the planet (Intel Python):

import numpy as np
import cProfile
import time

n = 10000;
x = np.random.random((n,n))
y = np.random.random((n,n))

t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
z  = np.matrix(x) * np.matrix(y)
t1 = time.perf_counter()

total = t1-t0
print("Total (sec) =", total)

Which times:

Total (sec) = 10.513941910715321
[Finished in 12.4s]

And Julia (Julia Pro 0.6.0 for fair comparison):

julia> x = rand(10000,10000);

julia> y = rand(10000,10000);

julia> @time x*y;
  9.327905 seconds (6 allocations: 762.940 MiB, 0.12% gc time)

Hope the comparison is clear now! Happy Julia-ing, BTW.
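The same measurement works with a plain ndarray and the @ operator, with no need for the np.matrix wrapper (a sketch; absolute times depend on hardware and BLAS, and n is reduced here so it runs quickly):

```python
import time
import numpy as np

n = 500  # smaller than the post's 10000 so the sketch runs fast
x = np.random.random((n, n))
y = np.random.random((n, n))

t0 = time.perf_counter()
z = x @ y  # true matrix product on ndarrays
t1 = time.perf_counter()

print("Total (sec) =", t1 - t0)
```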