Why is this simple function twice as slow as its Python version

That doesn’t seem contradictory to what I wrote.

Wait. This is not the same as rand!(t); it calculates only a single random value. (Did you mean t .= rand.()?) What is the desired behaviour, and which is faster?

I think rand! is probably better, as it will generate values of the correct eltype automatically.
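
For concreteness, a minimal sketch of the three spellings being discussed (assuming t is a preallocated Float64 matrix; the exact line from the earlier code isn't quoted here):

using Random

t = zeros(3, 3)

t .= rand()    # draws ONE random number and copies it into every element
t .= rand.()   # the broadcast calls rand() once per element
rand!(t)       # fills t in place, drawing values of eltype(t)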

3 Likes

AFAIK, the only performance penalty of 64-bit integer operations vs 32-bit is that 32-bit integer operations on the lettered registers (e.g. eax, edx, etc.) require one byte less than the 64-bit versions (rax, rdx, etc.), because the 64-bit forms need a REX prefix.
Intel’s optimization manuals recommend using 32-bit integers when possible for this reason.

(Note that I’m talking about operations on integer registers, not vector registers. This is also an x86_64-specific quirk; non-Thumb ARM instructions are all exactly 4 bytes, for example.)
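
If you’re curious, one way to see this on your own machine (a sketch; the exact instructions and byte counts depend on your CPU and LLVM version) is to compare the native code generated for a 32-bit and a 64-bit add:

# The 64-bit form needs a REX prefix to address rax/rdx/..., so its
# arithmetic instruction encodes one byte longer than the 32-bit form
# operating on eax/edx/...
add32(x::Int32, y::Int32) = x + y
add64(x::Int64, y::Int64) = x + y

@code_native add32(Int32(1), Int32(2))
@code_native add64(1, 2)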

I think it’d normally be hard to tell from benchmarks that one program’s code is a few bytes smaller.
The BLAS libraries will still run in 64-bit mode, which is necessary for having access to more than just 8 integer and 8 vector registers. Having more than 8 vector registers is critical for register tiling, making it essential for good BLAS performance.

The BLAS API assumes that one dimension is contiguous, but it does not assume that the stride of the other dimension equals the size of the contiguous dimension.
What I mean is:

A = view(rand(100,100), 10:70, :);
B = A'

A and B are both fine. Because A is not transposed, all that is needed is stride(A,2) >= size(A,1), and stride(A,1) == 1.
Because B is transposed, what’s needed is stride(B,2) == 1 and stride(B,1) >= size(B,2).

However, this means that

C = view(rand(100,100), 10:2:70, :);

is not okay, because stride(C,1) != 1.
For reference, NETLIB dgemm docs. That’s the standard BLAS API used by OpenBLAS/MKL/etc, but note that libraries could provide extra APIs.
I think (but am not sure) that BLIS’s object API does support non-contiguous strided arrays, so that C above would work.
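
To make the rule concrete for the non-transposed case, here is a small sketch (the helper name is made up for illustration, not a real API):

# Hypothetical check for the requirement described above: unit stride
# along the first dimension, and a column stride at least as large as
# the number of rows.
blascompatible(M) = stride(M, 1) == 1 && stride(M, 2) >= size(M, 1)

A = view(rand(100, 100), 10:70, :)    # columns are contiguous
C = view(rand(100, 100), 10:2:70, :)  # stride 2 within each column

blascompatible(A)  # true:  stride(A,1) == 1, stride(A,2) == 100
blascompatible(C)  # false: stride(C,1) == 2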

4 Likes

@Elrod, You’re just repeating what I said:

  1. ILP64 means the pointer calculations (pointers are not the AVX units) are done using 64-bit arithmetic. It is the same as the first days of 64-bit, when 64-bit applications were slower because moving to 64-bit pointers caused some slowdowns.
  2. GEMM can be configured with the step between columns, but each column must be contiguous. This is what I recall. Intel likes calling it the stride (column / row stride). It is used similarly in the image-processing world to allow working on an ROI (Region of Interest) within a larger memory buffer (see the sketch at the end of this post).

What I said about (2) is that I’d go and see what numpy is doing.
It seems they detect the contiguous columns and hence automatically use that feature to prevent allocations in the GEMM call. I think if it is done automatically in numpy, it should be copied in Julia.
It makes no sense to use view here, as it gives the impression that the GEMM call will be non-allocating even if the data is completely non-contiguous (I’d even check whether that produces errors or wrong results).
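
To illustrate the ROI idea in Julia terms (a sketch only, not what numpy does internally): a view whose columns are contiguous lets BLAS receive the parent buffer’s column stride as the leading dimension, so the multiplication runs on the sub-block without copying:

using LinearAlgebra

bufA = rand(100, 100); bufB = rand(100, 100); bufC = zeros(100, 100)

# 50x50 regions of interest inside larger buffers; each column of a view
# is contiguous, and stride(A, 2) == 100 serves as the leading dimension.
A = @view bufA[1:50, 1:50]
B = @view bufB[1:50, 1:50]
C = @view bufC[1:50, 1:50]

BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C)  # writes the product into bufC's ROI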

2 Likes

Yes, that. Fixed there. And rand! is better, yes.

What numpy does that is nice is to interpret the c = a*b notation such that it is done in place if c exists. Not more than that I think.

It was said to do even more:

It seems to be more intelligent than only working on a pre-defined array.
It seems that even if the LHS is a slice with contiguous memory (at least contiguous columns), it handles the operation in place.
It would be great if mul!() would do the same.

Well, the good thing about Julia is that one can do fast things without waiting for Base or any other library to change. So, if someone is interested in a more intelligent behaviour of the mul! operation, they can create a package which introduces a smartmul! that is close to the numpy implementation, let others use it, find all the issues and problems with this approach, and, if it really is a good improvement, propose it as a PR to Base or LinearAlgebra. This way the better version can be used from day 1.
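
One possible shape for such a function, purely as an illustration (the name smartmul! and its signature are made up here, not an existing package): have the caller name the destination region, and write the product straight into it through a view:

using LinearAlgebra

# Hypothetical smartmul!: the destination is given as a parent array plus
# index ranges, and mul! writes into that region with no temporary array.
smartmul!(C, rows, cols, A, B) = mul!(view(C, rows, cols), A, B)

# usage sketch
C = zeros(4, 4); A = rand(3, 3); B = rand(3, 3)
smartmul!(C, 1:3, 1:3, A, B)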

2 Likes

Am I missing something, or is mul! always in-place?

While this flexibility is great if you’re an advanced user, it has the opposite effect if you’re just a regular user.
We don’t have the allocation info of Python for the above code, but if the analysis in Why is this simple function twice as slow as its Python version - #73 by Semi is correct about numpy, this is the reason for Python’s speed.

The issue with Julia is that it was built on the promise of “solving the two-language problem” by being a very fast language.
In practice, in too many cases, you need to use very advanced concepts to get the performance you get with MATLAB / Python out of the box.
More than that, it seems the “dynamic range” of the Julia language is so big that in practice we have two languages: Julia as a dynamic, elegant language (maybe Juliade) and Julia as a performance language (maybe Juliape). They are both under the Julia umbrella, but in practice they are very different.

@giordano, I haven’t checked, but what happens if you send a slice to mul!()? I think the slice becomes a newly allocated buffer, the matrix multiplication is applied to it, and then the updated buffer is copied back into the sliced array.
In Why is this simple function twice as slow as its Python version - #78 by lmiq @lmiq created a view of the slice and sent this view to matmul!().
It seems in Python you can do it on a slice directly with no allocation (again, according to @Semi).

1 Like

Slicing always copies, and that happens before mul! operates. You have to use @view to not create a copy. It doesn’t have anything to do with mul!
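
A short illustration of that point (a sketch):

using LinearAlgebra

C = zeros(4, 4); A = rand(3, 3); B = rand(3, 3)

mul!(C[1:3, 1:3], A, B)         # C[1:3,1:3] is a copy, so C itself is unchanged
mul!(@view(C[1:3, 1:3]), A, B)  # the view aliases C, so the product lands in C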

5 Likes

Correcting:

What numpy does is: if one assigns the result to a slice, it mutates the destination in place:

In [10]: c = zeros((4,4))

In [11]: a = rand(3,3); b = rand(3,3);

In [14]: c[0:3,0:3] = a@b

In [15]: c
Out[15]: 
array([[0.67484508, 1.04255808, 0.2950406 , 0.        ],
       [0.74239244, 1.13492857, 0.38982699, 0.        ],
       [0.8034559 , 1.18725292, 0.49251803, 0.        ],
       [0.        , 0.        , 0.        , 0.        ]])


But if the output is not set to a slice, it does not (continuing from the same code):

In [16]: c = a@b

In [17]: c
Out[17]: 
array([[0.67484508, 1.04255808, 0.2950406 ],
       [0.74239244, 1.13492857, 0.38982699],
       [0.8034559 , 1.18725292, 0.49251803]])

Anyway, that has to be taken with a lot of care, because that syntactic sugar only works for that line of code:

In [25]: c = zeros((3,3)); a = rand(2,2); b = rand(2,2);

In [26]: def matmul(c,a,b):
    ...:     c = a@b

In [31]: matmul(c[0:2,0:2],a,b)

In [32]: c
Out[32]: 
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

I don’t know how to pass a slice of c to the function, but clearly some other notation is needed (which in Julia would be a view).

That means that while one can do c[0:2,0:2] = a@b in place in Python without thinking, anything more complicated than that requires a deeper understanding of how Python and numpy manage the passing of arrays and slices.

Add on:

The general solution in Python is a more explicit call to the multiplication routine:

In [47]: def matmul(c,a,b):
    ...:     numpy.matmul(a,b,out=c)
    ...: 

In [51]: c = zeros((3,3)); a = rand(2,2); b = rand(2,2);

In [52]: matmul(c[0:2,0:2],a,b)

In [53]: c
Out[53]: 
array([[0.64204481, 0.04420175, 0.        ],
       [1.03915034, 0.42034541, 0.        ],
       [0.        , 0.        , 0.        ]])

One needs to know what one is doing anyway. The difference from Julia is that the slice c[0:2,0:2] does not create a new array but directly creates a view, while in Julia one needs to explicitly create the view with @view(c[1:2,1:2]).

In summary:

The differences are:

  1. numpy has syntactic sugar to mutate in place when c[0:2,0:2] = a@b is used for a matrix multiplication. Maybe this would be interesting to implement in Julia? It sounds feasible, although somewhat inconsistent with the rest of the syntax.

  2. numpy creates views by default when slicing, while Julia creates copies by default. That is a design choice; sometimes one is better, sometimes the other.

4 Likes

A sailing train, nice image :wink:

2 Likes

FWIW, Octavian can already handle non-contiguous array slices with 0 allocations:

julia> using Octavian, LinearAlgebra

julia> C = rand(2,1000,1000); A = rand(2,1000,1000); B = rand(2,1000,1000);

julia> @benchmark matmul!(view($C,1,:,:), view($A,1,:,:), view($B,1,:,:))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.637 ms (0.00% GC)
  median time:      1.669 ms (0.00% GC)
  mean time:        1.668 ms (0.00% GC)
  maximum time:     1.885 ms (0.00% GC)
  --------------
  samples:          2979
  evals/sample:     1

julia> 2e-9*1000^3 / 1.637e-3
1221.7470983506414

julia> BLAS.set_num_threads(18);

julia> view(C,1,:,:) ≈ view(A,1,:,:)*view(B,1,:,:)
true

julia> @benchmark mul!(view($C,1,:,:), view($A,1,:,:), view($B,1,:,:))
BenchmarkTools.Trial:
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     718.482 ms (0.00% GC)
  median time:      718.602 ms (0.00% GC)
  mean time:        719.084 ms (0.00% GC)
  maximum time:     722.105 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

And of course, contiguous views too. None of these require allocations.

8 Likes

May I summarize what we learnt here?

  1. The original post had Python doing a*b, which is not a matrix multiplication, so the comparison was not correct. That was changed to a@b, and the difference in performance reported is much smaller (30%), not “twice”.
  2. The remaining difference in performance is somewhat system dependent. If I copy/paste the original codes now I get, on my machine, the same time for both of them (~950 ms). Yet the benchmarks oscillate a bit, because they are calling BLAS with multi-threading in the background, and other programs probably compete for the processor. There may be an issue concerning the number of threads launched by the BLAS routine.
  3. That is despite the fact that in Python the line tmp2[...] = t@tmp1 does not allocate a new array, while the line tmp2[...] = t*tmp1 does allocate a new array in Julia.
  4. Solving that specific allocation in Julia requires a more verbose syntax (mul!(@view(tmp2[...]), t, tmp1)). It might improve the performance slightly (maybe 10%), but the timings vary for the same reasons as above.
  5. Avoiding other allocations can readily make the Julia code run 2x faster than the original one (a sketch combining this with point 4 follows after the list). That can probably be done with Python as well.
  6. More advanced modifications and a 32-bit representation of the matrices can make the code 50x faster than the original one (Elrod’s batch version), but that is advanced indeed.
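
For reference, a sketch of what points 4 and 5 look like when combined on the test function (preallocating the work matrices and refilling them with rand!; the function name here is made up, and timings will of course vary by machine):

using LinearAlgebra, Random

# Preallocate t and tmp1 once, refill them in place each iteration, and
# write each block of tmp2 through a view with mul! -- no temporaries.
function test_inplace!(tmp2, n, k)
    t    = zeros(n, n)
    tmp1 = zeros(n, n)
    for i in 1:k
        rand!(t)
        for j in 1:k
            rand!(tmp1)
            mul!(@view(tmp2[(i-1)*n+1:i*n, (j-1)*n+1:j*n]), t, tmp1)
        end
    end
    return tmp2
end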

Finally, I congratulate all for the very pleasant and civilized conversation! :slight_smile:

31 Likes

Hi!

I have an iMac (2 physical cores) and I ran these tests (macOS Big Sur). However, my Python knowledge is very limited, hence I might have done something wrong (test here is the very first implementation):

Julia

julia> BLAS.get_num_threads()
4

julia> @btime test($n, $k, $tmp2)
  1.330 s (3660 allocations: 1.23 GiB)

julia> BLAS.set_num_threads(2)

julia> @btime test($n, $k, $tmp2)
  1.397 s (3660 allocations: 1.23 GiB)

Python

python3 -m timeit "import teste; teste.main()"
1 loop, best of 5: 1.37 sec per loop

As you can see, Julia is faster even using the first version without any optimizations. My Python installation is “vanilla”, just installed from Homebrew.

1 Like

By the way, I think this is a classic example for something like a batch mode (something similar to what @Elrod did in his code).

1 Like

To add another datapoint, here are the timings I get with the original code on my dual-core i5-7200U @ 2.5 GHz.

Julia 1.6:

function test(n, k, tmp2)
    for i in 1:k
        t = rand(n, n) 
        for j in 1:k
            tmp1 = rand(n, n)
            tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
        end
    end
end;

n = 300
k = 30
tmp2 = zeros(n*k, n*k)

@time test(n, k, tmp2)
@time test(n, k, tmp2)

  2.662469 seconds (2.55 M allocations: 1.362 GiB, 7.36% gc time, 31.38% compilation time)
  1.312242 seconds (3.66 k allocations: 1.227 GiB, 0.51% gc time)

Python 3.8.5:

import numpy
from numpy import zeros
from numpy.random import rand

import time

def test(n,k,tmp2):
    for i in range(k):
        t = rand(n, n) 
        for j in range(k):
            tmp1 = rand(n, n)
            tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1

n = 300
k = 30
tmp2=zeros((n*k,n*k))

t0 = time.time()
test(n, k, tmp2)
print(time.time() - t0)

t0 = time.time()
test(n, k, tmp2)
print(time.time() - t0)

5.234311103820801
5.037538290023804

So on my system Python is 3.8x slower…

2 Likes

Without devaluing the insights from this discussion, and well aware that this is just an MWE, it may also be worth noticing that constructions like tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1 have a tendency to appear as the result of designing a solution around the constraint that only vectorized computations are fast. It’s possible that the original problem only has a natural solution of this style, but it could also be that a solution designed for Julia would look very different.

15 Likes