Why is this simple function twice as slow as its Python version?

This is because array slices make copies in Julia.
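
For illustration, a minimal REPL sketch of the difference between a slice (copy) and a view:

julia> a = collect(1:5);

julia> b = a[2:4];       # slicing allocates a copy

julia> b[1] = 100; a[2]  # the original is untouched
2

julia> v = @view a[2:4]; # a view shares memory with a

julia> v[1] = 100; a[2]  # now the original changes
100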

The main reason for this is that, some years ago, array views were expensive and slow, so copies were preferred. As far as I understand, most people expected this to change at some point, but by the time views finally became efficient, changing slices over to views would have broken virtually all Julia code in existence.

If slices had been views in the first place, this would not have been an issue, but that ship has sailed.

5 Likes

If you are hoping to run precisely your python code, and have it faster, then I think you will be disappointed. The guys who wrote numpy are not idiots. There is a reason this is a whole different language.

It’s young and it’s far from perfect, but it does do very well on linear-algebra-type problems. One reason is precisely that you can better control how multi-threading gets used and how memory is accessed.

Lots of people are happy to try to show you how. But I think the problem needs to be a bit better specified. People have posted fast solutions. And I have posted a very short, fast, solution. But, of course, these change the code.

5 Likes

That’s a problem that is being worked on in the GitHub issue linked above: https://github.com/JuliaLang/julia/issues/33409

This explains a lot. This looks to me like a huge problem if one wants to use Julia for numerical linear algebra (NLA).
I tried with
@views tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] = t*tmp1
which did not make a difference, so one really has to be careful with such operations.

From what I understand up to now:

  1. You are essentially measuring the time of the matrix multiplication. There is effectively one detail: Julia starts BLAS with a number of threads equal to the number of threads (including hyperthreads), while Python starts BLAS with a number of threads equal to the number of physical cores. The latter seems to be better, although that is machine-dependent and also matrix-size-dependent. You may want to try running the Julia example after setting the number of BLAS threads equal to the number of physical cores with BLAS.set_num_threads(4) (where 4 here is the number of physical cores); see the short sketch after the code below.

  2. The above may or may not eliminate the difference between the two. On my machine it does. But even on my machine Python is using MKL and Julia is using OpenBLAS, thus we are not comparing exactly the same thing. That being said, the MKL build failed for me here as well (Julia 1.6). If it fails again we should probably report an issue.

  3. Matrix multiplication may or may not occur in place. In Julia, c = a*b does not multiply in place (it allocates a new c). Does it in Python? For multiplying in place you need to use mul!. The code is a little more “complicated” because you are assigning the output to a slice of the preallocated output, thus the need for the view. Again, I don’t know how “easy” multiplying matrices and saving the result into a preallocated slice of an array is in Python. It may be simpler, maybe. But in Julia it is:

julia> c = a*b # allocates the result in a new c

julia> mul!(c,a,b) # saves the result to preallocated c

julia> mul!(@view(c[1:10]), a,b) # save the result in a preallocated slice of c
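
As a minimal sketch of point 1 (assuming Julia 1.6+, where BLAS.get_num_threads is available):

julia> using LinearAlgebra

julia> BLAS.get_num_threads()  # how many threads BLAS is currently using

julia> BLAS.set_num_threads(4) # set it to the number of physical cores (4 on this hypothetical machine)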

There is also Octavian for matrix multiplication, with matmul!. It is pure Julia and as fast as MKL, at least in the tests here.

Anyway, if the code is essentially dependent on matrix-multiplication performance, you will not get a huge speedup with Julia relative to numpy. Both use the same routines in the background and, apart from some tweaking in one or the other, they should behave the same.

5 Likes

It’s np.matmul(A,B, out=C).

1 Like

I don’t need Julia to be faster than Python, just in the same ballpark. And there is no real Python code here, just two loops and a simple matrix operation; it should be language-agnostic.
Here the issue is the speed of implementation. If I need much more time to write it in Julia than in Python/numpy just to get the same performance as in Python (or even twice as fast as Python), that is a dealbreaker.

Well, if you already know Python very well and are just starting with Julia, there is certainly a learning curve. But in general the Julia syntax is quite easy, and optimizing the code for performance is much easier than in Python (not saying possible vs. impossible… unless you rely on Cython, etc., which, I am sure, is much more cumbersome than Julia).

And a whole lot of allocation & garbage collection, in ways which might be avoidable in real usage. And the generation of lots of random numbers inside the loop, which again is unlikely to be what you are actually interested in – unless, as I said, you are trying to prove the central limit theorem.

I’m sure you know this, but view(c, 1:1) .= a * b would still allocate for the RHS. What avoids it is d = similar(a); mul!(d, a, b). Making that auto-fuse would… probably be too magical, but there was a macro at some point which did such things, I think?

And of course rand(n,n) allocates a lot of new matrices.

Edit: I found the macro, but it doesn’t handle this:

julia> using InplaceOps

julia> @macroexpand @! C = A * B
:(C = InplaceOps.mul!(C, A, B))

julia> @macroexpand @! C[1:2, 1:2] = A * B  # sadly not!
ERROR: LoadError: Invalid use of @! macro
2 Likes

Thanks for this, now I’m getting a good picture on what is happening.
I tried BLAS.set_num_threads(4); this actually slowed down the computation (I do have 4 cores).

2 Likes

As another data point, I get the same performance between Julia and Python on an i5 MacBook running macOS.

julia> @btime test($n, $k, $tmp2);
  2.060 s (3660 allocations: 1.23 GiB)

and

In [9]: %timeit test(n, k, tmp2)
2.09 s ± 475 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

OP, I think ultimately it’s likely that Python is somehow hitting a slightly more optimized matrix-multiplication method and is by default using a slightly better number of threads. That’s okay. Numpy is a very well-developed framework and is doing a good job dispatching in this case.

The fact that this difference appears or disappears depending on the platform is interesting. Maybe the different method Julia uses is actually better for performance on some different but still similar problem, which would make this a worthwhile tradeoff.

It’s obviously kind of hard to pin down. A full MWE that is able to isolate the difference in multiplication method that is causing this would make for a good issue at the Julialang repo.

3 Likes

After all I have learned here, my opinion on numpy has increased considerably.
It seems that numpy interprets tmp2[i*n:(i+1)*n,j*n:(j+1)*n] = t@tmp1 as
np.matmul(t,tmp1, out=tmp2[i*n:(i+1)*n,j*n:(j+1)*n]) without me having to think about it. That’s very nice.

1 Like

Anyway, a faster version of the complete code would be the one below, with some comments. Being used to Julia, one would write it more or less directly:

julia> using LinearAlgebra, BenchmarkTools
       function test_fast(n::Int64,k::Int64,tmp2::Array{Float64,2})
           t = similar(tmp2,n,n) # preallocates t without initializing it
           tmp1 = similar(t) # preallocates tmp1
           for i in 1:k
               t .= rand.() # assigns a rand value to each element of t
               for j in 1:k
                   tmp1 .= rand.() # assigns a rand value for each element tmp1
                   buf = @view tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n] # set the output slice
                   mul!(buf,t,tmp1) # multiply in place and save in buf
               end
           end
       end
       n = 300; k = 30; tmp2=zeros((n*k,n*k));
       @btime test_fast($n,$k,$tmp2)
  494.257 ms (4 allocations: 1.37 MiB)

4 Likes

Yes, that is very interesting.
Is the test2 function by @dmolina then 4 times as fast as Python’s?

I can’t pinpoint which function that is, but the one just above is definitely faster than Python’s:

julia> @btime test_fast($n,$k,$tmp2)
  1.181 s (4 allocations: 1.37 MiB)

julia> @btime test_fast($n,$k,$tmp2)
  1.403 s (4 allocations: 1.37 MiB)

That is interesting as test_fast on my system is 3 times faster than test, and your gain is much smaller.

I guess these differences are related to the fact that the BLAS call in the background uses multi-threading with all CPUs available, and what one gets depends on what else is running on the computer besides the code.

Also, the Python version can be improved on most of the same points as the Julia version above, by preallocating the arrays. The difference in performance will get smaller, since the only real computation there is the matrix multiplication. But then I don’t know how ugly or pretty the corresponding Python code would be.

One note: apparently numpy defaults to 32-bit MKL. I am not sure what the implications of this are for the result of a matrix multiplication. Yet, if one uses Octavian, which is pure Julia, switching from one number representation to the other depends only on the element types of the input matrices. With that, the Julia code becomes much faster:

julia> using Octavian, BenchmarkTools
       function test_fast(n,k,tmp2)
           t = Matrix{eltype(tmp2)}(undef,n,n)
           tmp1 = similar(t)
           for i in 1:k
               t .= rand.() 
               for j in 1:k
                   tmp1 .= rand.()
                   buf = @view(tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n])
                   matmul!(buf,t,tmp1)
               end
           end
       end
       n = 300; k = 30; tmp2=zeros(Float32,n*k,n*k);
       @btime test_fast($n,$k,$tmp2)
  170.436 ms (4 allocations: 703.28 KiB)

Note that the function itself is agnostic to the number type, which is defined only by the way one initializes tmp2 outside the function. This is a beauty of Julia, and in this case it propagates all the way down to the matrix multiplication.
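
For instance (a hypothetical follow-up in the same session; tmp2_64 is just an illustrative name), the very same function runs in double precision simply by changing the initialization:

julia> tmp2_64 = zeros(Float64, n*k, n*k); # only the element type of the input changes

julia> @btime test_fast($n, $k, $tmp2_64)  # same function body, now a Float64 matmul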

8 Likes

That’s not completely accurate; when we measured it at the decision point (pre-1.0), the performance wasn’t as straightforward a win as we had expected, and the change in semantics wasn’t obviously better. Views have become even faster since 1.5, but I still don’t think it’s a clear-cut performance win across the board (and now we’re actually constrained by 1.x stability).

6 Likes

Regarding the RNG point, VectorizedRNG should be faster.

julia> using Random, VectorizedRNG

julia> x = rand(Float32, n, n);

julia> @btime rand!(local_rng(), $x);
  7.223 μs (0 allocations: 0 bytes)

julia> @btime $x .= rand.();
  323.120 μs (0 allocations: 0 bytes)

But it makes little difference compared to matrix multiplication:

julia> n = 300; k = 30; tmp2=zeros(Float32,n*k,n*k);

julia> function test_fast_rng(n,k,tmp2)
                  t = Matrix{eltype(tmp2)}(undef,n,n)
                  tmp1 = similar(t); rng = local_rng();
                  for i in 1:k
                      rand!(rng, t)
                      for j in 1:k
                          rand!(rng, tmp1)
                          buf = @view(tmp2[(i-1)*n+1:i*n,(j-1)*n+1:j*n])
                          matmul!(buf,t,tmp1)
                      end
                  end
              end
test_fast_rng (generic function with 1 method)

julia> @btime test_fast($n,$k,$tmp2)
  33.926 ms (4 allocations: 703.28 KiB)

julia> @btime test_fast_rng($n,$k,$tmp2)
  33.556 ms (4 allocations: 703.28 KiB)

Also, because this is Julia, we can thread our loops, and not just BLAS:

julia> using CheapThreads, Octavian, BenchmarkTools

julia> using Random, VectorizedRNG

julia> function test_batch(n,k,tmp2)
           @batch for i in 1:k
               t = Matrix{eltype(tmp2)}(undef, n, n);
               tmp1 = similar(t);
               rng = local_rng()
               rand!(rng, t)
               for j in 1:k
                   rand!(rng, tmp1)
                   buf = @view(tmp2[(j-1)*n+1:j*n,(i-1)*n+1:i*n])
                   matmul_serial!(buf,t,tmp1)
               end
           end
       end
test_batch (generic function with 1 method)

julia> n = 300; k = 30; tmp2=zeros(Float32,n*k,n*k);

julia> @btime test_batch($n,$k,$tmp2)
  20.605 ms (125 allocations: 20.60 MiB)

Edit:
Also, 32- vs 64-bit BLAS is about the size of the integers. MKL and OpenBLAS both handle Float32 and Float64 whether they’re 32-bit or 64-bit builds.
But Octavian handles multiplying Int32 and Int64 matrices, while those BLAS libraries don’t.
For something more exotic, it takes just a few extensions to get a TropicalGEMM, where * is + and + is max.
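
A minimal sketch of the integer case (matrix names are illustrative, assuming Octavian's exported matmul):

julia> using Octavian

julia> A = rand(Int32(1):Int32(10), 200, 200); B = rand(Int32(1):Int32(10), 200, 200);

julia> C = matmul(A, B); # Int32 matmul, which OpenBLAS/MKL gemm does not provide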

10 Likes

The LP64 and ILP64 interfaces in MKL differ in the size of the integers used to index into the data.
While ILP64 allows handling larger matrices, it means every index computation uses 64-bit arithmetic, which is slower. It shouldn’t be ground-breaking, but the difference is there.

What I don’t get about all the games with views: does MKL / OpenBLAS (basically any BLAS library) support writing the output of GEMM into non-contiguous memory?
In the code above, the output block is a square within a larger allocated array, which means each column of the block is contiguous. If I remember correctly, GEMM allows defining the jump between columns / rows (the leading dimension), hence it can be used here. But if the indices were completely arbitrary, I think all the view tricks wouldn’t work, right? The call to GEMM would require at least the columns to be contiguous.
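
To illustrate the distinction as I understand it (a rough sketch; the strided case should hit BLAS gemm, while the arbitrary-index case should fall back to Julia's generic multiplication):

julia> using LinearAlgebra

julia> A = rand(3, 3); B = rand(3, 3); C = zeros(8, 8);

julia> mul!(@view(C[2:4, 5:7]), A, B);         # block of a larger array: columns contiguous, constant leading dimension

julia> mul!(@view(C[[1,4,8], [2,5,7]]), A, B); # arbitrary indices: not BLAS-compatible, generic (slower) fallback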

One simple thing to analyze in the code above is the number of allocations in Python.
If Python can generate a call to GEMM with no allocations by default, that might be a feature worth borrowing.

3 Likes