Drastic performance hit matrix multiply different types. Internal cast julia vs numpy?


#1

Because of several performance problems I have seen in discourse I wanted to ask why this is happening. And how is it possible it is not happening in numpy?

I know we can avoid the problem I present here beeing careful when defining arrays but the performance hit seems still too big.

Here

NUMPY
N = 1000 time float64 and float64 0.05866699999999997 
N = 1000 time float64 and float32 0.06433399999999995 
N = 3000 time float64 and float64 1.2407199999999996 
N = 3000 time float64 and float32 1.3071000000000002 

JULIA 0.6
N = 1000 same type          0.077815 seconds 
N = 1000 different type      1.332816 seconds 
N = 3000 same type          0.431971 seconds
N = 3000 different type      36.492838 seconds

Julia 1.0
N = 1000 same type          0.020056 seconds 
N = 1000 different type      0.947275 seconds 
N = 3000 same type          0.545358 seconds
N = 3000 different type      24.154566 seconds 

Numpy code


import time
import numpy as np


N = 1000
X = np.random.rand(N,N)
Y = np.random.rand(N,N)

t = time.process_time()
aux = np.dot(X,Y)
elapsed_time = time.process_time() - t
print("N = 1000 time float64 and float64 {} ".format(elapsed_time))
X = np.random.rand(N,N)
Y = np.array(np.random.rand(N,N), dtype="Float32")
t = time.process_time()
aux2 = np.dot(X,Y)
elapsed_time = time.process_time() - t
print("N = 1000 time float64 and float32 {} ".format(elapsed_time))

N = 3000
X = np.random.rand(N,N)
Y = np.random.rand(N,N)

t = time.process_time()
aux = np.dot(X,Y)
elapsed_time = time.process_time() - t
print("N = 3000 time float64 and float64 {} ".format(elapsed_time))
X = np.random.rand(N,N)
Y = np.array(np.random.rand(N,N), dtype="Float32")
t = time.process_time()
aux2 = np.dot(X,Y)
elapsed_time = time.process_time() - t
print("N = 3000 time float64 and float32 {} ".format(elapsed_time))

Julia code


X = rand(10,10)
Y = rand(10,10)
X*Y

X = rand(10,10)
Y = rand(Float32, 10,10)
X*Y

N = 1000

X = rand(N,N)
Y = rand(N,N)
println("N = 1000 same type")
@time X*Y

X = rand(N,N)
Y = rand(Float32, N ,N)
println("N = 1000 different type")
@time X*Y

N = 3000

X = rand(N,N)
Y = rand(N,N)
println("N = 3000 same type")
@time X*Y

X = rand(N,N)
Y = rand(Float32, N ,N)
println("N = 3000 different type")
@time X*Y

Quite bad performance of Julia 0.6.4 vs Python+Numpy
#2

The relevant performance differences are with

julia> @time LinearAlgebra.generic_matmatmul!(copy(X),'N','N', X, X);
  1.228295 seconds (12 allocations: 7.630 MiB)

julia> @time LinearAlgebra.generic_matmatmul!(copy(X),'N','N', X, Y);
  1.519281 seconds (12 allocations: 7.630 MiB)

julia> @time LinearAlgebra.gemm_wrapper!(copy(X),'N','N', X, X);
  0.072593 seconds (6 allocations: 7.630 MiB, 8.70% gc time)

julia> @time LinearAlgebra.gemm_wrapper!(copy(X),'N','N', X, Y);
# MethodError

#3

I don’t follow you sorry. I guess you are inspecting what is internally called when I write X*Y,

In any case If I try to inspect what is going on I don’t see why there should be such a big penalty hit.

I would expect than when one array is Float32 and the other Float64 the Float32 gets copied as Float64 and the same call is done (or something like this). Whatever internally Julia is doing takes much longer than copying the array to a new one Array of the same type and calling a function.

julia> X = rand(1000,1000);

julia> Y = rand(1000,1000);

julia> Y32 = rand(Float32,1000,1000);

julia> @code_typed X*Y
CodeInfo(
141 1 ─ %1 = (Base.arraysize)(A, 1)::Int64                                                                                                                                                   │╻   size
    │   %2 = (Base.arraysize)(B, 2)::Int64                                                                                                                                                   ││  
    │   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{Float64,2}, :(%1), :(%2), :(%2), :(%1)))::Array{Float64,2}           ││╻╷  Type
    │   %4 = invoke LinearAlgebra.gemm_wrapper!(%3::Array{Float64,2}, 'N'::Char, 'N'::Char, _2::Array{Float64,2}, _3::Array{Float64,2})::Array{Float64,2}                                    │╻   mul!
    └──      return %4                                                                                                                                                                       │   
) => Array{Float64,2}

julia> @code_typed X*Y32
CodeInfo(
141 1 ─ %1 = (Base.arraysize)(A, 1)::Int64                                                                                                                                                   │╻   size
    │   %2 = (Base.arraysize)(B, 2)::Int64                                                                                                                                                   ││  
    │   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{Float64,2}, :(%1), :(%2), :(%2), :(%1)))::Array{Float64,2}           ││╻╷  Type
    │   %4 = invoke LinearAlgebra.generic_matmatmul!(%3::Array{Float64,2}, 'N'::Char, 'N'::Char, _2::Array{Float64,2}, _3::Array{Float32,2})::Array{Float64,2}                               │╻   mul!
    └──      return %4                                                                                                                                                                       │   
) => Array{Float64,2}

julia> @code_lowered X*Y32
CodeInfo(
140 1 ─ %1 = (LinearAlgebra.eltype)(A)                                                                                                                                                                │
    │   %2 = (LinearAlgebra.eltype)(B)                                                                                                                                                                │
    │        TS = (LinearAlgebra.promote_op)(LinearAlgebra.matprod, %1, %2)                                                                                                                           │
141 │   %4 = TS                                                                                                                                                                                       │
    │   %5 = (LinearAlgebra.size)(A, 1)                                                                                                                                                               │
    │   %6 = (LinearAlgebra.size)(B, 2)                                                                                                                                                               │
    │   %7 = (Core.tuple)(%5, %6)                                                                                                                                                                     │
    │   %8 = (LinearAlgebra.similar)(B, %4, %7)                                                                                                                                                       │
    │   %9 = (LinearAlgebra.mul!)(%8, A, B)                                                                                                                                                           │
    └──      return %9                                                                                                                                                                                │
)

julia> @code_lowered X*Y
CodeInfo(
140 1 ─ %1 = (LinearAlgebra.eltype)(A)                                                                                                                                                                │
    │   %2 = (LinearAlgebra.eltype)(B)                                                                                                                                                                │
    │        TS = (LinearAlgebra.promote_op)(LinearAlgebra.matprod, %1, %2)                                                                                                                           │
141 │   %4 = TS                                                                                                                                                                                       │
    │   %5 = (LinearAlgebra.size)(A, 1)                                                                                                                                                               │
    │   %6 = (LinearAlgebra.size)(B, 2)                                                                                                                                                               │
    │   %7 = (Core.tuple)(%5, %6)                                                                                                                                                                     │
    │   %8 = (LinearAlgebra.similar)(B, %4, %7)                                                                                                                                                       │
    │   %9 = (LinearAlgebra.mul!)(%8, A, B)                                                                                                                                                           │
    └──      return %9                                                                                                                                                                                │
)


#4

Yes, exactly, and gemm_wrapper!, the call to highly optimised BLAS code, is expected to be more efficient, e.g. https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0, the question is how much difference is expected.


#5

Could this be a bug?

I guess copying an array and calling LinearAlgebra.gemm_wrapper! should not make the code 64 or 58 times slower…

But I really don’t understand all the internals of what happens in this situation. The performance hit is surprising nevertheless.


#6

It is better to use @btime from BenchmarkTools when timing individual operations. Using @btime, however, I observed the same behavior as you (Julia 1.0, Windows 10). I suggest you open an issue about this.


julia> N = 1000
1000

julia> X = randn(N,N);

julia> Y2 = rand(Float32,N,N);

julia> @btime X*Y2;
  958.907 ms (8 allocations: 7.63 MiB)

julia> @btime X*Matrix{Float64}(Y2);
  18.241 ms (4 allocations: 15.26 MiB)

#7

The option is to promote the Float32 matrix, do the multiplication in double precision and then copy back. This sort of breaks what one typically expects of the ! suffix in that it shouldnt do extra copies though.

If you value speed of execution over memory you can always just promote it yourself.


#8

I understand how I can solve this, but was this the expected default behaviour? When I see the code_lowered I agree the ! suffix suggest no copies. Neverhteless from my “high level code” X*Y I already expect memory to be allocated. And I was expecting a Blas call !

N = 1000

X = rand(N,N)
Y = rand(N,N)
println("N = 1000 same type")
@btime X*Y

X = rand(N,N)
Y = rand(Float32, N ,N)
println("N = 1000 different type")
@btime X*Y

Gives me same allocations in MiB but a drastically worse performance

N = 1000 same type
  15.822 ms (2 allocations: 7.63 MiB)
N = 1000 different type
  1.326 s (8 allocations: 7.63 MiB)

I would rather have some extra copies and much better speed. I can see this as a personal preference but at the end of the day If I expected a call to a highly optimized Blas (and therefore speed).

This was the issue of the huge performance hit from here:

and I expect it to happen to a lot of people.

If that was the expected behaviour maybe there should be a line in performance tips stating “Matrix multiplication with different types will destroy your performance yet it won’t allocate memory. This was a design decision.”.

Arrays are a remarkable part of Julia, and I expect that for a lot of people LinearAlgebra will be also relevant. Since the language clearly shows speed as a big feature (the comparisson between languages here https://julialang.org/benchmarks/ does not show memory) I felt the fastest solution should be the default one.

https://docs.julialang.org/en/v1/manual/performance-tips/index.html

Certainly in other frameworks (pytorch/numpy/matlab) this is not the default behaviour.


#9

There is no BLAS call for mixed-precision operations.

The basic options here are:

  1. Require the caller to promote to BLAS-compatible types if they want BLAS performance.

  2. Automatically promote behind the scenes, allocating a (potentially) large temporary matrix that the user didn’t request.

  3. Try to write a native type-generic Julia matmul that is competitive with an optimized BLAS library. This is extremely hard (in any language — BLAS libraries are often 100k lines of C code that require continual retuning to new hardware as well as a lot of expertise).

There is a reasonable case to be made for option (2) for A*B, which allocates anyway (but not for an in-place function like mul!).


#10

I completly agree about the mul! behaviour seeing how all functions with ! work in julia.

My “surprise” was the default behaviour. Users that really want to optimize the code will end up using A_mul_B! or `mul!. Thus, they will not encounter this problem and will probably don’t care about having another default behavior. At the end of the day it’s a behaviour you don’t want to use if you are optimizing your code.

I guess that option (3) would be the best option but, since that is a huge amount of work… we are now stuck with (1) or (2).

(1) - I guess that this option should print “Error you are trying to multiply an Array of eltype X with and Array of eltype Y”. I would be OK with this default behaviour since it would force me to adress the problem. Besides I would know where my problem is.

(2) - This was the default behaviour I was expecting and Is the one I would like to have. Nevertheless I am used to program in Matlab and python so I can be a “non representative user”.

My issue is that I used @code_warntype in a function that had this issue and I didn’t see any warnings so I left the code multiplying matrices of different types. If there is a macro that tells me “hei! this could be much faster take care with the types” I would be also happy.


#11

Having an error or a warning by default is not an option. In Julia, we don’t refuse to give the correct answer just because the code is suboptimal. It’s perfectly reasonable to write slow code when you are in an exploration stage and only optimize it later.


#12

And that is why I prefer the second option!
But 1.30 sec in numpy vs 36 sec in Julia seems like too big of a tradeoff for some memory. I agree it is subjective.

You can always argue that there is the case where X*Y using (2) would exceed the available memory and freeze your machine whereas the current X*Y would take longer but return the correct result.
This case seems though a bit of a corner case.


#13

The generic matmul has nested loops, which are not yet properly optimized: by replacing the inner one with a call to a Julia dot function (which is compiled to good SIMD code, also with mixed precision), I get a speedup of 2x. For matrices of size 64x64, this is only about 5x slower than 8-fold multithreaded BLAS. (You are of course correct that matching the latter would take Herculean labor.)

For the mid-size (1000x1000) problem described above, the generic matmul is up to 100x times slower than BLAS on some workstations (40x with the dot-loop replacement). Doesn’t this suggest that the blocking algorithm in LinearAlgebra.generic_matmul! is flawed? But perhaps the better question is whether it is worth the time and effort of a competent developer to deal with the flaw, given that this is a niche case and native multithreading may prompt a different solution in several months.


#14

Couldn’t the code that copmues X*Y check at runtime the size of the inputs arrays and decide to copy/not to copy depending on the speedup? If the speedup is expected to be small don’t bother making a copy.

It seems like a hacky thing to do but all the “performant” code I have seen that deals with matrix multiplies ends up having decisions based on sizes of the input arrays…


#15

What about @code_warntype part of David’s message?


#16

What about it? Code warntype is not a performance linter. Type stable code can be slow and unstable code can be fast.