LU factorization performance issue

Abhilash · June 5, 2022, 7:36pm

I am comparing the time taken for LU decomposition, and my Julia version does badly compared to the versions in other languages. The matrix is a 10000x10000 random dense matrix generated using DLATMR.

Calling dgetrf (OpenBLAS v0.3.20) directly from Fortran takes about 3.7s. Python (scipy) takes about 5.4s. Julia for some reason takes about 15s. I include the Python and Julia code below. I was expecting timing similar to the other two versions. BLAS.get_config() shows that Julia is also calling an OpenBLAS library. I am only measuring the factorization time.

I am trying to find the reason for the apparent slowness of Julia in this case.

Python Version

import numpy as np
import struct
import scipy.linalg as la
import time
with open("A.dat", 'rb') as f:
    data=f.read()

N,= struct.unpack('i', data[:4])
A= np.frombuffer(data[4:4+8*N*N], dtype=np.float64)
A= A.reshape((N,N))
B= np.frombuffer(data[8*N*N+4:], dtype=np.float64)
print(f"{N=}")
# print(A)
# print(B)

t1= time.time()
lu,piv= la.lu_factor(A)
t2=time.time()-t1
print(f"{t2=}")
x= la.lu_solve((lu,piv), B)
print(x[:10])
print("B-AX==0 ? :",np.allclose(A@x,B))

Julia v1.7.3

using LinearAlgebra, Statistics;
print("BLAS thread count: ",BLAS.get_num_threads(),'\n') # is 8
io=open("A.dat","r")
N=read(io,Int32)
A=Array{Float64}(undef,N,N)
B=Array{Float64}(undef,N,1)
read!(io,A)
read!(io,B)
close(io)
print(N,"\n")

t0=time()
L,U,P=lu(A)
t1=time()

X=U\(L\B[P])

print(t1-t0, '\n')
print(X[1:10], '\n')
print(sqrt(mean((B-A*X).^2)))
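
Note that the two scripts represent the row permutation differently: iterating Julia's lu result yields a permutation vector (hence B[P]), while scipy's lu_factor returns LAPACK-style pivot indices, where row i was swapped with row piv[i]. A small Python sketch of converting one to the other follows; pivots_to_permutation is a hypothetical helper and the 6x6 random matrix is stand-in data, not the matrix from this post.

```python
import numpy as np
import scipy.linalg as la


def pivots_to_permutation(piv):
    """Turn LAPACK-style pivot indices (0-based) into a permutation vector."""
    perm = np.arange(len(piv))
    for i, p in enumerate(piv):
        # Replay the row swap that the factorization performed at step i.
        perm[i], perm[p] = perm[p], perm[i]
    return perm


rng = np.random.default_rng(0)
a = rng.random((6, 6))  # small stand-in matrix (assumption)

lu, piv = la.lu_factor(a)
perm = pivots_to_permutation(piv)

# Unpack the compact factors: unit lower-triangular L and upper-triangular U.
l = np.tril(lu, k=-1) + np.eye(6)
u = np.triu(lu)

# The permuted rows of A should equal L @ U, like A[p, :] in Julia.
print(np.allclose(a[perm], l @ u))
```

In practice la.lu_solve applies the pivots itself, so this conversion is only needed when comparing the factors across the two languages.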

The matrix itself is nothing special, but it’s a bit large to share. The config used to generate it is below.

dlatmr config

  M=10000
  N=10000
  SYM='S'
  DIST='U'
  ISEED = [0,47,2000,101]
  MODE = 5
  COND = 1000.0
  DMAX = 100D0
  RSIGN = 'T'
  GRADE = 'N'
  MODEL = 1
  MODER = 1
  PIVTNG = 'N'
  SPARSE = 0.0
  KL = M
  KU = M
  ANORM = 1.0d0
  PACK = 'N'
  LDA = M
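
One detail worth checking when comparing the two scripts above: Julia's read! fills A in column-major order, while NumPy's reshape defaults to row-major, so the same bytes are interpreted as transposes of each other. With SYM='S' the generated matrix is symmetric, so this should not change the results here, but it would for a general matrix. A minimal Python sketch of a reader that matches the Julia layout follows. It assumes the file holds an int32 N, then N*N float64 values in column-major order, then N float64 values; read_system and A_small.dat are hypothetical names used only for illustration.

```python
import os
import struct
import tempfile

import numpy as np


def read_system(path):
    """Read int32 N, an N x N float64 matrix (column-major), and N float64 values."""
    with open(path, "rb") as f:
        n = int(np.fromfile(f, dtype=np.int32, count=1)[0])
        a = np.fromfile(f, dtype=np.float64, count=n * n)
        # order="F" matches how Julia's read! fills an Array.
        a = a.reshape((n, n), order="F")
        b = np.fromfile(f, dtype=np.float64, count=n)
    return n, a, b


# Round-trip check on a tiny, deliberately non-symmetric matrix.
a_ref = np.arange(9, dtype=np.float64).reshape(3, 3)
b_ref = np.array([1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "A_small.dat")
    with open(path, "wb") as f:
        f.write(struct.pack("i", 3))
        f.write(a_ref.tobytes(order="F"))  # column-major, as assumed above
        f.write(b_ref.tobytes())
    n_read, a_read, b_read = read_system(path)

print(n_read, np.array_equal(a_read, a_ref), np.array_equal(b_read, b_ref))
```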

jling · June 5, 2022, 7:53pm

can’t reproduce

In [1]: import numpy as np

In [2]: A = np.random.rand(10000, 10000);

In [3]: import scipy.linalg as la

In [4]: %time la.lu_factor(A)
CPU times: user 39.7 s, sys: 3.03 s, total: 42.8 s
Wall time: 3.15 s

In [5]: %time la.lu_factor(A);
CPU times: user 39.1 s, sys: 3.26 s, total: 42.4 s
Wall time: 3.04 s

Julia

julia> using LinearAlgebra

julia> const B = rand(10000, 10000);

julia> @time lu(B);
  1.990439 seconds (4 allocations: 763.016 MiB)

julia> @time lu(B);
  1.860999 seconds (4 allocations: 763.016 MiB, 0.38% gc time)

jling · June 5, 2022, 7:54pm

maybe declare these two as const, and also don’t include Julia’s JIT compile time when benchmarking
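
For a like-for-like comparison, the same benchmarking advice can be sketched on the Python side: run the factorization once untimed, then time several repeats and report the best. Python has no JIT, so the warm-up here only absorbs one-time costs such as library loading. The helper name time_lu and the 500x500 test size are assumptions chosen to keep the example quick.

```python
import time

import numpy as np
import scipy.linalg as la


def time_lu(a, repeats=3):
    """Return wall-clock timings of la.lu_factor(a), excluding one warm-up call."""
    la.lu_factor(a)  # warm-up, not timed
    timings = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        la.lu_factor(a)
        timings.append(time.perf_counter() - t0)
    return timings


rng = np.random.default_rng(0)
a = rng.random((500, 500))  # small stand-in for the 10000x10000 case

timings = time_lu(a)
print(f"best of {len(timings)}: {min(timings):.4f} s")
```

Taking the minimum rather than the mean reduces the influence of other processes competing for the CPU during a run.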

Abhilash · June 5, 2022, 8:04pm

Interesting. Something is wrong indeed with my installation of Julia then.

julia> using LinearAlgebra

julia> const B = rand(10000,10000);

julia> @time lu(B);
 14.362888 seconds (4 allocations: 763.016 MiB, 0.03% gc time)

julia> @time lu(B);
 14.545017 seconds (4 allocations: 763.016 MiB, 0.14% gc time)

Oscar_Smith · June 5, 2022, 8:04pm

What cpu do you have? Also, how did you install Julia?

Abhilash · June 5, 2022, 8:06pm

CPU is i7-1165G7. I installed the 64-bit v1.7.3 from the Download Julia page.
OS is Windows 10

ederag · June 5, 2022, 8:18pm

Maybe you are in a single-threaded session?

Single threaded in Pluto:

11.503429 seconds (4 allocations: 763.016 MiB)

Multi-threaded in REPL:

1.879096 seconds (4 allocations: 763.016 MiB)

Abhilash · June 5, 2022, 8:22pm

I don’t think that is the issue. The previous runs were single-threaded. This one, with 8 threads, still shows the same result:

julia> Threads.nthreads()
8

julia> const B = rand(10000,10000);

julia> using LinearAlgebra

julia> @time lu(B);
 14.207715 seconds (4 allocations: 763.016 MiB, 0.02% gc time)

julia>

jling · June 5, 2022, 8:24pm

how much RAM do you have?

Abhilash · June 5, 2022, 8:25pm

32GB RAM

also thanks for extracting the core problem from my mess of a post haha.

jling · June 5, 2022, 8:34pm

what if you first

BLAS.set_num_threads(4)

?

Abhilash · June 5, 2022, 8:45pm

Doesn’t help, unfortunately. I tried setting it to 4 and 8. BLAS.get_num_threads() was actually 8 by default.

Abhilash · June 5, 2022, 8:48pm

I tried to replace Julia’s libopenblas with another one. The replacement was not ILP64. I’m going to build an ILP64 version and see if that clarifies things.

Abhilash · June 5, 2022, 9:03pm

The problem comes from the OpenBLAS bundled with the installer (v1.7.3 in this case). Replacing libopenblas64_.dll with a version built from source fixed the problem.

julia> using LinearAlgebra;

julia> const B = rand(10000,10000);

julia> @time lu(B);
  3.506253 seconds (4 allocations: 763.016 MiB, 0.09% gc time)

julia> @time lu(B);
  3.607796 seconds (4 allocations: 763.016 MiB, 0.07% gc time)

julia>

Thanks all for your help!

jling · June 5, 2022, 9:20pm

I thought you downloaded from Julia website? which link did you click on

Abhilash · June 5, 2022, 9:24pm

I did install it from the website.
While trying to find the issue, I replaced the OpenBLAS library inside Julia’s installation folder with one that I built from source.

Also I went to file a bug report but this was already there
https://github.com/JuliaLang/julia/issues/45090

jling · June 5, 2022, 9:26pm

yeah you should have said that as the first thing sir

Abhilash · June 5, 2022, 9:28pm

I think you misunderstood. The one shipped with the installer is borked. The replacement works as expected.

jling · June 5, 2022, 9:28pm

I see, hopefully this is fixed in the upcoming 1.8 as discussed in the github issue.

Abhilash · June 5, 2022, 11:02pm

I just tested with 1.8 rc1 and it works properly out of the box.
