LU factorization performance issue

I am comparing times taken for LU decomposition and my julia version seems to do badly compared to other language versions. The matrix is a 10000x10000 random dense matrix generated using DLATMR.

Calling dgetrf (openblas v0.3.20) directly from fortran takes about 3.7s. Python (scipy) takes about 5.4s. Julia for some reason takes about 15s. I include the python and julia codes. I was expecting a similar timing as the other two versions. BLAS.get_config() shows that Julia is also calling an OpenBLAS library. I am only measuring the factorization time.

I am trying to find the reason for the apparent slowness of julia in this case.

Python Version
import numpy as np
import struct
import scipy.linalg as la
import time
with open("A.dat", 'rb') as f:
    data=f.read()

N,= struct.unpack('i', data[:4])
A= np.frombuffer(data[4:4+8*N*N], dtype=np.float64)
A= A.reshape((N,N))
B= np.frombuffer(data[8*N*N+4:], dtype=np.float64)
print(f"{N=}")
# print(A)
# print(B)

t1= time.time()
lu,piv= la.lu_factor(A)
t2=time.time()-t1
print(f"{t2=}")
x= la.lu_solve((lu,piv), B)
print(x[:10])
print("B-AX==0 ? :",np.allclose(A@x,B))
Julia v1.7.3
using LinearAlgebra, Statistics;
print("BLAS thread count: ",BLAS.get_num_threads(),'\n') # is 8
io=open("A.dat","r")
N=read(io,Int32)
A=Array{Float64}(undef,N,N)
B=Array{Float64}(undef,N,1)
read!(io,A)
read!(io,B)
close(io)
print(N,"\n")

t0=time()
L,U,P=lu(A)
t1=time()

X=U\(L\B[P])

print(t1-t0, '\n')
print(X[1:10], '\n')
print(sqrt(mean((B-A*X).^2)))

The matrix itself is nothing special but its a bit large to share. The config used to generate it is below.

dlatmr config
  M=10000
  N=10000
  SYM='S'
  DIST='U'
  ISEED = [0,47,2000,101]
  MODE = 5
  COND = 1000.0
  DMAX = 100D0
  RSIGN = 'T'
  GRADE = 'N'
  MODEL = 1
  MODER = 1
  PIVTNG = 'N'
  SPARSE = 0.0
  KL = M
  KU = M
  ANORM = 1.0d0
  PACK = 'N'
  LDA = M

can’t reproduce

In [1]: import numpy as np

In [2]: A = np.random.rand(10000, 10000);

In [3]: import scipy.linalg as la

In [4]: %time la.lu_factor(A)
CPU times: user 39.7 s, sys: 3.03 s, total: 42.8 s
Wall time: 3.15 s

In [5]: %time la.lu_factor(A);
CPU times: user 39.1 s, sys: 3.26 s, total: 42.4 s
Wall time: 3.04 s

Julia

julia> using LinearAlgebra

julia> const B = rand(10000, 10000);

julia> @time lu(B);
  1.990439 seconds (4 allocations: 763.016 MiB)

julia> @time lu(B);
  1.860999 seconds (4 allocations: 763.016 MiB, 0.38% gc time)
2 Likes

maybe give these two const and also don’t include Julia JIT compile time when benchmarking

Interesting. Something is wrong indeed with my installation of Julia then.

julia> using LinearAlgebra

julia> const B = rand(10000,10000);

julia> @time lu(B);
 14.362888 seconds (4 allocations: 763.016 MiB, 0.03% gc time)

julia> @time lu(B);
 14.545017 seconds (4 allocations: 763.016 MiB, 0.14% gc time)

What cpu do you have? Also, how did you install Julia?

CPU is i7-1165G7. I installed the 64bit v1.7.3 from Download Julia
OS is Windows 10

Maybe you are in a single thread session ?

Single threaded in Pluto:

11.503429 seconds (4 allocations: 763.016 MiB)

Multi-threaded in REPL:

1.879096 seconds (4 allocations: 763.016 MiB)

1 Like

I dont think that is the issue. The previous ones were single threaded. This one with 8 threads still shows same result

julia> Threads.nthreads()
8

julia> const B = rand(10000,10000);

julia> using LinearAlgebra

julia> @time lu(B);
 14.207715 seconds (4 allocations: 763.016 MiB, 0.02% gc time)

julia>

how much RAM do you have?

32GB RAM

also thanks for extracting the core problem from my mess of a post haha.

what if you first

BLAS.set_num_threads(4)

?

Doesnt help unfortunately. I tried setting it to 4 and 8. BLAS.get_num_threads was 8 by default actually.

I tried to replace Julia’s libopenblas with another one. The replacement was not ilp64. Im going to build an ilp64 version and see if that clarifies things.

The problem is coming from the openblas bundled with the installer (v1.7.3 in this case). Replacing the libopenblas64_.dll with a built from source version fixed this problem.

julia> using LinearAlgebra;

julia> const B = rand(10000,10000);

julia> @time lu(B);
  3.506253 seconds (4 allocations: 763.016 MiB, 0.09% gc time)

julia> @time lu(B);
  3.607796 seconds (4 allocations: 763.016 MiB, 0.07% gc time)

julia>

Thanks all for your help!

1 Like

I thought you downloaded from Julia website? which link did you click on

I did install it from the website.
In trying to find the issue, inside Julia’s installation folder I replaced the openblas library with one that i built from source.

Also I went to file a bug report but this was already there
https://github.com/JuliaLang/julia/issues/45090

2 Likes

yeah you should have said that as the first thing sir

I think you misunderstood. The one shipped with the installer is borked. The replacement works as expected.

I see, hopefully this is fixed in the upcoming 1.8 as discussed in the github issue.

I just tested with 1.8 rc1 and it works properly out of the box.

2 Likes