I am comparing the time taken for LU decomposition across languages, and my Julia version seems to do badly compared with the others. The matrix is a 10000x10000 random dense matrix generated with LAPACK's `DLATMR`.

Calling `dgetrf` (OpenBLAS v0.3.20) directly from Fortran takes about 3.7 s. Python (SciPy) takes about 5.4 s. Julia, for some reason, takes about 15 s. I include the Python and Julia code below; I was expecting timings similar to the other two versions. `BLAS.get_config()` shows that Julia is also calling an OpenBLAS library, and I am only measuring the factorization time.

I am trying to find the reason for the apparent slowness of Julia in this case.
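For completeness, the BLAS backend on the Python side can be checked the same way. A minimal sketch, assuming the NumPy/SciPy installs used for the timings above (the config printout includes the BLAS/LAPACK libraries they were linked against):

```python
import numpy as np
import scipy

# Print the build/link configuration; the blas/lapack entries show
# which backend (e.g. OpenBLAS) each library actually uses.
np.show_config()
scipy.show_config()
```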
Python Version
```python
import struct
import time

import numpy as np
import scipy.linalg as la

with open("A.dat", "rb") as f:
    data = f.read()
N, = struct.unpack("i", data[:4])
A = np.frombuffer(data[4:4 + 8 * N * N], dtype=np.float64)
A = A.reshape((N, N))
B = np.frombuffer(data[4 + 8 * N * N:], dtype=np.float64)
print(f"{N=}")
# print(A)
# print(B)

t1 = time.time()
lu, piv = la.lu_factor(A)
t2 = time.time() - t1
print(f"{t2=}")

x = la.lu_solve((lu, piv), B)
print(x[:10])
print("B-AX==0 ? :", np.allclose(A @ x, B))
```
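As a sanity check on the measurement itself, timing a few repetitions and taking the minimum rules out one-off effects (page faults, first-touch allocation). A sketch on a smaller stand-in matrix; `time.perf_counter` is preferable to `time.time` for measuring intervals:

```python
import time

import numpy as np
import scipy.linalg as la

rng = np.random.default_rng(0)
n = 1000  # smaller stand-in for the 10000x10000 case
A = rng.standard_normal((n, n))

# Repeat the factorization and keep the best time; the first run may
# pay one-off costs that later runs do not.
times = []
for _ in range(3):
    t0 = time.perf_counter()
    lu, piv = la.lu_factor(A)
    times.append(time.perf_counter() - t0)
print(f"best of {len(times)}: {min(times):.4f} s")
```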
Julia Version (v1.7.3)
```julia
using LinearAlgebra, Statistics

print("BLAS thread count: ", BLAS.get_num_threads(), '\n')  # is 8

io = open("A.dat", "r")
N = read(io, Int32)
A = Array{Float64}(undef, N, N)
B = Array{Float64}(undef, N, 1)
read!(io, A)
read!(io, B)
close(io)
print(N, "\n")

t0 = time()
L, U, P = lu(A)
t1 = time()
X = U \ (L \ B[P])
print(t1 - t0, '\n')
print(X[1:10], '\n')
print(sqrt(mean((B - A * X) .^ 2)))
```
The matrix itself is nothing special, but it's a bit large to share. The configuration used to generate it is below.
DLATMR config

```fortran
M = 10000
N = 10000
SYM = 'S'
DIST = 'U'
ISEED = [0, 47, 2000, 101]
MODE = 5
COND = 1000.0
DMAX = 100D0
RSIGN = 'T'
GRADE = 'N'
MODEL = 1
MODER = 1
PIVTNG = 'N'
SPARSE = 0.0
KL = M
KU = M
ANORM = 1.0d0
PACK = 'N'
LDA = M
```
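Since the binary file is too big to share, here is a rough NumPy stand-in for that `DLATMR` call (my own sketch, not `DLATMR` itself): a symmetric dense matrix whose eigenvalue magnitudes are log-uniform (MODE=5) with random signs (RSIGN='T'), prescribed condition number COND, and largest magnitude DMAX. The random stream differs from LAPACK's, so the entries won't match, but the spectrum, and therefore the factorization timing, should be comparable.

```python
import numpy as np

def make_test_matrix(n=10000, cond=1000.0, dmax=100.0, seed=47):
    """Symmetric dense matrix with 2-norm condition number `cond`.

    Rough stand-in for DLATMR with SYM='S', MODE=5, RSIGN='T'.
    """
    rng = np.random.default_rng(seed)
    # Eigenvalue magnitudes log-uniform in [1/cond, 1]; pin the two
    # endpoints so the condition number comes out exact.
    d = np.exp(rng.uniform(np.log(1.0 / cond), 0.0, size=n))
    d[0], d[-1] = 1.0, 1.0 / cond
    d *= rng.choice([-1.0, 1.0], size=n)   # random signs
    d *= dmax                              # largest magnitude = dmax
    # Random orthogonal Q via QR, then A = Q * diag(d) * Q^T.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (Q * d) @ Q.T

A = make_test_matrix(n=500)  # small demo; the timings above use n=10000
print(np.allclose(A, A.T), np.linalg.cond(A))
```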