new here in julia. i have tested AOCL, Openblas, intelMKL in my ryzen 3900x home computer. this the benchmark result by fortran and juliat version. as you can see AOCL actualy quite good. but still mkl wins eventhough in AMD. julia blas mkl is faster in dgetrf/dgetri than fortran but same performance in dgesv. hope there is chance AOCL lib getting involved in julia.
C:\Temp>aocl
Inside arraySub() allocating memory ...
Memory allocation sufficient
Benchmark running, hopefully as only ACTIVE task
Number of OpenMP threads: 24
Using system-configured threads: 24
Final BLAS thread count: 24
Test1: Gauss1OMP- 1000 ( 250x 250) inverts in 0.675 seconds Err= 0.6357568671408161E-14
Test2: CROUT1OMP- 1000 ( 250x 250) inverts in 0.967 seconds Err= 0.7115810242203004E-14
Calling CroutMT- Bigsize - Inverse1
Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT - 2 ( 2000x 2000) inverts in 2.811 seconds Err= 0.4444341847960061E-09
Test4: DGETRF - 2 ( 2000x 2000) inverts in 0.238 seconds Err= 0.1480195511849177E-11
Test5: DGETRF2 - 2 ( 2000x 2000) inverts in 0.232 seconds Err= 0.1480195511849177E-11
Test6: DGESV - 2 ( 2000x 2000) inverts in 0.245 seconds Err= 0.2163458817503340E-11
Total = 5.2 sec
C:\Temp>openblas
Inside arraySub() allocating memory ...
Memory allocation sufficient
Benchmark running, hopefully as only ACTIVE task
Number of OpenMP threads: 24
Using system-configured threads: 24
Final BLAS thread count: 24
Test1: Gauss1OMP- 1000 ( 250x 250) inverts in 0.559 seconds Err= 0.4285718182520538E-14
Test2: CROUT1OMP- 1000 ( 250x 250) inverts in 0.665 seconds Err= 0.1317388121697570E-13
Calling CroutMT- Bigsize - Inverse1
Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT - 2 ( 2000x 2000) inverts in 2.823 seconds Err= 0.1646278156525469E-09
Test4: DGETRF - 2 ( 2000x 2000) inverts in 0.520 seconds Err= 0.3912745006479850E-12
Test5: DGETRF2 - 2 ( 2000x 2000) inverts in 0.357 seconds Err= 0.3746751382358156E-12
Test6: DGESV - 2 ( 2000x 2000) inverts in 0.282 seconds Err= 0.4674058187435756E-12
Total = 5.2 sec
C:\Temp>mkl
Inside arraySub() allocating memory ...
Memory allocation sufficient
Benchmark running, hopefully as only ACTIVE task
Number of OpenMP threads: 24
Using system-configured threads: 24
Final BLAS thread count: 24
Test1: Gauss1OMP- 1000 ( 250x 250) inverts in 0.668 seconds Err= 0.5603592445228368E-14
Test2: CROUT1OMP- 1000 ( 250x 250) inverts in 0.821 seconds Err= 0.7494986656780616E-14
Calling CroutMT- Bigsize - Inverse1
Calling CroutMT- Bigsize - Inverse2
Test3: CROUTMT - 2 ( 2000x 2000) inverts in 2.818 seconds Err= 0.7692472547307751E-09
Test4: DGETRF - 2 ( 2000x 2000) inverts in 0.282 seconds Err= 0.1694665710140913E-11
Test5: DGETRF2 - 2 ( 2000x 2000) inverts in 0.273 seconds Err= 0.1791571555309279E-11
Test6: DGESV - 2 ( 2000x 2000) inverts in 0.161 seconds Err= 0.1742011662433139E-11
Total = 5.0 sec
C:\Temp>julia -t 24 -O2 test_fpu7.jl.txt
Threading Summary:
Base.Threads Configuration:
Threads.nthreads(): 24
BLAS Configuration:
BLAS vendor: lbt
BLAS.get_num_threads(): 12
BLAS backend libraries:
LBTConfig([ILP64] mkl_rt.2.dll, [LP64] mkl_rt.2.dll)
Benchmark running, hopefully as only ACTIVE task
Current time: 2025-05-14T02:26:42.205
Test1: GaussST - 1000 (250x250) inverts in 6.253 seconds Err= 4.2673054395647724e-15
Test2: CroutST - 1000 (250x250) inverts in 2.243 seconds Err= 1.5946885695561353e-12
Test3: RecursiST- 1000 (250x250) inverts in 1.293 seconds Err= 2.4250148167423610e-14
Test4: CroutMT - 1000 (250x250) inverts in 4.399 seconds Err= 1.5946885695561353e-12
Test5: CroutMT - 2 (2000x2000) inverts in 3.565 seconds Err= 3.4721243365073723e-10
Test6: DGETRF/I - 2 (2000x2000) inverts in 0.196 seconds Err= 5.2270673897888225e-13
Test7: DGESV - 2 (2000x2000) inverts in 0.166 seconds Err= 5.2866519987893978e-13
Test8: inv(A) - 2 (2000x2000) inverts in 0.196 seconds Err= 5.2270673897888225e-13
Test9: RecursiMT- 2 (2000x2000) inverts in 1.008 seconds Err= 5.7157070595421282e-13
Total = 19.3 sec
Fortran version is TEST_CPU Multithread Version of Polyhedron Benchmark - Pastebin.com
And julia version
using Random, LinearAlgebra, Dates, Printf
using LoopVectorization
using .Threads
using LinearAlgebra.BLAS
BLAS.set_num_threads(Sys.CPU_THREADS)
using RecursiveFactorization
using MKL
println("\nThreading Summary:")
println("\nBase.Threads Configuration:")
println("Threads.nthreads(): ", Threads.nthreads()) # Julia's multithreading
# Check which BLAS library is being used
println("\nBLAS Configuration:")
println("BLAS vendor: ", BLAS.vendor()) # MKL, OpenBLAS, etc.
# Get BLAS thread count (works for MKL/OpenBLAS)
println("BLAS.get_num_threads(): ", BLAS.get_num_threads())
println("BLAS backend libraries:")
println(BLAS.get_config())
using LinearAlgebra, LoopVectorization
function croutmt_inverse_optimized(A::Matrix{Float64})
n = size(A, 1)
L = zeros(n, n)
U = Matrix{Float64}(I, n, n) # U starts as identity
# Crout factorization (partially parallelized where possible)
@inbounds for j in 1:n
# Compute column j of L (can be partially vectorized)
@turbo for i in j:n
sum_val = zero(Float64)
for k in 1:j-1
sum_val += L[i,k] * U[k,j]
end
L[i,j] = A[i,j] - sum_val
end
# Check for singularity
if abs(L[j,j]) < eps(Float64) * n * sqrt(n)
error("Matrix is numerically singular during Crout factorization at step $j")
end
# Compute row j of U (right of diagonal)
inv_Ljj = 1.0 / L[j,j]
@turbo for i in j+1:n
sum_val = zero(Float64)
for k in 1:j-1
sum_val += L[j,k] * U[k,i]
end
U[j,i] = (A[j,i] - sum_val) * inv_Ljj
end
end
# Parallel inverse computation with better memory patterns
Ainv = zeros(n, n)
Threads.@threads for i in 1:n # Column-wise parallelization
e = zeros(n)
e[i] = 1.0
# Forward substitution (L * y = e)
y = zeros(n)
@inbounds @simd for row in 1:n
sum_val = zero(Float64)
@turbo for col in 1:row-1
sum_val += L[row, col] * y[col]
end
y[row] = (e[row] - sum_val) / L[row, row]
end
# Backward substitution (U * x = y)
x = zeros(n)
@inbounds for row in n:-1:1
sum_val = zero(Float64)
@turbo for col in row+1:n
sum_val += U[row, col] * x[col]
end
x[row] = (y[row] - sum_val) / U[row, row]
end
@turbo for k in 1:n # Faster column assignment
Ainv[k,i] = x[k]
end
end
return Ainv
end
function check_threads()
println("Number of threads being used: ", nthreads())
println("(Configure with Threads.nthreads() = N or JULIA_NUM_THREADS env var)")
end
function crout_inverse_single_threaded(A::Matrix{Float64})
n = size(A, 1)
L = zeros(n, n)
U = Matrix{Float64}(I, n, n)
block_size = 64 # Cache-friendly block size
# Optimized Crout factorization
@inbounds for j in 1:n
# Compute column j of L (cache-blocked)
for i_block in j:block_size:n
i_end = min(i_block + block_size - 1, n)
@turbo for i in i_block:i_end
sum_val = 0.0
for k in 1:j-1
sum_val += L[i,k] * U[k,j]
end
L[i,j] = A[i,j] - sum_val
end
end
# Check singularity
if abs(L[j,j]) < eps(Float64) * n * sqrt(n)
error("Matrix is numerically singular")
end
# Compute row j of U (cache-blocked)
inv_Ljj = 1.0 / L[j,j]
for i_block in j+1:block_size:n
i_end = min(i_block + block_size - 1, n)
@turbo for i in i_block:i_end
sum_val = 0.0
for k in 1:j-1
sum_val += L[j,k] * U[k,i]
end
U[j,i] = (A[j,i] - sum_val) * inv_Ljj
end
end
end
# Single-threaded solve for inverse columns
Ainv = zeros(n, n)
for i in 1:n
e = zeros(n)
e[i] = 1.0
y = zeros(n)
x = zeros(n)
# Forward substitution (optimized)
@inbounds for row in 1:n
sum_val = 0.0
@turbo for col in 1:row-1
sum_val += L[row,col] * y[col]
end
y[row] = (e[row] - sum_val) / L[row,row]
end
# Backward substitution (optimized)
@inbounds for row in n:-1:1
sum_val = 0.0
@turbo for col in row+1:n
sum_val += U[row,col] * x[col]
end
x[row] = (y[row] - sum_val) / U[row,row]
end
Ainv[:,i] .= x
end
return Ainv
end
# Thread-safe Gauss-Jordan implementation
function gauss_safe!(A::Matrix{Float64})
n = size(A, 1)
B = copy(A)
I_matrix = Matrix{Float64}(I, n, n)
block_size = 64 # Cache-friendly block size
@inbounds for k in 1:n
# Pivoting (sequential)
pivot_row = k
for i in k+1:n
if abs(B[i, k]) > abs(B[pivot_row, k])
pivot_row = i
end
end
if pivot_row != k
B[k, :], B[pivot_row, :] = B[pivot_row, :], B[k, :]
I_matrix[k, :], I_matrix[pivot_row, :] = I_matrix[pivot_row, :], I_matrix[k, :]
end
pivot = B[k, k]
if abs(pivot) < eps(Float64) * n * sqrt(n)
error("Matrix is numerically singular")
end
# Scale pivot row (parallel safe)
scale = 1.0 / pivot
@inbounds @simd for j in 1:n
B[k, j] *= scale
I_matrix[k, j] *= scale
end
# Elimination (parallel safe)
@inbounds for i in 1:n
if i != k
factor = B[i, k]
@simd for j in 1:n
B[i, j] -= factor * B[k, j]
I_matrix[i, j] -= factor * I_matrix[k, j]
end
end
end
end
return I_matrix
end
# === Selected Implementations ===
# Standard Serial Gauss-Jordan Elimination based Inverse
using LoopVectorization
# Function to compute the inverse using LAPACK's dgetrf! and dgetri!
function lapack_inverse(A::Matrix{Float64})
n = size(A, 1)
n != size(A, 2) && error("Matrix must be square")
# Create a copy since we modify in-place
A_copy = copy(A)
# Step 1: LU factorization (getrf!)
result = LinearAlgebra.LAPACK.getrf!(A_copy)
A_lu = result[1]
ipiv = result[2]
info_getrf = result[3]
if info_getrf > 0
error("Matrix is singular in getrf! at position $info_getrf")
elseif info_getrf < 0
error("Invalid argument in getrf! at position $(-info_getrf)")
end
# Step 2: Matrix inversion (getri!)
A_inv = LinearAlgebra.LAPACK.getri!(A_lu, ipiv)
end
# Function to compute the inverse using LAPACK's dgesv! gesv!(A, B) -> (B, A, ipiv)
function lapack_inverse2(A::Matrix{Float64})
n = size(A, 1)
# Create a copy of A (since gesv! destroys it) and the identity matrix B
A_copy = copy(A)
B = Matrix{Float64}(I, n, n) # Identity matrix of same size
# Solve A_copy * X = B, which gives X = Aβ»ΒΉ
# gesv! modifies A_copy (LU factors) and B (becomes the solution X)
result = LinearAlgebra.LAPACK.gesv!(A_copy,B)
A_lu = result[1]
A_inv = result[2]
return B # The inverse
end
using RecursiveFactorization, LinearAlgebra
function recursivefactor_inverse(A::Matrix{Float64})
n = size(A, 1)
# LU factorization with recursive blocking
F = RecursiveFactorization.lu!(copy(A))
# Allocate output matrix
Ainv = similar(A)
# Solve A * Ainv[:,i] = I[:,i] for each column
for i in 1:n
e = zeros(n)
e[i] = 1.0
Ainv[:, i] = F \ e # Forward/backward substitution
end
return Ainv
end
function recursivefactormt_inverse(A::Matrix{Float64})
n = size(A, 1)
# LU factorization (recursive, single-threaded)
F = RecursiveFactorization.lu!(copy(A))
Ainv = similar(A)
# Multithreaded column-wise back-substitution
Threads.@threads for i in 1:n
e = zeros(n)
e[i] = 1.0
Ainv[:, i] = F \ e # Each thread solves one column
end
return Ainv
end
using StaticArrays, LinearAlgebra
function staticarray_inverse(A::Matrix{Float64})
# Convert to StaticArray (compile-time optimized)
SA = SMatrix{size(A,1),size(A,2)}(A)
# Compute inverse (fully unrolled at compile time)
SA_inv = inv(SA)
# Convert back to regular Matrix (if needed)
return Matrix(SA_inv)
end
function test_fpu()
# Constants
smallsize = 250
smallits = 1000
bigsize = 2000
println("Benchmark running, hopefully as only ACTIVE task")
println("Current time: ", now())
# Warm-up runs
warmup = rand(10,10)
gauss_safe!(copy(warmup))
crout_inverse_single_threaded(copy(warmup))
# Initialize random number pools
Random.seed!(1234) # Fixed seed for reproducibility
pool = rand(smallsize, smallsize, smallits)
pool3 = rand(bigsize, bigsize)
# Working matrices
a = zeros(smallsize, smallsize)
a3 = zeros(bigsize, bigsize)
c = zeros(smallsize, smallsize)
# Timing results
dt = zeros(10)
for n in 1:10
clock1 = time_ns()
if n == 1
# Test 1: Optimized Gauss inversion
# Parallelize across matrices (thread-safe)
Threads.@threads for i in 1:smallits
local_a = copy(@view pool[:, :, i])
local_a = gauss_safe!(local_a) # invert
local_a = gauss_safe!(local_a) # invert back
if i == smallits
a .= local_a # Only last iteration for error check
end
end
# Calculate error (compare to original)
avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test1: GaussST - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
smallits, smallsize, smallsize, dt[n], avg_err)
elseif n == 2
# Test 2: Crout inversion crout_inverse_single_threaded
Threads.@threads for i in 1:smallits
local_a = copy(@view pool[:, :, i])
local_a = crout_inverse_single_threaded(local_a) # invert
local_a = crout_inverse_single_threaded(local_a) # invert back
if i == smallits
a .= local_a # Only last iteration for error check
end
end
# Calculate error (compare to original)
avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test2: CroutST - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
smallits, smallsize, smallsize, dt[n], avg_err)
elseif n == 3
# Test 3: recursivefactor_inverse
Threads.@threads for i in 1:smallits
local_a = copy(@view pool[:, :, i])
local_a = recursivefactor_inverse(local_a) # invert
local_a = recursivefactor_inverse(local_a) # invert back
if i == smallits
a .= local_a # Only last iteration for error check
end
end
# Calculate error (compare to original)
avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test3: RecursiST- %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
smallits, smallsize, smallsize, dt[n], avg_err)
elseif n == 4
# Test 4: Crout mt inversion crout_inverse_threaded
@inbounds for i in 1:smallits
local_a = copy(@view pool[:, :, i])
local_a = croutmt_inverse_optimized(local_a) # invert
local_a = croutmt_inverse_optimized(local_a) # invert back
if i == smallits
a .= local_a # Only last iteration for error check
end
end
# Calculate error (compare to original)
avg_err = sum(abs.(a .- pool[:, :, smallits]))/(smallsize*smallsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test4: CroutMT - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
smallits, smallsize, smallsize, dt[n], avg_err)
elseif n == 5
# Test 5 Crout inversion Largesize
a3 = copy(pool3)
a3 = croutmt_inverse_optimized(a3) # invert a
a3 = croutmt_inverse_optimized(a3) # invert back
# Calculate error (compare to original)
avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test5: CroutMT - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
2, bigsize, bigsize, dt[n], avg_err)
elseif n == 6
# Test 6: Lapack inversion Largesize
a3 = copy(pool3)
a3 = lapack_inverse(a3) # invert a
a3 = lapack_inverse(a3) # invert back
# Calculate error (compare to original)
avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test6: DGETRF/I - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
2, bigsize, bigsize, dt[n], avg_err)
elseif n == 7
# Test 7: Lapack2 inversion Largesize
a3 = copy(pool3)
a3 = lapack_inverse2(a3) # invert a
a3 = lapack_inverse2(a3) # invert back
# Calculate error (compare to original)
avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test7: DGESV - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
2, bigsize, bigsize, dt[n], avg_err)
elseif n == 8
# Test 8: native
a3 = copy(pool3)
a3 = inv(a3) # invert a
a3 = inv(a3) # invert back
# Calculate error (compare to original)
avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test8: inv(A) - %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
2, bigsize, bigsize, dt[n], avg_err)
elseif n == 9
# Test 9: recursive
a3 = copy(pool3)
a3 = recursivefactormt_inverse(a3) # invert a
a3 = recursivefactormt_inverse(a3) # invert back
# Calculate error (compare to original)
avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
clock2 = time_ns()
dt[n] = (clock2 - clock1)/1e9
@printf("Test9: RecursiMT- %d (%dx%d) inverts in %.3f seconds Err= %.16e\n",
2, bigsize, bigsize, dt[n], avg_err)
end
end
@printf(" Total = %.1f sec\n", sum(dt))
println()
end
# Run the benchmark
test_fpu()
when i checked double inverse singular matrix, error is still resonable. === Testing Singular/Ill-Conditioned Matrices (250Γ250) ===
Matrix: Singular
CroutMT : Error = 1.138e-02 (inverted β inverted-back)
LAPACK : Error = 2.081e-02 (inverted β inverted-back)
RecursiveMT : Error = 6.575e-02 (inverted β inverted-back)