AOCL (not MKL) acceleration on AMD Ryzen CPUs

Hi @Roberto_Bagyo, I still have to look at your code carefully, but I have already noticed an error in your Julia code.
To use the MKL library, it has to be imported before LinearAlgebra; otherwise your code uses the default OpenBLAS.
So the header of the Julia script should be:

using MKL
using Random, LinearAlgebra, Dates, Printf
using LoopVectorization
using .Threads
using LinearAlgebra.BLAS
BLAS.set_num_threads(Sys.CPU_THREADS)
using RecursiveFactorization
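
To confirm the swap took effect, a quick check is to print the active libblastrampoline backend; mkl_rt.2.dll should then be listed instead of libopenblas64_.dll:

println(BLAS.get_config())   # expect LBTConfig([ILP64] mkl_rt.2.dll, [LP64] mkl_rt.2.dll)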

Could you repeat the benchmark with this correction, please?

Yes, certainly. using MKL was placed below using LinearAlgebra when I was benchmarking the inverse with Fortran OpenBLAS, Fortran Intel MKL, Fortran AOCL, and Julia MKL. As we can see, Julia was actually using MKL; see the printout:
BLAS backend libraries:
LBTConfig([ILP64] mkl_rt.2.dll, [LP64] mkl_rt.2.dll)

But I will run it again with using MKL at the top.

C:\Temp>julia -t 24 -O3 julia_inv.jl

Threading Summary:

Base.Threads Configuration:
Threads.nthreads(): 24

BLAS Configuration:
BLAS vendor: lbt
BLAS.get_num_threads(): 12
BLAS backend libraries:
LBTConfig([ILP64] mkl_rt.2.dll, [LP64] mkl_rt.2.dll)
BLAS Vendor: lbt
BLAS Library Path Details:
  Interface: ilp64
  Path: C:\Users\arypr\.julia\artifacts\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\bin\mkl_rt.2.dll
  Interface: lp64
  Path: C:\Users\arypr\.julia\artifacts\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\bin\mkl_rt.2.dll
Benchmark running, hopefully as only ACTIVE task
Current time: 2025-05-21T23:49:57.776
Test1: GaussST  - 1000 (250x250) inverts in 6.223 seconds  Err= 4.2673054395647724e-15
Test2: CroutST  - 1000 (250x250) inverts in 2.174 seconds  Err= 1.5946885695561353e-12
Test3: RecursiST- 1000 (250x250) inverts in 1.346 seconds  Err= 2.4250148167423610e-14
Test4: CroutMT  - 1000 (250x250) inverts in 4.365 seconds  Err= 1.5946885695561353e-12
Test5: CroutMT  - 2 (2000x2000)  inverts in 3.632 seconds  Err= 3.4721243365073723e-10
Test6: DGETRF/I - 2 (2000x2000)  inverts in 0.190 seconds  Err= 5.2270673897888225e-13
Test7: DGESV    - 2 (2000x2000)  inverts in 0.204 seconds  Err= 5.2866519987893978e-13
Test8: inv(A)   - 2 (2000x2000)  inverts in 0.235 seconds  Err= 5.2270673897888225e-13
Test9: RecursiMT- 2 (2000x2000)  inverts in 1.058 seconds  Err= 5.7157070595421282e-13
                             Total = 19.4 sec

Today I am able to compare BLAS DGESV head-to-head in a single Julia script. Here OpenBLAS is left as the default LAPACK/BLAS backend, so inv(A) uses the default BLAS, while MKL (mkl_rt.2.dll) and AOCL (AOCL-LibFlame-Win-MT-dll.dll) are loaded explicitly from their own paths. Here is the input file:


using Random, LinearAlgebra, Dates, Printf
using LinearAlgebra.BLAS
BLAS.set_num_threads(Sys.CPU_THREADS)
using Libdl # Added for explicit loading of external LAPACK/BLAS libraries

# set AOCL and MKL path
# Define the library names at the global scope for both ILP64 and LP64.
# Provide full paths to the AOCL-LibFlame-Win-MT-dll.dll for each version.
const LIBFLAME_ILP64 = "C:\\Program Files\\AMD\\AOCL-Windows\\amd-libflame\\lib\\ILP64\\AOCL-LibFlame-Win-MT-dll.dll" # Assuming ILP64 DLL
const LIBFLAME_LP64 = "C:\\Program Files\\AMD\\AOCL-Windows\\amd-libflame\\lib\\LP64\\AOCL-LibFlame-Win-MT-dll.dll" # Assuming LP64 DLL 
const LIBMKL = "C:\\Users\\arypr\\.julia\\artifacts\\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\\bin\\mkl_rt.2.dll" # this artifact path exists if the MKL package was installed via import Pkg; Pkg.add("MKL")

# Explicitly load the libraries to ensure they are found
global aocl_libflame_ilp64_handle = nothing
global aocl_libflame_lp64_handle = nothing
global mkl_lib_handle = nothing
# try loading external DLL AOCL and MKL
try
    global aocl_libflame_ilp64_handle = Libdl.dlopen(LIBFLAME_ILP64)
    println("Successfully loaded AOCL-LibFlame-Win-MT-dll.dll (ILP64).")
catch e
    @error "Could not load AOCL-LibFlame-Win-MT-dll.dll (ILP64). Please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global aocl_libflame_lp64_handle = Libdl.dlopen(LIBFLAME_LP64)
    println("Successfully loaded AOCL-LibFlame-Win-MT-dll.dll (LP64).")
catch e
    @warn "Could not load AOCL-LibFlame-Win-MT-dll.dll (LP64). If you intend to use only ILP64, this warning can be ignored. Otherwise, please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global mkl_lib_handle = Libdl.dlopen(LIBMKL)
    println("Successfully loaded MKL library.")
catch e
    @error "Could not load MKL library. Please ensure the path is correct." exception=(e, catch_backtrace())
end


# Check which BLAS library is being used
println("\nBLAS Configuration:")
println("BLAS vendor: ", BLAS.vendor())  # MKL, OpenBLAS, etc.

# Get BLAS thread count (works for MKL/OpenBLAS)
println("BLAS.get_num_threads(): ", BLAS.get_num_threads())
println("BLAS backend libraries:")
println(BLAS.get_config())
println("BLAS Vendor: ", BLAS.vendor()) # Identifies :MKL, :OpenBLAS, :BLIS, etc.
println("BLAS Library Path Details:")
config = BLAS.get_config()
for lib in config.loaded_libs
    println("  Interface: ", lib.interface)
    println("  Path: ", lib.libname)
    # The path might contain version numbers, e.g., .../mkl/2024.0/...
    # or .../openblas/0.3.23/...
end


# CPU model
import Base.Sys
println("\nCPU Information (from Base.Sys) :")
println("CPU Microarchitecture (Sys.CPU_NAME): ", Sys.CPU_NAME)
println("Number of logical CPU threads (Sys.CPU_THREADS): ", Sys.CPU_THREADS)
cpu_infos = Sys.cpu_info()
if !isempty(cpu_infos)
    println("  Model (from first core): ", cpu_infos[1].model)
    for i in 1:min(length(cpu_infos), Sys.CPU_THREADS) # Print for each logical thread up to CPU_THREADS
        # Speed might vary or be 0 if not determinable for all cores this way
        println("  Logical Core $(i-1): Speed: ", cpu_infos[i].speed, " MHz")
    end
else
    println("  Could not retrieve detailed CPU model/speed information from Sys.cpu_info().")
end
println()

# === Selected Implementations ===
# Function to compute the inverse using LAPACK's dgetrf! and dgetri!
function lapack_inverse(A::Matrix{Float64})
    n = size(A, 1)
    n != size(A, 2) && error("Matrix must be square")

    # Create a copy since we modify in-place
    A_copy = copy(A)

    # Step 1: LU factorization; getrf! returns (A, ipiv, info)
    A_lu, ipiv, info_getrf = LinearAlgebra.LAPACK.getrf!(A_copy)

    if info_getrf > 0
        error("Matrix is singular in getrf! at position $info_getrf")
    elseif info_getrf < 0
        error("Invalid argument in getrf! at position $(-info_getrf)")
    end

    # Step 2: Matrix inversion (getri!)
    A_inv = LinearAlgebra.LAPACK.getri!(A_lu, ipiv)
    return A_inv
end

# Function to compute the inverse using LAPACK's dgesv: gesv!(A, B) -> (B, A, ipiv)
function lapack_inverse2(A::Matrix{Float64})
    n = size(A, 1)
    # Create a copy of A (since gesv! destroys it) and the identity matrix B
    A_copy = copy(A)
    B = Matrix{Float64}(I, n, n)  # Identity matrix of the same size

    # Solve A_copy * X = B, which gives X = A⁻¹.
    # gesv! modifies A_copy (it becomes the LU factors) and B (it becomes the solution X)
    LinearAlgebra.LAPACK.gesv!(A_copy, B)

    return B  # The inverse
end
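
# Hypothetical sanity check (not part of the benchmark): both helpers should
# agree with Julia's built-in inv() to roundoff on a small random matrix:
#   A = rand(8, 8)
#   maximum(abs.(lapack_inverse(A)  .- inv(A)))   # ~1e-15
#   maximum(abs.(lapack_inverse2(A) .- inv(A)))   # ~1e-15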


# Calls the DGESV routine from AOCL's LIBFLAME library using the ILP64 interface
# to solve $AX = B$. On exit, `A` contains the LU factorization of the original
# matrix, and `B` contains the solution matrix $X$.
# Arguments:
# - `A::Matrix{Float64}`: The coefficient matrix. It will be overwritten with the LU factors.
# - `B::Matrix{Float64}`: The right-hand side matrix. It will be overwritten with the solution.

function _aocl_dgesv_ilp64!(A::Matrix{Float64}, B::Matrix{Float64})
    # Check if the ILP64 library was loaded
    if aocl_libflame_ilp64_handle === nothing
        error("AOCL ILP64 library not loaded. Cannot call _aocl_dgesv_ilp64!.")
    end

    # Get dimensions
    n = size(A, 1)
    nrhs = size(B, 2)
    lda = n
    ldb = n

    # Allocate memory for pivot indices.
    # In the ILP64 convention all LAPACK integers, including IPIV, are 64-bit.
    ipiv = Vector{Int64}(undef, n)

    # INFO variable to store the return code from DGESV
    info = Ref{Int64}(0) # DGESV's INFO is a 64-bit integer in ILP64

    # Call DGESV from AOCL's LIBFLAME library (ILP64 version)
    ccall((:dgesv_, LIBFLAME_ILP64),
          Cvoid, # Return type is void (Fortran subroutines don't return values directly)
          (Ref{Int64},     # N (64-bit integer for ILP64)
           Ref{Int64},     # NRHS (64-bit integer for ILP64)
           Ptr{Float64},   # A (input/output)
           Ref{Int64},     # LDA (64-bit integer for ILP64)
           Ptr{Int64},     # IPIV (output, 64-bit in ILP64)
           Ptr{Float64},   # B (input/output)
           Ref{Int64},     # LDB (64-bit integer for ILP64)
           Ref{Int64}),    # INFO (output, 64-bit integer for ILP64)
          n,
          nrhs,
          A, # Pass the array directly for Ptr
          lda,
          ipiv, # Pass the array directly for Ptr
          B, # Pass the array directly for Ptr
          ldb,
          info)

    # Check the INFO value for errors
    if info[] < 0
        @error " DGESV (ILP64): Argument $(abs(info[])) had an illegal value."
    elseif info[] > 0
        @warn " DGESV (ILP64): U($(info[]),$(info[])) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed."
    end

    return A, B, ipiv, info[]
end


function _aocl_dgesv_lp64!(A::Matrix{Float64}, B::Matrix{Float64})
    # Check if the LP64 library was loaded
    if aocl_libflame_lp64_handle === nothing
        error("AOCL LP64 library not loaded. Cannot call _aocl_dgesv_lp64!.")
    end

    # Get dimensions
    n = size(A, 1)
    nrhs = size(B, 2)
    lda = n
    ldb = n

    # Allocate memory for pivot indices
    ipiv = Vector{Int32}(undef, n) # LAPACK's IPIV array typically uses Int32

    # INFO variable to store the return code from DGESV
    info = Ref{Int32}(0) # DGESV's INFO is typically an Int32 for LP64

    # Call DGESV from AOCL's LIBFLAME library (LP64 version)
    ccall((:dgesv_, LIBFLAME_LP64),
          Cvoid, # Return type is void
          (Ref{Int32},     # N (32-bit integer for LP64)
           Ref{Int32},     # NRHS (32-bit integer for LP64)
           Ptr{Float64},   # A (input/output)
           Ref{Int32},     # LDA (32-bit integer for LP64)
           Ptr{Int32},     # IPIV (output)
           Ptr{Float64},   # B (input/output)
           Ref{Int32},     # LDB (32-bit integer for LP64)
           Ref{Int32}),    # INFO (output, 32-bit integer for LP64)
          n,
          nrhs,
          A,
          lda,
          ipiv,
          B,
          ldb,
          info)

    # Check the INFO value for errors
    if info[] < 0
        @error " DGESV (LP64): Argument $(abs(info[])) had an illegal value."
    elseif info[] > 0
        @warn " DGESV (LP64): U($(info[]),$(info[])) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed."
    end

    return A, B, ipiv, info[]
end


function aocl_inverse(A::Matrix{Float64})
    n = size(A, 1)
    if size(A, 2) != n
        error("Input matrix must be square.")
    end

    # Create a copy of A (since dgesv! modifies it)
    A_copy = copy(A)
    # Create an identity matrix B, which will become the inverse
    B = Matrix{Float64}(I, n, n)

    # Automatically select between ILP64 and LP64 based on matrix dimension
    if n > typemax(Int32)
        # Use ILP64 for very large matrices
        println(" Using AOCL DGESV (ILP64) for matrix size $n.")
        A_lu, A_inv, ipiv, info = _aocl_dgesv_ilp64!(A_copy, B)
    else
        # Use LP64 for smaller matrices (or if ILP64 not loaded)
        # Prioritize LP64 if available, otherwise fall back to ILP64 if LP64 failed to load
        if aocl_libflame_lp64_handle !== nothing
            println(" Using AOCL DGESV (LP64) for matrix size $n.")
            A_lu, A_inv, ipiv, info = _aocl_dgesv_lp64!(A_copy, B)
        elseif aocl_libflame_ilp64_handle !== nothing
            println(" LP64 library not loaded or matrix too large for Int32. Falling back to ILP64 for matrix size $n.")
            A_lu, A_inv, ipiv, info = _aocl_dgesv_ilp64!(A_copy, B)
        else
            error("Neither AOCL LP64 nor ILP64 libraries are loaded. Cannot compute inverse.")
        end
    end

    # Return the inverse (which is now in B)
    return A_inv
end
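
# Usage sketch (hypothetical, not part of the benchmark): for a well-conditioned
# matrix, the product with the computed inverse should be close to the identity:
#   A = rand(500, 500) + 500I      # diagonally dominant, hence well-conditioned
#   norm(aocl_inverse(A) * A - I)  # should be small (~1e-12)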

# _mkl_dgesv!(A, B)
# Calls the DGESV routine from the MKL library.
# Arguments and returns are similar to AOCL dgesv functions.
function _mkl_dgesv!(A::Matrix{Float64}, B::Matrix{Float64})
    # Check if the MKL library was loaded
    if mkl_lib_handle === nothing
        error("MKL library not loaded. Cannot call _mkl_dgesv!.")
    end

    n = size(A, 1)
    nrhs = size(B, 2)
    lda = n
    ldb = n

    ipiv = Vector{Int32}(undef, n)
    info = Ref{Int32}(0) # MKL's DGESV typically uses Int32 for INFO in LP64

    # Call DGESV from MKL library (assuming LP64 interface by default)
    ccall((:dgesv_, LIBMKL),
          Cvoid,
          (Ref{Int32},     # N
           Ref{Int32},     # NRHS
           Ptr{Float64},   # A
           Ref{Int32},     # LDA
           Ptr{Int32},     # IPIV
           Ptr{Float64},   # B
           Ref{Int32},     # LDB
           Ref{Int32}),    # INFO
          n,
          nrhs,
          A,
          lda,
          ipiv,
          B,
          ldb,
          info)

    if info[] < 0
        @error "DGESV (MKL): Argument $(abs(info[])) had an illegal value."
    elseif info[] > 0
        @warn "DGESV (MKL): U($(info[]),$(info[])) is exactly zero. The factorization has been completed, but the factor U is exactly singular, so the solution could not be computed."
    end

    return A, B, ipiv, info[]
end

# mkl_inverse(A::Matrix{Float64})
# Computes the inverse of a square matrix `A` using MKL's DGESV routine via ccall.
# Arguments:
# - `A::Matrix{Float64}`: The matrix to invert.
# Returns:
# - `Matrix{Float64}`: The inverse of matrix `A`.
function mkl_inverse(A::Matrix{Float64})
    n = size(A, 1)
    if size(A, 2) != n
        error("Input matrix must be square.")
    end

    A_copy = copy(A)
    B = Matrix{Float64}(I, n, n)

    A_lu, A_inv, ipiv, info = _mkl_dgesv!(A_copy, B)

    return A_inv
end



function test_fpu()
    # Constants

    # Get command-line arguments (default bigsize = 2000 if not provided)
    bigsize = length(ARGS) >= 1 ? parse(Int, ARGS[1]) : 2000

    println("Benchmark running, hopefully as only ACTIVE task with matrix size = $bigsize")
    println("Current time: ", now())

    # Warm-up runs

    # Initialize random number pools
    Random.seed!(1234)  # Fixed seed for reproducibility
    pool3 = rand(bigsize, bigsize)
    
    # Working matrices
    a3 = zeros(bigsize, bigsize)

    # Timing results
    dt = zeros(10)
    
    for n in 1:10
        clock1 = time_ns()

        if n == 1
            # Test 1: native
            a3 = copy(pool3)
            a3 = inv(a3)  # invert a
            a3 = inv(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test1: Inv(A) native Julia       - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 2
            # Test 2: OpenBLAS DGETRF DGETRI Largesize
            a3 = copy(pool3)
            a3 = lapack_inverse(a3)  # invert a
            a3 = lapack_inverse(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            println(" Default BLAS configuration used: ",BLAS.get_config())
            @printf("Test2: Default DGETRF/I          - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 3
            # Test 3: OpenBLAS DGESV inversion Largesize
            a3 = copy(pool3)
            a3 = lapack_inverse2(a3)  # invert a
            a3 = lapack_inverse2(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test3: Default DGESV             - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        elseif n == 4
            # Test 4: AOCL DGESV inversion Largesize
            a3 = copy(pool3)
            a3 = aocl_inverse(a3)  # invert a
            a3 = aocl_inverse(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test4: AMD AOCL DGESV            - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)


        elseif n == 5
            # Test 5: MKL DGESV inversion Largesize
            a3 = copy(pool3)
            a3 = mkl_inverse(a3)  # invert a
            a3 = mkl_inverse(a3)  # invert back
            
            # Calculate error (compare to original)
            avg_err = sum(abs.(a3 .- pool3))/(bigsize*bigsize)
            clock2 = time_ns()
            dt[n] = (clock2 - clock1)/1e9
            @printf("Test5: IntelOneAPI MKL DGESV     - %d (%dx%d)  inverts in %.3f seconds  Err= %.16e\n",
                    2, bigsize, bigsize, dt[n], avg_err)

        end
    end
    
    @printf("                             Total = %.1f sec\n", sum(dt))
    println()
end



# Run the benchmark
test_fpu()

and the benchmark results were close.

C:\Temp>julia -t 24 -O3 julia_lapack_inv.jl 5000
Successfully loaded AOCL-LibFlame-Win-MT-dll.dll (ILP64).
Successfully loaded AOCL-LibFlame-Win-MT-dll.dll (LP64).
Successfully loaded MKL library.

BLAS Configuration:
BLAS vendor: lbt
BLAS.get_num_threads(): 24
BLAS backend libraries:
LBTConfig([ILP64] libopenblas64_.dll)
BLAS Vendor: lbt
BLAS Library Path Details:
  Interface: ilp64
  Path: C:\Users\arypr\AppData\Local\Programs\Julia-1.11.5\bin\libopenblas64_.dll

CPU Information (from Base.Sys) :
CPU Microarchitecture (Sys.CPU_NAME): znver2
Number of logical CPU threads (Sys.CPU_THREADS): 24
  Model (from first core): AMD Ryzen 9 3900X 12-Core Processor
Benchmark running, hopefully as only ACTIVE task with matrix size = 5000
Current time: 2025-05-21T23:58:07.752
Test1: Inv(A) native Julia       - 2 (5000x5000)  inverts in 2.777 seconds  Err= 1.0081402699866127e-11
 Default BLAS configuration used: LBTConfig([ILP64] libopenblas64_.dll)
Test2: Default DGETRF/I          - 2 (5000x5000)  inverts in 2.857 seconds  Err= 1.0081402699866127e-11
Test3: Default DGESV             - 2 (5000x5000)  inverts in 2.673 seconds  Err= 9.8338022082319472e-12
 Using AOCL DGESV (LP64) for matrix size 5000.
 Using AOCL DGESV (LP64) for matrix size 5000.
Test4: AMD AOCL DGESV            - 2 (5000x5000)  inverts in 2.182 seconds  Err= 1.6792224055342770e-11
Test5: IntelOneAPI MKL DGESV     - 2 (5000x5000)  inverts in 2.154 seconds  Err= 2.7218871004215922e-11
                             Total = 12.6 sec

One thing: the AMD folks are quite busy, but they will reach out to you. The fastest route is actually the technical-assistance request form at https://www.amd.com/en/forms/contact-us/support.html . I had contacted them because AMD released the AOCL library on Windows without releasing AOCC on Windows; I suggested AMD could perhaps build an AOCC for MinGW, analogous to the native-looking gcc/gfortran Windows builds from equation.com, or a Clang for Windows (unfortunately a Flang for Windows seems unlikely). Unfortunately I was busy with a big project when they asked to meet online.

In the case of DGEMM, the result is quite fascinating.

DGEMM GFLOPS Summary:
┌─────────────┬──────────┬──────────────┬────────────┬───────────┬───────────┐
│ Matrix Size │ Octavian │ Default BLAS │ AOCL (DLL) │ MKL (DLL) │ Peakflops │
├─────────────┼──────────┼──────────────┼────────────┼───────────┼───────────┤
│   128x128   │  230.30  │    24.84     │   189.31   │  122.89   │   28.00   │
│   256x256   │  449.30  │    77.58     │   147.46   │  221.55   │   70.54   │
│   512x512   │  467.50  │    172.39    │   361.68   │  227.01   │  153.46   │
│   640x640   │  444.55  │    208.44    │   375.75   │  378.26   │  152.27   │
│   768x768   │  421.31  │    301.05    │   392.33   │  410.50   │  188.03   │
│   896x896   │  357.17  │    257.41    │   440.58   │  278.41   │  202.51   │
│  1024x1024  │  316.05  │    286.42    │   469.72   │  246.86   │  192.62   │
│  2048x2048  │  339.41  │    260.60    │   466.76   │  283.62   │  271.34   │
│  4096x4096  │  383.93  │    260.55    │   489.17   │  362.43   │  321.95   │
└─────────────┴──────────┴──────────────┴────────────┴───────────┴───────────┘


As you can see, AOCL pretty much wins here, except at small matrix sizes where Octavian wins.
Here is the code.

# Julia Roofline and DGEMM Benchmark Script
#
# This script translates concepts from the provided roofline.f90:
# 1. A STREAM Benchmark (Copy, Scale, Add, Triad) for memory bandwidth,
#    with dynamic array sizing based on cache and comparison of single/multi-threaded performance.
# 2. A DGEMM (matrix multiplication) benchmark for GFLOPS, tested with:
#    - Octavian.jl for matrix multiplication.
#    - Julia's default BLAS (typically OpenBLAS).
#    - Intel MKL (via explicit DLL call).
#    - AOCL BLIS (via explicit DLL call).
#
# To run this script, you might need to install the following packages:
# using Pkg
# Pkg.add(["PrettyTables", "Octavian", "Statistics", "Hwloc"])

using LinearAlgebra
import Dates
using Printf
import Libdl # For explicit library loading and ccall
using PrettyTables # For formatting the final results table
using Octavian # For Octavian DGEMM
using Statistics # For mean()
using Hwloc # For cache size detection
using Base.Threads # For @threads macro and nthreads()

println("Julia STREAM Roofline and DGEMM Benchmark")
println("Timestamp: ", Dates.now())
println("Number of Julia threads available: ", Threads.nthreads()) # Total Julia threads
println("Number of CPU threads (Sys.CPU_THREADS): ", Sys.CPU_THREADS) # Logical cores
println("---------------------------------------------------\n")

# Define BLAS Integer type (typically Int64 for ILP64 interface)
const BlasInt = LinearAlgebra.BLAS.BlasInt
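# Note (assumption): BlasInt is Int64 on standard Julia builds (ILP64). The LP64
# DLLs called below strictly expect 32-bit integers; passing Int64 by reference
# happens to yield the right value on little-endian x86-64 for dimensions below
# 2^31, but Int32 would be the strictly correct argument type for those calls.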

# --- User-defined DLL Paths ---
# Ensure these paths are correct for your system.
const LIBMKL_PATH = "C:\\Users\\arypr\\.julia\\artifacts\\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\\bin\\mkl_rt.2.dll"
const LIBBLIS_ILP64_PATH =  "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\ILP64\\AOCL-LibBlis-Win-MT-dll.dll"
const LIBBLIS_LP64_PATH  =  "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\LP64\\AOCL-LibBlis-Win-MT-dll.dll" 

# Explicitly load the libraries to ensure they are found
global aocl_libblis_ilp64_handle = nothing
global aocl_libblis_lp64_handle = nothing
global mkl_lib_handle = nothing
# try loading external DLL AOCL and MKL
try
    global aocl_libblis_ilp64_handle = Libdl.dlopen(LIBBLIS_ILP64_PATH)
    println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (ILP64).")
catch e
    @error "Could not load AOCL-LibBlis-Win-MT-dll.dll (ILP64). Please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global aocl_libblis_lp64_handle = Libdl.dlopen(LIBBLIS_LP64_PATH)
    println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (LP64).")
catch e
    @warn "Could not load AOCL-LibBlis-Win-MT-dll.dll (LP64). If you intend to use only ILP64, this warning can be ignored. Otherwise, please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global mkl_lib_handle = Libdl.dlopen(LIBMKL_PATH)
    println("Successfully loaded MKL library : mkl_rt.2.dll")
catch e
    @error "Could not load MKL library. Please ensure the path is correct." exception=(e, catch_backtrace())
end


# --- Functions for Cache Detection and Array Sizing (from user) ---
function get_cpu_cache_sizes() # Renamed to avoid conflict if Hwloc has a similar name
    try
        # Hwloc.jl provides various ways to get cache info.
        # l[1-3]cache_sizes() usually return vectors of sizes for each cache of that level.
        # We take the maximum, assuming uniform cache sizes or wanting the largest.
        l1_sizes = Hwloc.l1cache_sizes()
        l2_sizes = Hwloc.l2cache_sizes()
        l3_sizes = Hwloc.l3cache_sizes()
        
        l1 = !isempty(l1_sizes) ? maximum(l1_sizes) : 0
        l2 = !isempty(l2_sizes) ? maximum(l2_sizes) : 0
        l3 = !isempty(l3_sizes) ? maximum(l3_sizes) : 0
        
        if l1 == 0 && l2 == 0 && l3 == 0
            @warn "Hwloc cache detection returned all zeros. Check Hwloc.jl functionality on this system."
            return nothing
        end
        return (L1=l1, L2=l2, L3=l3)
    catch e
        @warn "Hwloc Cache detection failed: $e. Hwloc.jl might not be installed or working correctly."
        return nothing
    end
end

function determine_stream_array_size(;default_min_elements=10_000_000, cache_multiplier=2.0)
    println("Determining STREAM array size...")
    cache_info = get_cpu_cache_sizes()
    
    if cache_info !== nothing && (cache_info.L1 > 0 || cache_info.L2 > 0 || cache_info.L3 > 0)
        largest_cache_bytes = max(cache_info.L1, cache_info.L2, cache_info.L3)
        # Number of Float64 elements to fill N times the largest cache
        # Each element is sizeof(Float64) bytes.
        # We need 3 such arrays for Triad, so to make each array N times cache,
        # total memory for 3 arrays will be 3 * N * cache.
        # The rule of thumb is that the total size of arrays should be >> cache.
        # If `largest_cache_bytes` is the size of L3, we want each array `a,b,c` to be large enough.
        # Let's aim for the total memory of the 3 arrays to be `cache_multiplier` times the largest cache.
        # So, 3 * elements_per_array * sizeof(Float64) = cache_multiplier * largest_cache_bytes
        # elements_per_array = (cache_multiplier * largest_cache_bytes) / (3 * sizeof(Float64))
        # A simpler rule: make each array at least `cache_multiplier` times the largest cache.
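        # Example with hypothetical numbers: a 64 MB L3 and cache_multiplier = 4
        # give ceil(4 * 64 * 1024^2 / 8) ≈ 33.6 million Float64 elements per array.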
        elements_for_cache_factor = ceil(Int, cache_multiplier * largest_cache_bytes / sizeof(Float64))
        
        recommended_size = max(default_min_elements, elements_for_cache_factor)
        
        println("Detected CPU cache sizes (max per level):")
        println("  L1: ", cache_info.L1 > 0 ? cache_info.L1 รท 1024 : "N/A", " KB")
        println("  L2: ", cache_info.L2 > 0 ? cache_info.L2 รท 1024 : "N/A", " KB")
        println("  L3: ", cache_info.L3 > 0 ? cache_info.L3 รท (1024^2) : "N/A", " MB")
        total_memory_for_recommended = 3 * recommended_size * sizeof(Float64) / (1024^2)
        println("Recommended STREAM array size for each of a,b,c: ", recommended_size, " elements.")
        println("  (aiming for each array to be about ", Printf.@sprintf("%.2f", cache_multiplier), "x largest cache, or default min)")
        println("  This will use approx. ", Printf.@sprintf("%.2f", total_memory_for_recommended), " MB for 3 arrays.")
        return recommended_size
    end
    @warn "Using default STREAM array size: $default_min_elements elements."
    return default_min_elements
end


function set_optimal_threads(; max_threads=nothing)
    # Get system information
    total_threads = Sys.CPU_THREADS
    logical_cores = Sys.CPU_THREADS
    
    # Apply maximum threads limit if specified
    available_threads = total_threads
    if !isnothing(max_threads) && max_threads > 0
        available_threads = min(total_threads, max_threads)
    end
    
    # Set MKL threads
    if !haskey(ENV, "MKL_NUM_THREADS")
        ENV["MKL_NUM_THREADS"] = string(available_threads)
        LinearAlgebra.BLAS.set_num_threads(available_threads)
    end
    
    # Set OMP threads
    if !haskey(ENV, "OMP_NUM_THREADS")
        ENV["OMP_NUM_THREADS"] = string(available_threads)
    end
    
    # Print configuration report
    println("\nThread Configuration:")
    println("Available threads: ", available_threads)
    println("MKL_NUM_THREADS: ", ENV["MKL_NUM_THREADS"])
    println("OMP_NUM_THREADS: ", get(ENV, "OMP_NUM_THREADS", "not set"))
    println("Active BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())
    
    # Return the settings - using more compatible syntax
    omp_value = tryparse(Int, get(ENV, "OMP_NUM_THREADS", "0"))
    omp_threads = omp_value === nothing ? 0 : omp_value
    
    return (
        logical_cores = logical_cores,
        blas_threads = LinearAlgebra.BLAS.get_num_threads(),
        mkl_threads = parse(Int, ENV["MKL_NUM_THREADS"]),
        omp_threads = omp_threads
    )
end


# --- STREAM Benchmark (Single and Multi-threaded) ---
function stream_benchmark_single_or_multi(; array_size::Int, ntimes::Int, use_threads::Bool)
    thread_str = use_threads ? "Multi-threaded ($(Threads.nthreads()) Julia threads)" : "Single-threaded"
    println("\nRunning STREAM Benchmark ($thread_str)...")

    total_mb_per_array = sizeof(Float64) * array_size / 1e6
    total_mb_all_arrays = 3 * total_mb_per_array
    println("Array size = ", array_size, " (", Printf.@sprintf("%.2f", total_mb_per_array), " MB per array)")
    println("Allocating ", Printf.@sprintf("%.2f", total_mb_all_arrays), " MB total for 3 arrays (a, b, c)")

    a = zeros(Float64, array_size)
    b = zeros(Float64, array_size)
    c = zeros(Float64, array_size)
    scalar = 3.0

    # Warmup (use threads if multi-threading is enabled for benchmark)
    if use_threads
        @inbounds @threads for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
    else
        @inbounds for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
    end

    # Define ST and MT versions of kernels
    function copy_st!(out, inp) @inbounds for i in eachindex(out); out[i] = inp[i]; end; end
    function scale_st!(out, inp, s) @inbounds for i in eachindex(out); out[i] = s * inp[i]; end; end
    function add_st!(out, inp1, inp2) @inbounds for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
    function triad_st!(out, inp1, inp2, s) @inbounds for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end

    function copy_mt!(out, inp) @inbounds @threads for i in eachindex(out); out[i] = inp[i]; end; end
    function scale_mt!(out, inp, s) @inbounds @threads for i in eachindex(out); out[i] = s * inp[i]; end; end
    function add_mt!(out, inp1, inp2) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
    function triad_mt!(out, inp1, inp2, s) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end

    _copy!  = use_threads ? copy_mt!  : copy_st!
    _scale! = use_threads ? scale_mt! : scale_st!
    _add!   = use_threads ? add_mt!   : add_st!
    _triad! = use_threads ? triad_mt! : triad_st!
    
    times = Dict("Copy" => zeros(ntimes), "Scale" => zeros(ntimes), "Add" => zeros(ntimes), "Triad" => zeros(ntimes))
    
    for trial in 1:ntimes
        # Standard STREAM sequence:
        # 1. c = a (Copy)
        # 2. b = scalar * c (Scale)
        # 3. c = a + b (Add)
        # 4. a = b + scalar * c (Triad)
        # To ensure inputs are somewhat consistent for each op type across trials,
        # we could re-init a,b,c here, but let's follow the typical sequential dependency.
        # For the very first trial, a,b,c are from the warmup.
        # For subsequent trials, they hold results from the previous trial's last operation.
        # This is common in how STREAM is often run.

        times["Copy"][trial]  = @elapsed _copy!(c, a)
        times["Scale"][trial] = @elapsed _scale!(b, c, scalar)
        times["Add"][trial]   = @elapsed _add!(c, a, b)
        times["Triad"][trial] = @elapsed _triad!(a, b, c, scalar)
    end
    
    bytes_per_array_val = sizeof(Float64) * array_size # Renamed to avoid conflict
    bytes_transferred = Dict(
        "Copy" => 2 * bytes_per_array_val, "Scale" => 2 * bytes_per_array_val,
        "Add" => 3 * bytes_per_array_val,  "Triad" => 3 * bytes_per_array_val
    )
    
    results = Dict{String, NamedTuple{(:MBs, :GBs, :AvgTimeMs), Tuple{Float64, Float64, Float64}}}()
    slice_range = ntimes == 1 ? (1:1) : (2:ntimes) # Exclude first run if ntimes > 1

    for name in ["Copy", "Scale", "Add", "Triad"]
        avg_time_s = ntimes == 1 ? times[name][1] : mean(times[name][slice_range])
        if avg_time_s > 0
            mbs = (bytes_transferred[name] / avg_time_s) / 1e6
            gbs = mbs / 1000.0
            results[name] = (MBs=mbs, GBs=gbs, AvgTimeMs=avg_time_s * 1000)
        else
            results[name] = (MBs=NaN, GBs=NaN, AvgTimeMs=NaN)
        end
    end
    
    println("STREAM Results ($thread_str):")
    println("Number of trials = ", ntimes, (ntimes > 1 ? " (first trial excluded from averages)" : ""))
    println("--------------------------------------------------------------------")
    @printf("%-10s %12s %12s %15s\n", "Function", "MB/s", "GB/s", "Avg Time (ms)")
    println("--------------------------------------------------------------------")
    for name in ["Copy", "Scale", "Add", "Triad"]
        res = results[name]
        @printf("%-10s %12.2f %12.2f %15.4f\n", name, res.MBs, res.GBs, res.AvgTimeMs)
    end
    println("--------------------------------------------------------------------\n")
    
    a = b = c = nothing; GC.gc()
    return results # Return detailed results
end


# --- DGEMM Implementations ---

# 1. Octavian DGEMM: C = alpha*A*B + beta*C
function octavian_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    Octavian.matmul!(C, A, B, alpha, beta)
    return C
end

# 2. Wrapper for LinearAlgebra.mul! (used for Default BLAS)
function default_blas_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    # LinearAlgebra.mul!(C, A, B, alpha, beta)
    LinearAlgebra.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
    return C
end

# 3. DGEMM via explicit MKL DLL call
function mkl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    M, N_ = size(C)
    K = size(A, 2)
    if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end

    transa = 'N'
    transb = 'N'
    m = BlasInt(M)
    n = BlasInt(N_)
    k = BlasInt(K)
    lda = BlasInt(max(1, stride(A,2)))
    ldb = BlasInt(max(1, stride(B,2)))
    ldc = BlasInt(max(1, stride(C,2)))
    
    alpha_ref = Ref(alpha)
    beta_ref = Ref(beta)

    handle = C_NULL
    try
        handle = Libdl.dlopen(LIBMKL_PATH)
        dgemm_sym = Libdl.dlsym(handle, :dgemm) 

        ccall(dgemm_sym, Cvoid,
              (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}),
              transa, transb, m, n, k,
              alpha_ref, A, lda, B, ldb,
              beta_ref, C, ldc)
    catch e
        println("Error calling MKL dgemm via DLL: $e")
        fill!(C, NaN)
    finally
        if handle != C_NULL
            Libdl.dlclose(handle)
        end
    end
    return C
end

# 4. DGEMM via explicit AOCL (LibBLIS) DLL call
function aocl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    M, N_ = size(C)
    K = size(A, 2)
    if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end

    transa = 'N'
    transb = 'N'
    m = BlasInt(M)
    n = BlasInt(N_)
    k = BlasInt(K)
    lda = BlasInt(max(1, stride(A,2)))
    ldb = BlasInt(max(1, stride(B,2)))
    ldc = BlasInt(max(1, stride(C,2)))

    alpha_ref = Ref(alpha)
    beta_ref = Ref(beta)
    
    handle = C_NULL
    try
        # Automatically select between ILP64 and LP64 based on matrix dimension
        if n > typemax(Int32)
            # println(" Using AOCL DGEMM (ILP64) for matrix size $n.")
            handle = Libdl.dlopen(LIBBLIS_ILP64_PATH)
        else
            # println(" Using AOCL DGEMM (LP64) for matrix size $n.")
            handle = Libdl.dlopen(LIBBLIS_LP64_PATH)
        end
        dgemm_sym = Libdl.dlsym(handle, :dgemm) 

        ccall(dgemm_sym, Cvoid,
              (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}),
              transa, transb, m, n, k,
              alpha_ref, A, lda, B, ldb,
              beta_ref, C, ldc)
    catch e
        println("Error calling AOCL dgemm via DLL: $e")
        fill!(C, NaN)
    finally
        if handle != C_NULL
            Libdl.dlclose(handle)
        end
    end
    return C
end


# --- DGEMM Benchmark Routine ---
function run_dgemm_benchmark(dgemm_func!::Function, backend_name::String, matrix_sizes::Vector{Int}, results_dict::Dict)
    println("Running DGEMM Benchmark for: ", backend_name)
    NTIMES_DGEMM = 5 
    REP_DGEMM_INNER = 16
    alpha_val = 1.0 
    beta_val = 0.0  

    println("  Size |   GFLOPS | Intensity (FLOPs/Byte)")
    println("-------|----------|------------------------")

    for s_idx in eachindex(matrix_sizes)
        m_dim = matrix_sizes[s_idx] 
        n_dim = m_dim               
        p_dim = m_dim               

        A_mat = rand(Float64, m_dim, p_dim) 
        B_mat = rand(Float64, p_dim, n_dim) 
        C_mat = zeros(Float64, m_dim, n_dim)
        
        try 
            dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val) # Warm-up
        catch e
            println("Warm-up failed for $backend_name, size $m_dim: $e")
            if !haskey(results_dict, m_dim); results_dict[m_dim] = Dict{String, Float64}(); end
            results_dict[m_dim][backend_name] = NaN
            continue 
        end

        t_min_local_s = Inf 
        valid_run = true
        for iter in 1:NTIMES_DGEMM
            t_start_ns = time_ns()
            for rep_idx in 1:REP_DGEMM_INNER 
                try
                    dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val)
                catch e
                    println("Error during $backend_name DGEMM (size $m_dim, iter $rep_idx): $e") 
                    t_min_local_s = NaN 
                    valid_run = false
                    break 
                end
            end
            if !valid_run; break; end
            t_end_ns = time_ns()
            
            elapsed_s_per_rep = ((t_end_ns - t_start_ns) / 1e9) / REP_DGEMM_INNER
            if !isnan(t_min_local_s) 
                t_min_local_s = min(t_min_local_s, elapsed_s_per_rep)
            end
        end

        gflops = NaN
        intensity = NaN

        if valid_run && !isnan(t_min_local_s) && t_min_local_s > 0 && isfinite(t_min_local_s)
            flops_val = 2.0 * m_dim * n_dim * p_dim 
            bytes_val = (m_dim*p_dim + p_dim*n_dim + m_dim*n_dim) * sizeof(Float64) 
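            # Worked example: for n = 4096, flops_val = 2 * 4096^3 ≈ 1.37e11, so
            # the ~489 GFLOPS in the AOCL summary above corresponds to ~0.28 s per call.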
            
            gflops = (flops_val / t_min_local_s) / 1.0e9
            intensity = flops_val / bytes_val
        end

        Printf.@printf(" %5d | %8.2f | %22.3f\n", m_dim, gflops, intensity)
        
        if !haskey(results_dict, m_dim)
            results_dict[m_dim] = Dict{String, Float64}()
        end
        results_dict[m_dim][backend_name] = gflops
    end
    println("---------------------------------------------------\n")
    A_mat = B_mat = C_mat = nothing 
    GC.gc()
end

# --- Main Execution ---
function main()
    # --- STREAM Benchmark Section ---
    # Determine array size based on cache or use default
    # Pass this determined size to both single and multi-threaded benchmarks
    # Set ntimes for STREAM benchmark
    stream_ntimes = 10 # Number of trials for STREAM
    
    # Try to load Hwloc and determine array size, otherwise use a default.
    # This needs to be done outside the function if the function has a default arg that calls it.
    # Or, make determine_stream_array_size callable with a flag to suppress prints if called for default.
    # For now, let's call it once here.
    effective_stream_array_size = determine_stream_array_size(default_min_elements=20_000_000, cache_multiplier=4.0) # Default 20M elements, aim for 4x L3

    println("\n--- Running Single-Threaded STREAM Benchmark ---")
    stream_results_st = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=false)
    
    println("\n--- Running Multi-Threaded STREAM Benchmark ---")
    stream_results_mt = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=true)

    # Print Combined STREAM Results Table
    println("\nCombined STREAM Benchmark Memory Bandwidth Summary (GB/s):")
    stream_header = ["Operation", "Single-Threaded (GB/s)", "Multi-Threaded (GB/s)"]
    stream_table_data = Matrix{Any}(undef, 4, 3)
    stream_ops = ["Copy", "Scale", "Add", "Triad"]
    for (i, op_name) in enumerate(stream_ops)
        stream_table_data[i, 1] = op_name
        stream_table_data[i, 2] = Printf.@sprintf("%.2f", get(stream_results_st, op_name, (GBs=NaN,)).GBs)
        stream_table_data[i, 3] = Printf.@sprintf("%.2f", get(stream_results_mt, op_name, (GBs=NaN,)).GBs)
    end
    pretty_table(stream_table_data, header=stream_header, alignment=:c)
    println("---------------------------------------------------\n")


    # --- DGEMM Benchmark Section ---
    matrix_sizes = [128, 256, 512, 640, 768, 896, 1024, 2048, 4096]
    dgemm_results = Dict{Int, Dict{String, Float64}}() 

    # 1. Octavian DGEMM
    println("Ensuring Octavian is using available Julia threads for its operations.")
    run_dgemm_benchmark(octavian_dgemm!, "Octavian", matrix_sizes, dgemm_results)

    # 2. Default BLAS (likely OpenBLAS)
    println("Current BLAS vendor before explicit loads: ", LinearAlgebra.BLAS.vendor())
    println("BLAS config: ", LinearAlgebra.BLAS.get_config())
    LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())
    println("Number of BLAS threads (BLAS.get_num_threads()): ", LinearAlgebra.BLAS.get_num_threads())
    run_dgemm_benchmark(default_blas_dgemm!, "Default BLAS", matrix_sizes, dgemm_results)

    # 3. MKL DGEMM via explicit DLL call
    set_optimal_threads()
    run_dgemm_benchmark(mkl_explicit_dgemm!, "MKL (DLL)", matrix_sizes, dgemm_results)
    
    # 4. AOCL (LibBLIS) DGEMM via explicit DLL call
    set_optimal_threads()
    run_dgemm_benchmark(aocl_explicit_dgemm!, "AOCL (DLL)", matrix_sizes, dgemm_results)

    # 5. LinearAlgebra.peakflops benchmark for each matrix size
    println("\nRunning LinearAlgebra.peakflops Benchmark for various sizes...")
    for m_size in matrix_sizes
        # LinearAlgebra.peakflops returns the rate in FLOPs/second; divide by 1e9 for GFLOPS.
        peak_gflops_raw = try
            LinearAlgebra.peakflops(m_size; eltype=Float64, ntrials=3, parallel=false)
        catch e
            @warn "LinearAlgebra.peakflops failed for size $m_size: $e"
            NaN
        end
        # Convert the FLOPs/s rate to GFLOPS for consistent display.
        peak_gflops = peak_gflops_raw / 1.0e9
        Printf.@printf("Peak GFLOPS (n=%5d): %8.2f GFLOPS\n", m_size, peak_gflops)

        if !haskey(dgemm_results, m_size)
            dgemm_results[m_size] = Dict{String, Float64}()
        end
        dgemm_results[m_size]["Peakflops"] = peak_gflops
    end
    println("-----------------------------------------")

    
    println("\nDGEMM GFLOPS Summary:")
    
    header = ["Matrix Size", "Octavian", "Default BLAS", "AOCL (DLL)", "MKL (DLL)", "Peakflops"]
    data_table = Matrix{Any}(undef, length(matrix_sizes), length(header))

    for (i, m_size) in enumerate(matrix_sizes)
        data_table[i, 1] = string(m_size, "x", m_size)
        results_for_size = get(dgemm_results, m_size, Dict{String, Float64}())
        data_table[i, 2] = Printf.@sprintf("%.2f", get(results_for_size, "Octavian", NaN))
        data_table[i, 3] = Printf.@sprintf("%.2f", get(results_for_size, "Default BLAS", NaN))
        data_table[i, 4] = Printf.@sprintf("%.2f", get(results_for_size, "AOCL (DLL)", NaN))
        data_table[i, 5] = Printf.@sprintf("%.2f", get(results_for_size, "MKL (DLL)", NaN))
        data_table[i, 6] = Printf.@sprintf("%.2f", get(results_for_size, "Peakflops", NaN)) # Add Peakflops column
    end

    pretty_table(data_table, header=header, alignment=:c)
    
    println("\nBenchmark Complete.")
end

main()
