AOCL (not MKL) acceleration on AMD Ryzen CPUs

In the case of DGEMM, the result is quite fascinating.

DGEMM GFLOPS Summary:
┌─────────────┬──────────┬──────────────┬────────────┬───────────┬───────────┐
│ Matrix Size │ Octavian │ Default BLAS │ AOCL (DLL) │ MKL (DLL) │ Peakflops │
├─────────────┼──────────┼──────────────┼────────────┼───────────┼───────────┤
│   128x128   │  230.30  │    24.84     │   189.31   │  122.89   │   28.00   │
│   256x256   │  449.30  │    77.58     │   147.46   │  221.55   │   70.54   │
│   512x512   │  467.50  │    172.39    │   361.68   │  227.01   │  153.46   │
│   640x640   │  444.55  │    208.44    │   375.75   │  378.26   │  152.27   │
│   768x768   │  421.31  │    301.05    │   392.33   │  410.50   │  188.03   │
│   896x896   │  357.17  │    257.41    │   440.58   │  278.41   │  202.51   │
│  1024x1024  │  316.05  │    286.42    │   469.72   │  246.86   │  192.62   │
│  2048x2048  │  339.41  │    260.60    │   466.76   │  283.62   │  271.34   │
│  4096x4096  │  383.93  │    260.55    │   489.17   │  362.43   │  321.95   │
└─────────────┴──────────┴──────────────┴────────────┴───────────┴───────────┘


As you can see, AOCL is the clear winner here, except at small matrix sizes, where Octavian wins.
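
For scale: a 4096x4096 DGEMM performs 2 * 4096^3 ≈ 137.4 GFLOP, so AOCL's 489.17 GFLOPS works out to roughly 0.28 s per multiply, versus about 0.53 s for the default BLAS.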
Here is the code.

# Julia Roofline and DGEMM Benchmark Script
#
# This script translates concepts from the provided roofline.f90:
# 1. A STREAM Benchmark (Copy, Scale, Add, Triad) for memory bandwidth,
#    with dynamic array sizing based on cache and comparison of single/multi-threaded performance.
# 2. A DGEMM (matrix multiplication) benchmark for GFLOPS, tested with:
#    - Octavian.jl for matrix multiplication.
#    - Julia's default BLAS (typically OpenBLAS).
#    - Intel MKL (via explicit DLL call).
#    - AOCL BLIS (via explicit DLL call).
#
# To run this script, you might need to install the following packages:
# using Pkg
# Pkg.add(["PrettyTables", "Octavian", "Statistics", "Hwloc"])
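#
# The multi-threaded STREAM kernels and Octavian use Julia's native threads,
# so start Julia with threading enabled, e.g. (filename illustrative):
#   julia -t auto roofline_bench.jl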

using LinearAlgebra
import Dates
using Printf
import Libdl # For explicit library loading and ccall
using PrettyTables # For formatting the final results table
using Octavian # For Octavian DGEMM
using Statistics # For mean()
using Hwloc # For cache size detection
using Base.Threads # For @threads macro and nthreads()

println("Julia STREAM Roofline and DGEMM Benchmark")
println("Timestamp: ", Dates.now())
println("Number of Julia threads available: ", Threads.nthreads()) # Total Julia threads
println("Number of CPU threads (Sys.CPU_THREADS): ", Sys.CPU_THREADS) # Logical cores
println("---------------------------------------------------\n")

# Define BLAS Integer type (typically Int64 for ILP64 interface)
const BlasInt = LinearAlgebra.BLAS.BlasInt

# --- User-defined DLL Paths ---
# Ensure these paths are correct for your system.
const LIBMKL_PATH = "C:\\Users\\arypr\\.julia\\artifacts\\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\\bin\\mkl_rt.2.dll"
const LIBBLIS_ILP64_PATH =  "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\ILP64\\AOCL-LibBlis-Win-MT-dll.dll"
const LIBBLIS_LP64_PATH  =  "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\LP64\\AOCL-LibBlis-Win-MT-dll.dll" 

# Explicitly load the libraries to ensure they are found
global aocl_libblis_ilp64_handle = nothing
global aocl_libblis_lp64_handle = nothing
global mkl_lib_handle = nothing
# Try loading the external AOCL and MKL DLLs
try
    global aocl_libblis_ilp64_handle = Libdl.dlopen(LIBBLIS_ILP64_PATH)
    println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (ILP64).")
catch e
    @error "Could not load AOCL-LibBlis-Win-MT-dll.dll (ILP64). Please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global aocl_libblis_lp64_handle = Libdl.dlopen(LIBBLIS_LP64_PATH)
    println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (LP64).")
catch e
    @warn "Could not load AOCL-LibBlis-Win-MT-dll.dll (LP64). If you intend to use only ILP64, this warning can be ignored. Otherwise, please ensure the path is correct." exception=(e, catch_backtrace())
end

try
    global mkl_lib_handle = Libdl.dlopen(LIBMKL_PATH)
    println("Successfully loaded MKL library : mkl_rt.2.dll")
catch e
    @error "Could not load MKL library. Please ensure the path is correct." exception=(e, catch_backtrace())
end
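
# Optional sanity check (a minimal sketch): confirm a `dgemm` symbol resolves in
# each loaded library before benchmarking. Libdl.dlsym_e returns C_NULL instead
# of throwing when a symbol is missing.
for (lib_name, h) in (("AOCL ILP64", aocl_libblis_ilp64_handle),
                      ("AOCL LP64", aocl_libblis_lp64_handle),
                      ("MKL", mkl_lib_handle))
    if h !== nothing && Libdl.dlsym_e(h, :dgemm) == C_NULL
        @warn "$lib_name loaded, but no `dgemm` symbol was found in it."
    end
end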


# --- Functions for Cache Detection and Array Sizing (from user) ---
function get_cpu_cache_sizes() # Renamed to avoid conflict if Hwloc has a similar name
    try
        # Hwloc.jl provides various ways to get cache info.
        # l[1-3]cache_sizes() usually return vectors of sizes for each cache of that level.
        # We take the maximum, assuming uniform cache sizes or wanting the largest.
        l1_sizes = Hwloc.l1cache_sizes()
        l2_sizes = Hwloc.l2cache_sizes()
        l3_sizes = Hwloc.l3cache_sizes()
        
        l1 = !isempty(l1_sizes) ? maximum(l1_sizes) : 0
        l2 = !isempty(l2_sizes) ? maximum(l2_sizes) : 0
        l3 = !isempty(l3_sizes) ? maximum(l3_sizes) : 0
        
        if l1 == 0 && l2 == 0 && l3 == 0
            @warn "Hwloc cache detection returned all zeros. Check Hwloc.jl functionality on this system."
            return nothing
        end
        return (L1=l1, L2=l2, L3=l3)
    catch e
        @warn "Hwloc Cache detection failed: $e. Hwloc.jl might not be installed or working correctly."
        return nothing
    end
end

function determine_stream_array_size(;default_min_elements=10_000_000, cache_multiplier=2.0)
    println("Determining STREAM array size...")
    cache_info = get_cpu_cache_sizes()
    
    if cache_info !== nothing && (cache_info.L1 > 0 || cache_info.L2 > 0 || cache_info.L3 > 0)
        largest_cache_bytes = max(cache_info.L1, cache_info.L2, cache_info.L3)
        # Size each Float64 array to `cache_multiplier` times the largest
        # detected cache. With three arrays (a, b, c), the working set then far
        # exceeds cache capacity, so STREAM measures memory bandwidth rather
        # than cache reuse.
        elements_for_cache_factor = ceil(Int, cache_multiplier * largest_cache_bytes / sizeof(Float64))
        
        recommended_size = max(default_min_elements, elements_for_cache_factor)
        
        println("Detected CPU cache sizes (max per level):")
        println("  L1: ", cache_info.L1 > 0 ? cache_info.L1 ÷ 1024 : "N/A", " KB")
        println("  L2: ", cache_info.L2 > 0 ? cache_info.L2 ÷ 1024 : "N/A", " KB")
        println("  L3: ", cache_info.L3 > 0 ? cache_info.L3 ÷ (1024^2) : "N/A", " MB")
        total_memory_for_recommended = 3 * recommended_size * sizeof(Float64) / (1024^2)
        println("Recommended STREAM array size for each of a,b,c: ", recommended_size, " elements.")
        println("  (aiming for each array to be about ", Printf.@sprintf("%.2f", cache_multiplier), "x largest cache, or default min)")
        println("  This will use approx. ", Printf.@sprintf("%.2f", total_memory_for_recommended), " MB for 3 arrays.")
        return recommended_size
    end
    @warn "Using default STREAM array size: $default_min_elements elements."
    return default_min_elements
end
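
# Worked example: with a 32 MB L3 and cache_multiplier = 4.0, each array gets
# ceil(4 * 32 * 2^20 / 8) = 16_777_216 Float64 elements (128 MB), so the three
# STREAM arrays together occupy about 384 MB, far beyond any cache level.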


function set_optimal_threads(; max_threads=nothing)
    # Get system information
    total_threads = Sys.CPU_THREADS
    logical_cores = Sys.CPU_THREADS
    
    # Apply maximum threads limit if specified
    available_threads = total_threads
    if !isnothing(max_threads) && max_threads > 0
        available_threads = min(total_threads, max_threads)
    end
    
    # Set MKL threads
    if !haskey(ENV, "MKL_NUM_THREADS")
        ENV["MKL_NUM_THREADS"] = string(available_threads)
        LinearAlgebra.BLAS.set_num_threads(available_threads)
    end
    
    # Set OMP threads
    if !haskey(ENV, "OMP_NUM_THREADS")
        ENV["OMP_NUM_THREADS"] = string(available_threads)
    end
    
    # Print configuration report
    println("\nThread Configuration:")
    println("Available threads: ", available_threads)
    println("MKL_NUM_THREADS: ", ENV["MKL_NUM_THREADS"])
    println("OMP_NUM_THREADS: ", get(ENV, "OMP_NUM_THREADS", "not set"))
    println("Active BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())
    
    # Return the settings - using more compatible syntax
    omp_value = tryparse(Int, get(ENV, "OMP_NUM_THREADS", "0"))
    omp_threads = omp_value === nothing ? 0 : omp_value
    
    return (
        logical_cores = logical_cores,
        blas_threads = LinearAlgebra.BLAS.get_num_threads(),
        mkl_threads = parse(Int, ENV["MKL_NUM_THREADS"]),
        omp_threads = omp_threads
    )
end
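
# Caveat (hedged): many libraries read environment variables only once, during
# initialization, so setting MKL_NUM_THREADS after mkl_rt is already loaded may
# have no effect. mkl_rt also exports a runtime thread setter that could be
# called directly; a minimal sketch, assuming mkl_lib_handle was loaded above:
#
#   sym = Libdl.dlsym_e(mkl_lib_handle, :MKL_Set_Num_Threads)
#   sym != C_NULL && ccall(sym, Cvoid, (Cint,), Sys.CPU_THREADS)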


# --- STREAM Benchmark (Single and Multi-threaded) ---
function stream_benchmark_single_or_multi(; array_size::Int, ntimes::Int, use_threads::Bool)
    thread_str = use_threads ? "Multi-threaded ($(Threads.nthreads()) Julia threads)" : "Single-threaded"
    println("\nRunning STREAM Benchmark ($thread_str)...")

    total_mb_per_array = sizeof(Float64) * array_size / 1e6
    total_mb_all_arrays = 3 * total_mb_per_array
    println("Array size = ", array_size, " (", Printf.@sprintf("%.2f", total_mb_per_array), " MB per array)")
    println("Allocating ", Printf.@sprintf("%.2f", total_mb_all_arrays), " MB total for 3 arrays (a, b, c)")

    a = zeros(Float64, array_size)
    b = zeros(Float64, array_size)
    c = zeros(Float64, array_size)
    scalar = 3.0

    # Warmup (use threads if multi-threading is enabled for benchmark)
    if use_threads
        @inbounds @threads for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
    else
        @inbounds for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
    end

    # Define ST and MT versions of kernels
    function copy_st!(out, inp) @inbounds for i in eachindex(out); out[i] = inp[i]; end; end
    function scale_st!(out, inp, s) @inbounds for i in eachindex(out); out[i] = s * inp[i]; end; end
    function add_st!(out, inp1, inp2) @inbounds for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
    function triad_st!(out, inp1, inp2, s) @inbounds for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end

    function copy_mt!(out, inp) @inbounds @threads for i in eachindex(out); out[i] = inp[i]; end; end
    function scale_mt!(out, inp, s) @inbounds @threads for i in eachindex(out); out[i] = s * inp[i]; end; end
    function add_mt!(out, inp1, inp2) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
    function triad_mt!(out, inp1, inp2, s) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end

    _copy!  = use_threads ? copy_mt!  : copy_st!
    _scale! = use_threads ? scale_mt! : scale_st!
    _add!   = use_threads ? add_mt!   : add_st!
    _triad! = use_threads ? triad_mt! : triad_st!
    
    times = Dict("Copy" => zeros(ntimes), "Scale" => zeros(ntimes), "Add" => zeros(ntimes), "Triad" => zeros(ntimes))
    
    for trial in 1:ntimes
        # Standard STREAM sequence (each kernel feeds the next):
        #   Copy:  c = a
        #   Scale: b = scalar * c
        #   Add:   c = a + b
        #   Triad: a = b + scalar * c
        # Arrays carry results from one trial into the next, which is how
        # STREAM is typically run; the first trial starts from the warmup values.

        times["Copy"][trial]  = @elapsed _copy!(c, a)
        times["Scale"][trial] = @elapsed _scale!(b, c, scalar)
        times["Add"][trial]   = @elapsed _add!(c, a, b)
        times["Triad"][trial] = @elapsed _triad!(a, b, c, scalar)
    end
    
    bytes_per_array_val = sizeof(Float64) * array_size # Renamed to avoid conflict
    bytes_transferred = Dict(
        "Copy" => 2 * bytes_per_array_val, "Scale" => 2 * bytes_per_array_val,
        "Add" => 3 * bytes_per_array_val,  "Triad" => 3 * bytes_per_array_val
    )
    
    results = Dict{String, NamedTuple{(:MBs, :GBs, :AvgTimeMs), Tuple{Float64, Float64, Float64}}}()
    slice_range = ntimes == 1 ? (1:1) : (2:ntimes) # Exclude first run if ntimes > 1

    for name in ["Copy", "Scale", "Add", "Triad"]
        avg_time_s = ntimes == 1 ? times[name][1] : mean(times[name][slice_range])
        if avg_time_s > 0
            mbs = (bytes_transferred[name] / avg_time_s) / 1e6
            gbs = mbs / 1000.0
            results[name] = (MBs=mbs, GBs=gbs, AvgTimeMs=avg_time_s * 1000)
        else
            results[name] = (MBs=NaN, GBs=NaN, AvgTimeMs=NaN)
        end
    end
    
    println("STREAM Results ($thread_str):")
    println("Number of trials = ", ntimes, (ntimes > 1 ? " (first trial excluded from averages)" : ""))
    println("--------------------------------------------------------------------")
    @printf("%-10s %12s %12s %15s\n", "Function", "MB/s", "GB/s", "Avg Time (ms)")
    println("--------------------------------------------------------------------")
    for name in ["Copy", "Scale", "Add", "Triad"]
        res = results[name]
        @printf("%-10s %12.2f %12.2f %15.4f\n", name, res.MBs, res.GBs, res.AvgTimeMs)
    end
    println("--------------------------------------------------------------------\n")
    
    a = b = c = nothing; GC.gc()
    return results # Return detailed results
end
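
# Bandwidth accounting: Copy and Scale touch two arrays per element (one read,
# one write); Add and Triad touch three. For example, Triad over 20_000_000
# Float64 elements moves 3 * 8 * 2e7 = 480 MB per pass, so a 10 ms pass
# corresponds to 48 GB/s.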


# --- DGEMM Implementations ---

# 1. Octavian DGEMM: C = alpha*A*B + beta*C
function octavian_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    Octavian.matmul!(C, A, B, alpha, beta)
    return C
end

# 2. Wrapper for LinearAlgebra.mul! (used for Default BLAS)
function default_blas_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    # LinearAlgebra.mul!(C, A, B, alpha, beta)
    LinearAlgebra.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
    return C
end

# 3. DGEMM via explicit MKL DLL call
function mkl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    M, N_ = size(C)
    K = size(A, 2)
    if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end

    transa = 'N'
    transb = 'N'
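
    # Caveat (hedged): BlasInt is Int64 on standard 64-bit Julia builds, while
    # mkl_rt's default interface is typically LP64 (32-bit integers). The call
    # below therefore relies on every dimension fitting in Int32, which holds
    # for the sizes benchmarked here.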
    m = BlasInt(M)
    n = BlasInt(N_)
    k = BlasInt(K)
    lda = BlasInt(max(1, stride(A,2)))
    ldb = BlasInt(max(1, stride(B,2)))
    ldc = BlasInt(max(1, stride(C,2)))
    
    alpha_ref = Ref(alpha)
    beta_ref = Ref(beta)

    handle = C_NULL
    try
        handle = Libdl.dlopen(LIBMKL_PATH)
        dgemm_sym = Libdl.dlsym(handle, :dgemm) 

        ccall(dgemm_sym, Cvoid,
              (Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
               Ref{Float64}, Ptr{Float64}, Ref{BlasInt}),
              transa, transb, m, n, k,
              alpha_ref, A, lda, B, ldb,
              beta_ref, C, ldc)
    catch e
        println("Error calling MKL dgemm via DLL: $e")
        fill!(C, NaN)
    finally
        if handle != C_NULL
            Libdl.dlclose(handle)
        end
    end
    return C
end
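
# Performance note: the dlopen/dlclose pair above runs on every call, which
# penalizes small matrices in particular. A minimal sketch that instead reuses
# the handle loaded at startup (assumes mkl_lib_handle !== nothing):
#
#   mkl_dgemm_ptr = Libdl.dlsym(mkl_lib_handle, :dgemm)
#   # ...then pass mkl_dgemm_ptr as the first argument to ccall.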

# 4. DGEMM via explicit AOCL (LibBLIS) DLL call
function aocl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
    M, N_ = size(C)
    K = size(A, 2)
    if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end

    transa = 'N'
    transb = 'N'

    alpha_ref = Ref(alpha)
    beta_ref = Ref(beta)

    handle = C_NULL
    try
        # Select ILP64 (64-bit integers) only when a dimension exceeds the
        # Int32 range; otherwise use LP64 (32-bit integers). The integer width
        # passed through ccall must match the interface of the selected
        # library, so the two branches below use Int64 and Int32 respectively.
        # (Passing 64-bit ints to the LP64 interface only works by accident on
        # little-endian machines while the values fit in 32 bits.)
        use_ilp64 = max(M, N_, K) > typemax(Int32)
        handle = Libdl.dlopen(use_ilp64 ? LIBBLIS_ILP64_PATH : LIBBLIS_LP64_PATH)
        dgemm_sym = Libdl.dlsym(handle, :dgemm)

        lda = max(1, stride(A,2)); ldb = max(1, stride(B,2)); ldc = max(1, stride(C,2))
        if use_ilp64
            ccall(dgemm_sym, Cvoid,
                  (Ref{UInt8}, Ref{UInt8}, Ref{Int64}, Ref{Int64}, Ref{Int64},
                   Ref{Float64}, Ptr{Float64}, Ref{Int64}, Ptr{Float64}, Ref{Int64},
                   Ref{Float64}, Ptr{Float64}, Ref{Int64}),
                  transa, transb, Int64(M), Int64(N_), Int64(K),
                  alpha_ref, A, Int64(lda), B, Int64(ldb),
                  beta_ref, C, Int64(ldc))
        else
            ccall(dgemm_sym, Cvoid,
                  (Ref{UInt8}, Ref{UInt8}, Ref{Int32}, Ref{Int32}, Ref{Int32},
                   Ref{Float64}, Ptr{Float64}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
                   Ref{Float64}, Ptr{Float64}, Ref{Int32}),
                  transa, transb, Int32(M), Int32(N_), Int32(K),
                  alpha_ref, A, Int32(lda), B, Int32(ldb),
                  beta_ref, C, Int32(ldc))
        end
    catch e
        println("Error calling AOCL dgemm via DLL: $e")
        fill!(C, NaN)
    finally
        if handle != C_NULL
            Libdl.dlclose(handle)
        end
    end
    return C
end


# --- DGEMM Benchmark Routine ---
function run_dgemm_benchmark(dgemm_func!::Function, backend_name::String, matrix_sizes::Vector{Int}, results_dict::Dict)
    println("Running DGEMM Benchmark for: ", backend_name)
    NTIMES_DGEMM = 5 
    REP_DGEMM_INNER = 16
    alpha_val = 1.0 
    beta_val = 0.0  

    println("  Size |   GFLOPS | Intensity (FLOPs/Byte)")
    println("-------|----------|------------------------")

    for s_idx in eachindex(matrix_sizes)
        m_dim = matrix_sizes[s_idx] 
        n_dim = m_dim               
        p_dim = m_dim               

        A_mat = rand(Float64, m_dim, p_dim) 
        B_mat = rand(Float64, p_dim, n_dim) 
        C_mat = zeros(Float64, m_dim, n_dim)
        
        try 
            dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val) # Warm-up
        catch e
            println("Warm-up failed for $backend_name, size $m_dim: $e")
            if !haskey(results_dict, m_dim); results_dict[m_dim] = Dict{String, Float64}(); end
            results_dict[m_dim][backend_name] = NaN
            continue 
        end

        t_min_local_s = Inf 
        valid_run = true
        for iter in 1:NTIMES_DGEMM
            t_start_ns = time_ns()
            for rep_idx in 1:REP_DGEMM_INNER 
                try
                    dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val)
                catch e
                    println("Error during $backend_name DGEMM (size $m_dim, iter $rep_idx): $e") 
                    t_min_local_s = NaN 
                    valid_run = false
                    break 
                end
            end
            if !valid_run; break; end
            t_end_ns = time_ns()
            
            elapsed_s_per_rep = ((t_end_ns - t_start_ns) / 1e9) / REP_DGEMM_INNER
            if !isnan(t_min_local_s) 
                t_min_local_s = min(t_min_local_s, elapsed_s_per_rep)
            end
        end

        gflops = NaN
        intensity = NaN

        if valid_run && !isnan(t_min_local_s) && t_min_local_s > 0 && isfinite(t_min_local_s)
            flops_val = 2.0 * m_dim * n_dim * p_dim 
            bytes_val = (m_dim*p_dim + p_dim*n_dim + m_dim*n_dim) * sizeof(Float64) 
            
            gflops = (flops_val / t_min_local_s) / 1.0e9
            intensity = flops_val / bytes_val
        end

        Printf.@printf(" %5d | %8.2f | %22.3f\n", m_dim, gflops, intensity)
        
        if !haskey(results_dict, m_dim)
            results_dict[m_dim] = Dict{String, Float64}()
        end
        results_dict[m_dim][backend_name] = gflops
    end
    println("---------------------------------------------------\n")
    GC.gc() # A_mat, B_mat, C_mat are loop-local and already unreachable here
end
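
# Arithmetic-intensity check: for square n-by-n matrices the model above gives
# intensity = 2n^3 / (3n^2 * 8) = n/12 FLOPs per byte, e.g. n = 4096 gives
# 4096/12 ≈ 341.3.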

# --- Main Execution ---
function main()
    # --- STREAM Benchmark Section ---
    stream_ntimes = 10 # Number of trials for STREAM

    # Determine the array size once (cache-based via Hwloc when available,
    # otherwise the default) and reuse it for both the single- and
    # multi-threaded runs.
    effective_stream_array_size = determine_stream_array_size(default_min_elements=20_000_000, cache_multiplier=4.0) # Default 20M elements, aim for 4x L3

    println("\n--- Running Single-Threaded STREAM Benchmark ---")
    stream_results_st = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=false)
    
    println("\n--- Running Multi-Threaded STREAM Benchmark ---")
    stream_results_mt = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=true)

    # Print Combined STREAM Results Table
    println("\nCombined STREAM Benchmark Memory Bandwidth Summary (GB/s):")
    stream_header = ["Operation", "Single-Threaded (GB/s)", "Multi-Threaded (GB/s)"]
    stream_table_data = Matrix{Any}(undef, 4, 3)
    stream_ops = ["Copy", "Scale", "Add", "Triad"]
    for (i, op_name) in enumerate(stream_ops)
        stream_table_data[i, 1] = op_name
        stream_table_data[i, 2] = Printf.@sprintf("%.2f", get(stream_results_st, op_name, (GBs=NaN,)).GBs)
        stream_table_data[i, 3] = Printf.@sprintf("%.2f", get(stream_results_mt, op_name, (GBs=NaN,)).GBs)
    end
    pretty_table(stream_table_data, header=stream_header, alignment=:c)
    println("---------------------------------------------------\n")


    # --- DGEMM Benchmark Section ---
    matrix_sizes = [128, 256, 512, 640, 768, 896, 1024, 2048, 4096]
    dgemm_results = Dict{Int, Dict{String, Float64}}() 

    # 1. Octavian DGEMM
    println("Ensuring Octavian is using available Julia threads for its operations.")
    run_dgemm_benchmark(octavian_dgemm!, "Octavian", matrix_sizes, dgemm_results)

    # 2. Default BLAS (likely OpenBLAS)
    println("Current BLAS vendor before explicit loads: ", LinearAlgebra.BLAS.vendor())
    println("BLAS config: ", LinearAlgebra.BLAS.get_config())
    LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())
    println("Number of BLAS threads (BLAS.get_num_threads()): ", LinearAlgebra.BLAS.get_num_threads())
    run_dgemm_benchmark(default_blas_dgemm!, "Default BLAS", matrix_sizes, dgemm_results)

    # 3. MKL DGEMM via explicit DLL call
    set_optimal_threads()
    run_dgemm_benchmark(mkl_explicit_dgemm!, "MKL (DLL)", matrix_sizes, dgemm_results)
    
    # 4. AOCL (LibBLIS) DGEMM via explicit DLL call
    set_optimal_threads()
    run_dgemm_benchmark(aocl_explicit_dgemm!, "AOCL (DLL)", matrix_sizes, dgemm_results)

    # 5. LinearAlgebra.peakflops benchmark for each matrix size
    println("\nRunning LinearAlgebra.peakflops Benchmark for various sizes...")
    for m_size in matrix_sizes
        # LinearAlgebra.peakflops returns the measured rate in FLOPs per
        # second, so divide by 1e9 to obtain GFLOPS.
        peak_flops_raw = try
            LinearAlgebra.peakflops(m_size; eltype=Float64, ntrials=3, parallel=false)
        catch e
            @warn "LinearAlgebra.peakflops failed for size $m_size: $e"
            NaN
        end
        peak_gflops = peak_flops_raw / 1.0e9
        Printf.@printf("Peak GFLOPS (n=%5d): %8.2f GFLOPS\n", m_size, peak_gflops)

        if !haskey(dgemm_results, m_size)
            dgemm_results[m_size] = Dict{String, Float64}()
        end
        dgemm_results[m_size]["Peakflops"] = peak_gflops
    end
    println("-----------------------------------------")

    
    println("\nDGEMM GFLOPS Summary:")
    
    header = ["Matrix Size", "Octavian", "Default BLAS", "AOCL (DLL)", "MKL (DLL)", "Peakflops"]
    data_table = Matrix{Any}(undef, length(matrix_sizes), length(header))

    for (i, m_size) in enumerate(matrix_sizes)
        data_table[i, 1] = string(m_size, "x", m_size)
        results_for_size = get(dgemm_results, m_size, Dict{String, Float64}())
        data_table[i, 2] = Printf.@sprintf("%.2f", get(results_for_size, "Octavian", NaN))
        data_table[i, 3] = Printf.@sprintf("%.2f", get(results_for_size, "Default BLAS", NaN))
        data_table[i, 4] = Printf.@sprintf("%.2f", get(results_for_size, "AOCL (DLL)", NaN))
        data_table[i, 5] = Printf.@sprintf("%.2f", get(results_for_size, "MKL (DLL)", NaN))
        data_table[i, 6] = Printf.@sprintf("%.2f", get(results_for_size, "Peakflops", NaN)) # Add Peakflops column
    end

    pretty_table(data_table, header=header, alignment=:c)
    
    println("\nBenchmark Complete.")
end

main()
