In the case of DGEMM, the result is quite fascinating.
DGEMM GFLOPS Summary:
┌─────────────┬──────────┬──────────────┬────────────┬───────────┬───────────┐
│ Matrix Size │ Octavian │ Default BLAS │ AOCL (DLL) │ MKL (DLL) │ Peakflops │
├─────────────┼──────────┼──────────────┼────────────┼───────────┼───────────┤
│   128x128   │  230.30  │    24.84     │   189.31   │  122.89   │   28.00   │
│   256x256   │  449.30  │    77.58     │   147.46   │  221.55   │   70.54   │
│   512x512   │  467.50  │    172.39    │   361.68   │  227.01   │  153.46   │
│   640x640   │  444.55  │    208.44    │   375.75   │  378.26   │  152.27   │
│   768x768   │  421.31  │    301.05    │   392.33   │  410.50   │  188.03   │
│   896x896   │  357.17  │    257.41    │   440.58   │  278.41   │  202.51   │
│  1024x1024  │  316.05  │    286.42    │   469.72   │  246.86   │  192.62   │
│  2048x2048  │  339.41  │    260.60    │   466.76   │  283.62   │  271.34   │
│  4096x4096  │  383.93  │    260.55    │   489.17   │  362.43   │  321.95   │
└─────────────┴──────────┴──────────────┴────────────┴───────────┴───────────┘
As you can see, AOCL is the clear winner here, except at small matrix sizes, where Octavian wins.
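For reference, the GFLOPS column follows the standard DGEMM accounting used later in the script: an n x n multiply performs 2n^3 floating-point operations. A minimal sketch of the conversion (the timing value is hypothetical, chosen to match the 4096 AOCL entry):

n = 4096                    # square matrices, as in the table
flops = 2.0 * n^3           # multiply-add counted as 2 FLOPs
t = 0.281                   # hypothetical best wall time in seconds
gflops = flops / t / 1e9    # ≈ 489 GFLOPS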
Here is the code.
# Julia Roofline and DGEMM Benchmark Script
#
# This script translates concepts from the provided roofline.f90:
# 1. A STREAM Benchmark (Copy, Scale, Add, Triad) for memory bandwidth,
# with dynamic array sizing based on cache and comparison of single/multi-threaded performance.
# 2. A DGEMM (matrix multiplication) benchmark for GFLOPS, tested with:
# - Octavian.jl for matrix multiplication.
# - Julia's default BLAS (typically OpenBLAS).
# - Intel MKL (via explicit DLL call).
# - AOCL BLIS (via explicit DLL call).
#
# To run this script, you might need to install the following packages:
# using Pkg
# Pkg.add(["PrettyTables", "Octavian", "Statistics", "Hwloc"])
using LinearAlgebra
import Dates
using Printf
import Libdl # For explicit library loading and ccall
using PrettyTables # For formatting the final results table
using Octavian # For Octavian DGEMM
using Statistics # For mean()
using Hwloc # For cache size detection
using Base.Threads # For @threads macro and nthreads()
println("Julia STREAM Roofline and DGEMM Benchmark")
println("Timestamp: ", Dates.now())
println("Number of Julia threads available: ", Threads.nthreads()) # Total Julia threads
println("Number of CPU threads (Sys.CPU_THREADS): ", Sys.CPU_THREADS) # Logical cores
println("---------------------------------------------------\n")
# Define BLAS Integer type (typically Int64 for ILP64 interface)
const BlasInt = LinearAlgebra.BLAS.BlasInt
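# Note (assumption about the local setup): Julia's bundled BLAS uses 64-bit
# integers (ILP64), so BlasInt is Int64 here. The external LP64 DLLs below
# expect 32-bit integers instead; see the comments in aocl_explicit_dgemm!
# for how that mismatch plays out.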
# --- User-defined DLL Paths ---
# Ensure these paths are correct for your system.
const LIBMKL_PATH = "C:\\Users\\arypr\\.julia\\artifacts\\ddee95fd48c91e0aadc2aad6f311ae4ccdbbb01e\\bin\\mkl_rt.2.dll"
const LIBBLIS_ILP64_PATH = "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\ILP64\\AOCL-LibBlis-Win-MT-dll.dll"
const LIBBLIS_LP64_PATH = "C:\\Program Files\\AMD\\AOCL-Windows\\amd-blis\\lib\\LP64\\AOCL-LibBlis-Win-MT-dll.dll"
# Explicitly load the libraries to ensure they are found
global aocl_libblis_ilp64_handle = nothing
global aocl_libblis_lp64_handle = nothing
global mkl_lib_handle = nothing
# try loading external DLL AOCL and MKL
try
global aocl_libblis_ilp64_handle = Libdl.dlopen(LIBBLIS_ILP64_PATH)
println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (ILP64).")
catch e
@error "Could not load AOCL-LibBlis-Win-MT-dll.dll (ILP64). Please ensure the path is correct." exception=(e, catch_backtrace())
end
try
global aocl_libblis_lp64_handle = Libdl.dlopen(LIBBLIS_LP64_PATH)
println("Successfully loaded AOCL-LibBlis-Win-MT-dll.dll (LP64).")
catch e
@warn "Could not load AOCL-LibBlis-Win-MT-dll.dll (LP64). If you intend to use only ILP64, this warning can be ignored. Otherwise, please ensure the path is correct." exception=(e, catch_backtrace())
end
try
global mkl_lib_handle = Libdl.dlopen(LIBMKL_PATH)
println("Successfully loaded MKL library : mkl_rt.2.dll")
catch e
@error "Could not load MKL library. Please ensure the path is correct." exception=(e, catch_backtrace())
end
# --- Functions for Cache Detection and Array Sizing (from user) ---
function get_cpu_cache_sizes() # Renamed to avoid conflict if Hwloc has a similar name
try
# Hwloc.jl provides various ways to get cache info.
# l[1-3]cache_sizes() usually return vectors of sizes for each cache of that level.
# We take the maximum, assuming uniform cache sizes or wanting the largest.
l1_sizes = Hwloc.l1cache_sizes()
l2_sizes = Hwloc.l2cache_sizes()
l3_sizes = Hwloc.l3cache_sizes()
l1 = !isempty(l1_sizes) ? maximum(l1_sizes) : 0
l2 = !isempty(l2_sizes) ? maximum(l2_sizes) : 0
l3 = !isempty(l3_sizes) ? maximum(l3_sizes) : 0
if l1 == 0 && l2 == 0 && l3 == 0
@warn "Hwloc cache detection returned all zeros. Check Hwloc.jl functionality on this system."
return nothing
end
return (L1=l1, L2=l2, L3=l3)
catch e
@warn "Hwloc Cache detection failed: $e. Hwloc.jl might not be installed or working correctly."
return nothing
end
end
function determine_stream_array_size(;default_min_elements=10_000_000, cache_multiplier=2.0)
println("Determining STREAM array size...")
cache_info = get_cpu_cache_sizes()
if cache_info !== nothing && (cache_info.L1 > 0 || cache_info.L2 > 0 || cache_info.L3 > 0)
largest_cache_bytes = max(cache_info.L1, cache_info.L2, cache_info.L3)
# Each Float64 element is 8 bytes. To keep STREAM out of cache, each of the
# three arrays (a, b, c) should be much larger than the largest cache level,
# so size each array to `cache_multiplier` times the largest detected cache.
elements_for_cache_factor = ceil(Int, cache_multiplier * largest_cache_bytes / sizeof(Float64))
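# Worked example (illustrative numbers): with a 32 MiB L3 and
# cache_multiplier = 4, each array gets ceil(4 * 32 * 2^20 / 8) = 16_777_216
# elements (128 MiB), so the three arrays together occupy 384 MiB, far larger
# than the cache.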
recommended_size = max(default_min_elements, elements_for_cache_factor)
println("Detected CPU cache sizes (max per level):")
println(" L1: ", cache_info.L1 > 0 ? cache_info.L1 รท 1024 : "N/A", " KB")
println(" L2: ", cache_info.L2 > 0 ? cache_info.L2 รท 1024 : "N/A", " KB")
println(" L3: ", cache_info.L3 > 0 ? cache_info.L3 รท (1024^2) : "N/A", " MB")
total_memory_for_recommended = 3 * recommended_size * sizeof(Float64) / (1024^2)
println("Recommended STREAM array size for each of a,b,c: ", recommended_size, " elements.")
println(" (aiming for each array to be about ", Printf.@sprintf("%.2f", cache_multiplier), "x largest cache, or default min)")
println(" This will use approx. ", Printf.@sprintf("%.2f", total_memory_for_recommended), " MB for 3 arrays.")
return recommended_size
end
@warn "Using default STREAM array size: $default_min_elements elements."
return default_min_elements
end
function set_optimal_threads(; max_threads=nothing)
# Get system information
total_threads = Sys.CPU_THREADS
logical_cores = Sys.CPU_THREADS
# Apply maximum threads limit if specified
available_threads = total_threads
if !isnothing(max_threads) && max_threads > 0
available_threads = min(total_threads, max_threads)
end
# Set MKL threads
if !haskey(ENV, "MKL_NUM_THREADS")
ENV["MKL_NUM_THREADS"] = string(available_threads)
LinearAlgebra.BLAS.set_num_threads(available_threads)
end
# Set OMP threads
if !haskey(ENV, "OMP_NUM_THREADS")
ENV["OMP_NUM_THREADS"] = string(available_threads)
end
# Print configuration report
println("\nThread Configuration:")
println("Available threads: ", available_threads)
println("MKL_NUM_THREADS: ", ENV["MKL_NUM_THREADS"])
println("OMP_NUM_THREADS: ", get(ENV, "OMP_NUM_THREADS", "not set"))
println("Active BLAS threads: ", LinearAlgebra.BLAS.get_num_threads())
# Return the settings - using more compatible syntax
omp_value = tryparse(Int, get(ENV, "OMP_NUM_THREADS", "0"))
omp_threads = omp_value === nothing ? 0 : omp_value
return (
logical_cores = logical_cores,
blas_threads = LinearAlgebra.BLAS.get_num_threads(),
mkl_threads = parse(Int, ENV["MKL_NUM_THREADS"]),
omp_threads = omp_threads
)
end
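# Example usage (illustrative): set_optimal_threads(max_threads=8) caps
# MKL/OMP at 8 threads, unless MKL_NUM_THREADS/OMP_NUM_THREADS are already
# set in the environment; with no argument it uses all logical cores.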
# --- STREAM Benchmark (Single and Multi-threaded) ---
function stream_benchmark_single_or_multi(; array_size::Int, ntimes::Int, use_threads::Bool)
thread_str = use_threads ? "Multi-threaded ($(Threads.nthreads()) Julia threads)" : "Single-threaded"
println("\nRunning STREAM Benchmark ($thread_str)...")
total_mb_per_array = sizeof(Float64) * array_size / 1e6
total_mb_all_arrays = 3 * total_mb_per_array
println("Array size = ", array_size, " (", Printf.@sprintf("%.2f", total_mb_per_array), " MB per array)")
println("Allocating ", Printf.@sprintf("%.2f", total_mb_all_arrays), " MB total for 3 arrays (a, b, c)")
a = zeros(Float64, array_size)
b = zeros(Float64, array_size)
c = zeros(Float64, array_size)
scalar = 3.0
# Warmup (use threads if multi-threading is enabled for benchmark)
if use_threads
@inbounds @threads for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
else
@inbounds for i in 1:array_size; a[i]=1.0; b[i]=2.0; c[i]=0.0; end
end
# Define ST and MT versions of kernels
function copy_st!(out, inp) @inbounds for i in eachindex(out); out[i] = inp[i]; end; end
function scale_st!(out, inp, s) @inbounds for i in eachindex(out); out[i] = s * inp[i]; end; end
function add_st!(out, inp1, inp2) @inbounds for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
function triad_st!(out, inp1, inp2, s) @inbounds for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end
function copy_mt!(out, inp) @inbounds @threads for i in eachindex(out); out[i] = inp[i]; end; end
function scale_mt!(out, inp, s) @inbounds @threads for i in eachindex(out); out[i] = s * inp[i]; end; end
function add_mt!(out, inp1, inp2) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + inp2[i]; end; end
function triad_mt!(out, inp1, inp2, s) @inbounds @threads for i in eachindex(out); out[i] = inp1[i] + s * inp2[i]; end; end
_copy! = use_threads ? copy_mt! : copy_st!
_scale! = use_threads ? scale_mt! : scale_st!
_add! = use_threads ? add_mt! : add_st!
_triad! = use_threads ? triad_mt! : triad_st!
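# Note: _copy! etc. are type-unstable bindings (one of two possible closures),
# so each call dispatches dynamically; that per-trial cost is negligible next
# to streaming millions of elements.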
times = Dict("Copy" => zeros(ntimes), "Scale" => zeros(ntimes), "Add" => zeros(ntimes), "Triad" => zeros(ntimes))
for trial in 1:ntimes
# Standard STREAM sequence:
# 1. c = a (Copy)
# 2. b = scalar * c (Scale)
# 3. c = a + b (Add)
# 4. a = b + scalar * c (Triad)
# To ensure inputs are somewhat consistent for each op type across trials,
# we could re-init a,b,c here, but let's follow the typical sequential dependency.
# For the very first trial, a,b,c are from the warmup.
# For subsequent trials, they hold results from the previous trial's last operation.
# This is common in how STREAM is often run.
times["Copy"][trial] = @elapsed _copy!(c, a)
times["Scale"][trial] = @elapsed _scale!(b, c, scalar)
times["Add"][trial] = @elapsed _add!(c, a, b)
times["Triad"][trial] = @elapsed _triad!(a, b, c, scalar)
end
bytes_per_array_val = sizeof(Float64) * array_size # Renamed to avoid conflict
bytes_transferred = Dict(
"Copy" => 2 * bytes_per_array_val, "Scale" => 2 * bytes_per_array_val,
"Add" => 3 * bytes_per_array_val, "Triad" => 3 * bytes_per_array_val
)
results = Dict{String, NamedTuple{(:MBs, :GBs, :AvgTimeMs), Tuple{Float64, Float64, Float64}}}()
slice_range = ntimes == 1 ? (1:1) : (2:ntimes) # Exclude first run if ntimes > 1
for name in ["Copy", "Scale", "Add", "Triad"]
avg_time_s = ntimes == 1 ? times[name][1] : mean(times[name][slice_range])
if avg_time_s > 0
mbs = (bytes_transferred[name] / avg_time_s) / 1e6
gbs = mbs / 1000.0
results[name] = (MBs=mbs, GBs=gbs, AvgTimeMs=avg_time_s * 1000)
else
results[name] = (MBs=NaN, GBs=NaN, AvgTimeMs=NaN)
end
end
println("STREAM Results ($thread_str):")
println("Number of trials = ", ntimes, (ntimes > 1 ? " (first trial excluded from averages)" : ""))
println("--------------------------------------------------------------------")
@printf("%-10s %12s %12s %15s\n", "Function", "MB/s", "GB/s", "Avg Time (ms)")
println("--------------------------------------------------------------------")
for name in ["Copy", "Scale", "Add", "Triad"]
res = results[name]
@printf("%-10s %12.2f %12.2f %15.4f\n", name, res.MBs, res.GBs, res.AvgTimeMs)
end
println("--------------------------------------------------------------------\n")
a = b = c = nothing; GC.gc()
return results # Return detailed results
end
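# Roofline note: given Triad bandwidth B (GB/s) and DGEMM peak P (GFLOPS), the
# ridge point of the roofline lies at P/B FLOPs/byte. An n x n DGEMM performs
# 2n^3 FLOPs over roughly 3n^2 * 8 bytes of unique data, i.e. an intensity of
# about n/12 FLOPs/byte, well above typical ridge points for all but the
# smallest sizes benchmarked here.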
# --- DGEMM Implementations ---
# 1. Octavian DGEMM: C = alpha*A*B + beta*C
function octavian_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
Octavian.matmul!(C, A, B, alpha, beta)
return C
end
# 2. Wrapper for LinearAlgebra.mul! (used for Default BLAS)
function default_blas_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
# Call BLAS gemm! directly; for these types this is equivalent to
# LinearAlgebra.mul!(C, A, B, alpha, beta).
LinearAlgebra.BLAS.gemm!('N', 'N', alpha, A, B, beta, C)
return C
end
# 3. DGEMM via explicit MKL DLL call
function mkl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
M, N_ = size(C)
K = size(A, 2)
if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end
transa = 'N'
transb = 'N'
m = BlasInt(M)
n = BlasInt(N_)
k = BlasInt(K)
lda = BlasInt(max(1, stride(A,2)))
ldb = BlasInt(max(1, stride(B,2)))
ldc = BlasInt(max(1, stride(C,2)))
alpha_ref = Ref(alpha)
beta_ref = Ref(beta)
handle = C_NULL
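# Note: dlopen/dlclose on every call adds overhead; the global mkl_lib_handle
# opened at startup could be reused instead. For small matrices this per-call
# cost can noticeably depress the measured GFLOPS.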
try
handle = Libdl.dlopen(LIBMKL_PATH)
dgemm_sym = Libdl.dlsym(handle, :dgemm)
ccall(dgemm_sym, Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
Ref{Float64}, Ptr{Float64}, Ref{BlasInt}),
transa, transb, m, n, k,
alpha_ref, A, lda, B, ldb,
beta_ref, C, ldc)
catch e
println("Error calling MKL dgemm via DLL: $e")
fill!(C, NaN)
finally
if handle != C_NULL
Libdl.dlclose(handle)
end
end
return C
end
# 4. DGEMM via explicit AOCL (LibBLIS) DLL call
function aocl_explicit_dgemm!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64}, alpha::Float64, beta::Float64)
M, N_ = size(C)
K = size(A, 2)
if !(size(A,1) == M && size(B,1) == K && size(B,2) == N_) error("Dimension mismatch") end
transa = 'N'
transb = 'N'
m = BlasInt(M)
n = BlasInt(N_)
k = BlasInt(K)
lda = BlasInt(max(1, stride(A,2)))
ldb = BlasInt(max(1, stride(B,2)))
ldc = BlasInt(max(1, stride(C,2)))
alpha_ref = Ref(alpha)
beta_ref = Ref(beta)
handle = C_NULL
try
# Automatically select between ILP64 and LP64 based on matrix dimension
if n > typemax(Int32)
# println(" Using AOCL DGEMM (ILP64) for matrix size $n.")
handle = Libdl.dlopen(LIBBLIS_ILP64_PATH)
else
# println(" Using AOCL DGEMM (LP64) for matrix size $n.")
handle = Libdl.dlopen(LIBBLIS_LP64_PATH)
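# Caveat: the ccall below still passes Ref{BlasInt} (Int64) on this LP64 path.
# An LP64 library reads only the low 32 bits of each integer argument, which
# happens to be correct for small positive sizes on little-endian x86, but
# Int32 arguments would be the strictly correct choice here.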
end
dgemm_sym = Libdl.dlsym(handle, :dgemm)
ccall(dgemm_sym, Cvoid,
(Ref{UInt8}, Ref{UInt8}, Ref{BlasInt}, Ref{BlasInt}, Ref{BlasInt},
Ref{Float64}, Ptr{Float64}, Ref{BlasInt}, Ptr{Float64}, Ref{BlasInt},
Ref{Float64}, Ptr{Float64}, Ref{BlasInt}),
transa, transb, m, n, k,
alpha_ref, A, lda, B, ldb,
beta_ref, C, ldc)
catch e
println("Error calling AOCL dgemm via DLL: $e")
fill!(C, NaN)
finally
if handle != C_NULL
Libdl.dlclose(handle)
end
end
return C
end
# --- DGEMM Benchmark Routine ---
function run_dgemm_benchmark(dgemm_func!::Function, backend_name::String, matrix_sizes::Vector{Int}, results_dict::Dict)
println("Running DGEMM Benchmark for: ", backend_name)
NTIMES_DGEMM = 5
REP_DGEMM_INNER = 16
alpha_val = 1.0
beta_val = 0.0
println(" Size | GFLOPS | Intensity (FLOPs/Byte)")
println("-------|----------|------------------------")
for s_idx in eachindex(matrix_sizes)
m_dim = matrix_sizes[s_idx]
n_dim = m_dim
p_dim = m_dim
A_mat = rand(Float64, m_dim, p_dim)
B_mat = rand(Float64, p_dim, n_dim)
C_mat = zeros(Float64, m_dim, n_dim)
try
dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val) # Warm-up
catch e
println("Warm-up failed for $backend_name, size $m_dim: $e")
if !haskey(results_dict, m_dim); results_dict[m_dim] = Dict{String, Float64}(); end
results_dict[m_dim][backend_name] = NaN
continue
end
t_min_local_s = Inf
valid_run = true
for iter in 1:NTIMES_DGEMM
t_start_ns = time_ns()
for rep_idx in 1:REP_DGEMM_INNER
try
dgemm_func!(C_mat, A_mat, B_mat, alpha_val, beta_val)
catch e
println("Error during $backend_name DGEMM (size $m_dim, iter $rep_idx): $e")
t_min_local_s = NaN
valid_run = false
break
end
end
if !valid_run; break; end
t_end_ns = time_ns()
elapsed_s_per_rep = ((t_end_ns - t_start_ns) / 1e9) / REP_DGEMM_INNER
if !isnan(t_min_local_s)
t_min_local_s = min(t_min_local_s, elapsed_s_per_rep)
end
end
gflops = NaN
intensity = NaN
if valid_run && !isnan(t_min_local_s) && t_min_local_s > 0 && isfinite(t_min_local_s)
flops_val = 2.0 * m_dim * n_dim * p_dim
bytes_val = (m_dim*p_dim + p_dim*n_dim + m_dim*n_dim) * sizeof(Float64)
gflops = (flops_val / t_min_local_s) / 1.0e9
intensity = flops_val / bytes_val
end
Printf.@printf(" %5d | %8.2f | %22.3f\n", m_dim, gflops, intensity)
if !haskey(results_dict, m_dim)
results_dict[m_dim] = Dict{String, Float64}()
end
results_dict[m_dim][backend_name] = gflops
end
println("---------------------------------------------------\n")
A_mat = B_mat = C_mat = nothing
GC.gc()
end
# --- Main Execution ---
function main()
# --- STREAM Benchmark Section ---
# Determine array size based on cache or use default
# Pass this determined size to both single and multi-threaded benchmarks
# Set ntimes for STREAM benchmark
stream_ntimes = 10 # Number of trials for STREAM
# Determine the STREAM array size once here (cache-based via Hwloc when it
# works, otherwise the default) and reuse it for both runs below.
effective_stream_array_size = determine_stream_array_size(default_min_elements=20_000_000, cache_multiplier=4.0) # Default 20M elements, aim for 4x L3
println("\n--- Running Single-Threaded STREAM Benchmark ---")
stream_results_st = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=false)
println("\n--- Running Multi-Threaded STREAM Benchmark ---")
stream_results_mt = stream_benchmark_single_or_multi(array_size=effective_stream_array_size, ntimes=stream_ntimes, use_threads=true)
# Print Combined STREAM Results Table
println("\nCombined STREAM Benchmark Memory Bandwidth Summary (GB/s):")
stream_header = ["Operation", "Single-Threaded (GB/s)", "Multi-Threaded (GB/s)"]
stream_table_data = Matrix{Any}(undef, 4, 3)
stream_ops = ["Copy", "Scale", "Add", "Triad"]
for (i, op_name) in enumerate(stream_ops)
stream_table_data[i, 1] = op_name
stream_table_data[i, 2] = Printf.@sprintf("%.2f", get(stream_results_st, op_name, (GBs=NaN,)).GBs)
stream_table_data[i, 3] = Printf.@sprintf("%.2f", get(stream_results_mt, op_name, (GBs=NaN,)).GBs)
end
pretty_table(stream_table_data, header=stream_header, alignment=:c)
println("---------------------------------------------------\n")
# --- DGEMM Benchmark Section ---
matrix_sizes = [128, 256, 512, 640, 768, 896, 1024, 2048, 4096]
dgemm_results = Dict{Int, Dict{String, Float64}}()
# 1. Octavian DGEMM
println("Ensuring Octavian is using available Julia threads for its operations.")
run_dgemm_benchmark(octavian_dgemm!, "Octavian", matrix_sizes, dgemm_results)
# 2. Default BLAS (likely OpenBLAS)
println("Current BLAS vendor before explicit loads: ", LinearAlgebra.BLAS.vendor())
println("BLAS config: ", LinearAlgebra.BLAS.get_config())
LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())
println("Number of BLAS threads (BLAS.get_num_threads()): ", LinearAlgebra.BLAS.get_num_threads())
run_dgemm_benchmark(default_blas_dgemm!, "Default BLAS", matrix_sizes, dgemm_results)
# 3. MKL DGEMM via explicit DLL call
set_optimal_threads()
run_dgemm_benchmark(mkl_explicit_dgemm!, "MKL (DLL)", matrix_sizes, dgemm_results)
# 4. AOCL (LibBLIS) DGEMM via explicit DLL call
set_optimal_threads()
run_dgemm_benchmark(aocl_explicit_dgemm!, "AOCL (DLL)", matrix_sizes, dgemm_results)
# 5. LinearAlgebra.peakflops benchmark for each matrix size
println("\nRunning LinearAlgebra.peakflops Benchmark for various sizes...")
for m_size in matrix_sizes
# LinearAlgebra.peakflops returns the measured rate in FLOP/s; divide by 1e9 for GFLOPS.
peak_gflops_raw = try
LinearAlgebra.peakflops(m_size; eltype=Float64, ntrials=3, parallel=false)
catch e
@warn "LinearAlgebra.peakflops failed for size $m_size: $e"
NaN
end
# Convert the raw FLOP/s figure to GFLOPS for consistent display.
peak_gflops = peak_gflops_raw / 1.0e9
Printf.@printf("Peak GFLOPS (n=%5d): %8.2f GFLOPS\n", m_size, peak_gflops)
if !haskey(dgemm_results, m_size)
dgemm_results[m_size] = Dict{String, Float64}()
end
dgemm_results[m_size]["Peakflops"] = peak_gflops
end
println("-----------------------------------------")
println("\nDGEMM GFLOPS Summary:")
header = ["Matrix Size", "Octavian", "Default BLAS", "AOCL (DLL)", "MKL (DLL)", "Peakflops"]
data_table = Matrix{Any}(undef, length(matrix_sizes), length(header))
for (i, m_size) in enumerate(matrix_sizes)
data_table[i, 1] = string(m_size, "x", m_size)
results_for_size = get(dgemm_results, m_size, Dict{String, Float64}())
data_table[i, 2] = Printf.@sprintf("%.2f", get(results_for_size, "Octavian", NaN))
data_table[i, 3] = Printf.@sprintf("%.2f", get(results_for_size, "Default BLAS", NaN))
data_table[i, 4] = Printf.@sprintf("%.2f", get(results_for_size, "AOCL (DLL)", NaN))
data_table[i, 5] = Printf.@sprintf("%.2f", get(results_for_size, "MKL (DLL)", NaN))
data_table[i, 6] = Printf.@sprintf("%.2f", get(results_for_size, "Peakflops", NaN)) # Add Peakflops column
end
pretty_table(data_table, header=header, alignment=:c)
println("\nBenchmark Complete.")
end
main()
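Note: to exercise the multi-threaded STREAM path and Octavian's threading, start Julia with multiple threads, e.g. julia -t auto roofline.jl (the file name is assumed).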