I am trying to benchmark and improve the cost of matrix multiplication for a custom type I have implemented, but in doing so I have become confused by the profiling and benchmarking results for the simple matrix-multiplication MWE below. As you can see, the time shown in the profile view window is off by an order of magnitude from the true benchmarking time. Am I somehow missing calls in the profile output? I am also struggling to understand what the dgemm_kernel_ZEN call means and why it takes so much time in the flame graph.
using BenchmarkTools, LinearAlgebra, Random
BLAS.set_num_threads(1)
using Profile, ProfileSVG
Profile.init(n = 10^8, delay = 10^(-7)) #delay is in seconds
ProfileSVG.set_default(bgcolor=:transparent,timeunit=:ms,maxframes=1_000_000,maxdepth=3000);
rng = MersenneTwister(2023)
A = randn(rng,7453,7453)
B = randn(rng,7453,50)
C = randn(rng,7453,50);
function mat_mul!(C::Matrix{Float64}, A::Matrix{Float64}, B::Matrix{Float64})
    mul!(C, A, B)
    return
end
mat_mul!(C,A,B)
@benchmark mat_mul!(C,A,B)
Profile.clear()
@profile mat_mul!(C,A,B)
ProfileSVG.view(C=true)
No, though I should have done that in the original example. Thanks for pointing that out. The benchmarking result is unchanged for
@benchmark mat_mul!($C,$A,$B)
I am fairly sure that the 163 ms is the “right” timing I would expect for this machine and these matrix sizes. What really confuses me is that the profile results do not match up, and that such a large fraction of the time is spent in this kernel call.
I have not tried another machine, but I have tried calling profile and benchmark multiple times. There are small GC-esque variations, but all of the features I discussed above that are causing me discomfort reproduce.
I just tried this test on my machine. The @benchmark results are comparable; I get about 200 ms. For the dgemm_kernel, ProfileSVG shows 0.039 ms. Even when I just tried @profile sleep(1), the reported time was far less than 1 second. I suppose it's best to trust BenchmarkTools with the timing aspect.
That delay is not a good idea: 10^(-7) s is far shorter than the profiler can reliably sample at. Use a sensible delay and profile a suitable number of iterations instead.
It looks as if the profiler is sampling idle OpenBLAS threads. Other profile outputs (e.g. Profile.print) show a large count disconnected from the execution graph.
Furthermore, the normalization of profiler counts has (always?) been a bit weird, so there may be a fudge factor even when there is consistency.
I see. I should have realized that there was a reasonable lower limit on delay. After restoring it to 10^(-3) and just iterating 100 times, the total ProfileView and Benchmark times are consistent (after multiplying by the number of iterations of course).
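For reference, a minimal sketch of that approach, reusing the names from the MWE above: restore a millisecond-scale delay and wrap the call in a loop so the total runtime is large relative to the sampling period. (The loop count of 100 matches what I described; adjust to taste.)

```julia
using LinearAlgebra, Profile

BLAS.set_num_threads(1)

A = randn(7453, 7453)
B = randn(7453, 50)
C = zeros(7453, 50)

# 1 ms delay is comfortably above the profiler's practical lower limit.
Profile.init(n = 10^7, delay = 10^(-3))

mul!(C, A, B)  # warm up / compile before profiling

Profile.clear()
@profile for _ in 1:100
    mul!(C, A, B)
end
Profile.print(maxdepth = 5)
# Total sampled time should now be roughly 100x the single-call benchmark time.
```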
Can you elaborate on what a lot of samplings of idle OpenBLAS threads might mean or what to do to better understand what is going on?
I was wrong about the counts in the DGEMM kernel code; I now think they are real but don’t get properly attached to the graph because the kernel is called in such a way that the profiler doesn’t get the information needed for a normal stack trace. That is, the time associated with it should really be added to the stack on the left in your original plot.
Is it worth opening up an issue on the Profile package? It seems not great that a simple mul! call is not connected to the graph correctly. Personally, a lot of my runtime is matrix multiplication, so this may bother me more than most.
There’s already at least one issue: here
It’s not practical to modify Profile to handle cases like this. But if your use case is just basic GEMM, you can use the Profile-friendly matmul_serial! from Octavian instead of mul! for these studies.