ProfileView and Benchmark Disagree on Timing for Simple Matrix Multiplication

andrew-saydjari · March 14, 2023, 4:59pm

I am trying to benchmark and improve the cost for matrix multiplication for a custom type I have implemented. But, in trying to do that, I have become confused about the profile and benchmarking results for a simple matrix multiplication MWE below. As you can see, the time shown in the profile view window is off by and order of magnitude from the true benchmarking time. I am concerned that somehow I am missing calls in the profile return? I am also struggling to understand what the dgemm_kernel_ZEN call means and why it is taking so much time in the flame graph.

using BenchmarkTools, LinearAlgebra, Random
BLAS.set_num_threads(1)
using Profile, ProfileSVG
Profile.init(n = 10^8, delay = 10^(-7)) #delay is in seconds
ProfileSVG.set_default(bgcolor=:transparent,timeunit=:ms,maxframes=1_000_000,maxdepth=3000);

rng = MersenneTwister(2023)
A = randn(rng,7453,7453)
B = randn(rng,7453,50)
C = randn(rng,7453,50);

function mat_mul!(C::Matrix{Float64},A::Matrix{Float64},B::Matrix{Float64})
    mul!(C,A,B)
    return
end
mat_mul!(C,A,B)
@benchmark mat_mul!(C,A,B)

Profile.clear()
@profile mat_mul!(C,A,B)
ProfileSVG.view(C=true)

Oscar_Smith · March 14, 2023, 5:07pm

I believe the difference is that profiling adds time across all cores while timing uses real time.

andrew-saydjari · March 14, 2023, 5:10pm

But I have explicitly set the number of BLAS threads to 1, so I don’t understand how the real time can be less than the benchmarked 163 ms?

gdalle · March 14, 2023, 5:30pm

Does it change something when you use interpolation for global variables in benchmarking?

andrew-saydjari · March 14, 2023, 5:35pm

No, though I should have done that in the original example. Thanks for pointing that out. The benchmarking result is unchanged for

@benchmark mat_mul!($C,$A,$B)

I am fairly sure that the 163 ms is the “right” timing that I would expect for this machine and size of matrices. It is the profile results not matching up and having such a large fraction of the time being spent in this kernel call that is really confusing me.

gdalle · March 14, 2023, 6:48pm

Is it reproducible?

andrew-saydjari · March 14, 2023, 6:50pm

I have not tried another machine, but I have tried calling profile and benchmark multiple times. There are small GC-esque variations, but all of the features I discussed above that are causing me discomfort reproduce.

mihalybaci · March 14, 2023, 7:19pm

I just tried this test on my machine.The @benchmark results are comparable, I get about 200 ms. For the dgemm_kernel ProfileSVG shows 0.039 ms. Even when I just tried @profile sleep(1) the reported time is far less than 1 second. I suppose its best to trust BenchmarkTools with the timing aspect.

Ralph_Smith · March 15, 2023, 3:22am

This is not a good idea. Use a sensible delay and profile a suitable set of iterations.

It looks as if the profiler is sampling idle OpenBLAS threads. Other profile outputs (e.g. Profile.print) show a large count disconnected from the execution graph.

Furthermore, the normalization of profiler counts has (always?) been a bit weird, so there may be a fudge factor even when there is consistency.

andrew-saydjari · March 15, 2023, 3:48am

I see. I should have realized that there was a reasonable lower limit on delay. After restoring it to 10^(-3) and just iterating 100 times, the total ProfileView and Benchmark times are consistent (after multiplying by the number of iterations of course).

Can you elaborate on what a lot of samplings of idle OpenBLAS threads might mean or what to do to better understand what is going on?

Ralph_Smith · March 16, 2023, 3:48am

I was wrong about the counts in the DGEMM kernel code; I now think they are real but don’t get properly attached to the graph because the kernel is called in such a way that the profiler doesn’t get the information needed for a normal stack trace. That is, the time associated with it should really be added to the stack on the left in your original plot.

andrew-saydjari · March 16, 2023, 4:18pm

Is it worth opening up an issue on the Profile package? It seems not great that a simple mul! call is not connected to the graph correctly. Personally, a lot of my runtime is matrix multiplication, so this may bother me more than most.

Ralph_Smith · March 17, 2023, 2:00am

There’s already at least one issue: here
It’s not practical to modify Profile to handle cases like this.
But if your use case is just the basic GEMM, then you can use the Profile-friendly matmul_serial! from Octavian instead of mul! for these studies.

andrew-saydjari · March 28, 2023, 3:28am

FWIW, BLISBLAS also seems to be correctly linked in the Profiler and also bought me a tidy speedup for my specific problem.

Topic		Replies	Views
Inaccurate sampling of profiler? General Usage question	4	497	October 19, 2019
Profiling doesn't count matrix products correctly General Usage blas , profiling	1	674	May 29, 2020
CUDA v2 - performance regression on matrix multiplication GPU	14	1765	November 10, 2020
Fluctuations when measuring execution time of linear algebra code Performance	12	1067	July 19, 2019
Spurious allocation? General Usage question	6	1291	May 16, 2017

ProfileView and Benchmark Disagree on Timing for Simple Matrix Multiplication

Related topics