Profiling doesn't count matrix products correctly

Here’s a function that should spend exactly half of its time doing a matrix product, the other half sleeping:

function f(A, B)
    starttime = time()
    C = A*B
    # C = sort(A[:])
    # C = svd(A)
    endtime = time()
    elapsed = endtime - starttime
    sleep(elapsed)
    return elapsed
end 

I make a couple of medium-sized matrices, and profile calling f on them, and the profiler tells me that roughly 6% of time was spent in the matrix product, the rest in sleep. If I switch the matrix product for something like a bit of sorting or an SVD, everything works as expected, and the profiler reports roughly 50/50 sleeping and working.

I’m using the OpenBLAS that comes with Julia. Here’s a bit from the output of Profile.print:

13058 @Base/asyncevent.jl:213; sleep
.
.
.
743   ...LinearAlgebra/src/matmul.jl:160; *
   724   ...LinearAlgebra/src/matmul.jl:208; mul!
  ╎ 724   ...inearAlgebra/src/matmul.jl:169; mul!
  ╎  724   ...inearAlgebra/src/matmul.jl:597; gemm_wrapper!(::Array{Float64,2}, ::Char...
  ╎   724   ...LinearAlgebra/src/blas.jl:1167; gemm!(::Char, ::Char, ::Float64, ::Arra...
  ╎    724   ...ib/julia/libopenblas64_.so:?; dgemm_64_
  ╎     724   ...b/julia/libopenblas64_.so:?; dgemm_thread_nn
  ╎    ╎ 724   ...b/julia/libopenblas64_.so:?; gemm_driver
  ╎    ╎  724   ...b/julia/libopenblas64_.so:?; exec_blas
  ╎    ╎   23    .../julia/libopenblas64_.so:?; exec_blas_async_wait
  ╎    ╎    21    ...6_64-linux-gnu/libc.so.6:?; __sched_yield
  ╎    ╎    1     ...inux-gnu/libpthread.so.0:?; pthread_mutex_lock
  ╎    ╎    1     ...inux-gnu/libpthread.so.0:?; pthread_mutex_unlock
  ╎    ╎   701   .../julia/libopenblas64_.so:?; inner_thread
  ╎    ╎    2     .../julia/libopenblas64_.so:?; 
  ╎    ╎    61    .../julia/libopenblas64_.so:?; dgemm_beta_HASWELL
  ╎    ╎    70    .../julia/libopenblas64_.so:?; dgemm_itcopy_HASWELL
  ╎    ╎    65    .../julia/libopenblas64_.so:?; dgemm_oncopy_HASWELL
  ╎    ╎    494   ...6_64-linux-gnu/libc.so.6:?; __sched_yield

Is there a reason why this would happen, or should I consider this a bug and file a report?

Posted a bit hastily, since a few minutes later I found this: https://github.com/JuliaLang/julia/issues/33605

Seems that what’s happening is that the BLAS calls just don’t get assigned to the proper callers. Indeed, when I look at Profile.print(; C=true), at the bottom, outside the usual call tree are my missing BLAS counts (from different run than the above, so numbers don’t match):

4316╎4316  ...-1.4.0/bin/../lib/julia/libopenblas64_.so:?; dgemm_kernel_HASWELL

These don’t appear without C=true, which is why I at first didn’t spot them at all.

1 Like