It seems like the first run of norm (and not only norm, but other functions as well) inside a function is slower than the second run, even after the initial compilation. This is even worse (around 100x) for heavier functions like gmres.
julia> using LinearAlgebra, CuArrays, CUDAnative
julia> function foo(a)
           CUDAnative.@time norm(a)
           CUDAnative.@time norm(a)
       end
foo (generic function with 1 method)

julia> foo(CuArray(randn(1000, 1000))); # compile
  0.000163 seconds
  0.000100 seconds

julia> foo(CuArray(randn(1000, 1000)))
  0.000116 seconds
  0.000069 seconds
1000.6017409044356
Proper timings with BenchmarkTools show that the execution times are very stable. Why do you care about the first, cold, and here very short execution time? Who knows what might be happening behind the scenes: cache warm-ups, device power-state toggles, …
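For reference, such a timing could look roughly like this (a minimal sketch, assuming BenchmarkTools' @btime with $-interpolation and the same CuArrays.@sync macro used in the demo further down):

using BenchmarkTools, CuArrays, LinearAlgebra

a = CuArray(randn(1000, 1000))

# Interpolate `a` with `$` so setup is excluded from the measurement, and
# synchronize the device so the GPU work is timed, not just the launch.
@btime CuArrays.@sync(norm($a))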
If, as you say, it also reproduces with much more expensive calls, where it might actually hurt performance, it is worth looking into. You could try wrapping those calls with NVTX markers (NVTX.jl, a Julia wrapper for the NVIDIA Tools Extension SDK: https://github.com/maleadt/NVTX.jl) and running under the NVIDIA profiler (nvprof) to see who's to blame.
It actually hurts the performance of my implementation, which is too long and complicated to post on Discourse. I was tracking down the problem myself and found that an analogous slowdown is what is slowing down my code: the gmres implementation has some nested function calls, and this currently causes a more than 100x slowdown.
I will try to run nvprof to see what’s happening inside!
Since NVTX is pretty rough around the edges, here’s a demo:
using CuArrays, CUDAdrv
using LinearAlgebra
using NVTX
# no way to query the CUDA profiler state,
# so manually disable NVTX
NVTX.stop()
function foo(a)
    NVTX.@range "first" norm(a)
    NVTX.@range "second" norm(a)
end

CuArrays.@sync foo(CuArray(randn(1000, 1000)))                                   # warm-up / compilation run
NVTX.@activate CUDAdrv.@profile CuArrays.@sync foo(CuArray(randn(1000, 1000)))   # profiled run
$ nvprof --profile-from-start off --unified-memory-profiling off julia wip.jl
==2515== NVPROF is profiling process 2515, command: julia wip.jl
==2515== Profiling application: julia wip.jl
==2515== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   79.75%  1.2797ms      3  426.55us     960ns  1.2777ms  [CUDA memcpy HtoD]
                   20.08%  322.28us      4  80.569us  34.080us  128.67us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.17%  2.6880us      2  1.3440us  1.1520us  1.5360us  [CUDA memcpy DtoH]
      API calls:   97.96%  85.572ms      4  21.393ms  6.8550us  85.543ms  cudaLaunchKernel
                    1.61%  1.4082ms      1  1.4082ms  1.4082ms  1.4082ms  cuMemcpyHtoD
                    0.39%  338.36us      4  84.589us  6.8030us  164.00us  cudaMemcpyAsync
                    0.01%  10.569us      2  5.2840us  3.5680us  7.0010us  cudaFuncGetAttributes
                    0.01%  5.4840us      2  2.7420us  2.1050us  3.3790us  cudaEventQuery
                    0.01%  5.3460us      2  2.6730us  1.6940us  3.6520us  cudaStreamSynchronize
                    0.00%  4.2530us      1  4.2530us  4.2530us  4.2530us  cuEventCreate
                    0.00%  3.2340us      2  1.6170us     933ns  2.3010us  cudaEventRecord
                    0.00%  2.6970us      8     337ns      88ns  1.0270us  cudaGetLastError
                    0.00%  1.2500us      1  1.2500us  1.2500us  1.2500us  cuEventSynchronize
                    0.00%     340ns      1     340ns     340ns     340ns  cuCtxGetCurrent
==2515== NVTX result:
==2515== Thread "<unnamed>" (id = 1847488)
==2515== Domain "<unnamed>"
==2515== Range "first"
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
          Range:  100.00%  85.762ms      1  85.762ms  85.762ms  85.762ms  first
 GPU activities:   98.49%  162.75us      2  81.377us  34.080us  128.67us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.93%  1.5360us      1  1.5360us  1.5360us  1.5360us  [CUDA memcpy DtoH]
                    0.58%     960ns      1     960ns     960ns     960ns  [CUDA memcpy HtoD]
      API calls:   99.80%  85.556ms      2  42.778ms  13.481us  85.543ms  cudaLaunchKernel
                    0.20%  167.55us      2  83.775us  16.726us  150.82us  cudaMemcpyAsync
==2515== Range "second"
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
          Range:  100.00%  201.21us      1  201.21us  201.21us  201.21us  second
 GPU activities:   98.69%  159.52us      2  79.761us  34.113us  125.41us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.71%  1.1520us      1  1.1520us  1.1520us  1.1520us  [CUDA memcpy DtoH]
                    0.59%     960ns      1     960ns     960ns     960ns  [CUDA memcpy HtoD]
      API calls:   91.48%  170.81us      2  85.403us  6.8030us  164.00us  cudaMemcpyAsync
                    8.52%  15.913us      2  7.9560us  6.8550us  9.0580us  cudaLaunchKernel
This gives you a pretty clear view of what happens behind the scenes. To debug a performance issue like this, you can sprinkle NVTX ranges and markers around to focus on the code that matters.
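For example, instrumenting a nested call could look roughly like this (a hypothetical inner/outer workload, purely for illustration, and assuming the ranges nest the way plain NVTX push/pop ranges do):

using CuArrays, CUDAdrv, LinearAlgebra, NVTX

NVTX.stop()   # keep markers disabled until the profiled run, as above

# hypothetical nested workload standing in for a solver iteration
function inner(a)
    NVTX.@range "mul"  a * a
    NVTX.@range "norm" norm(a)
end

function outer(a)
    NVTX.@range "iteration 1" inner(a)
    NVTX.@range "iteration 2" inner(a)
end

a = CuArray(randn(1000, 1000))
CuArrays.@sync outer(a)                                    # warm-up / compilation run
NVTX.@activate CUDAdrv.@profile CuArrays.@sync outer(a)    # profiled run

Each range then shows up as its own entry in the NVTX section of the nvprof output, which makes it easier to see which level of the call tree the expensive launches belong to.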