It seems like the first run of norm (and not only norm, but other functions as well) inside a function is slower than the second run, even after the initial compilation. This is even worse (around 100x) for heavier functions like gmres.
julia> using LinearAlgebra, CuArrays, CUDAnative
julia> function foo(a)
           CUDAnative.@time norm(a)
           CUDAnative.@time norm(a)
       end
foo (generic function with 1 method)

julia> foo(CuArray(randn(1000, 1000))); # compile
  0.000163 seconds
  0.000100 seconds

julia> foo(CuArray(randn(1000, 1000)))
  0.000116 seconds
  0.000069 seconds
1000.6017409044356
Proper timings with BenchmarkTools show that the execution times are very stable. Why do you care about the first, cold, and here very short execution time? Who knows what might be happening behind the scenes: cache warm-ups, device power-state toggles, …
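For reference, such a timing could look roughly like this (a minimal sketch, assuming BenchmarkTools' @btime with $-interpolation and the same CuArrays.@sync macro used in the demo further down):

using BenchmarkTools, CuArrays, LinearAlgebra

a = CuArray(randn(1000, 1000))

# Interpolate `a` with `$` so setup is excluded from the measurement, and
# synchronize the device so the GPU work is timed, not just the launch.
@btime CuArrays.@sync(norm($a))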
If, as you say, it also reproduces with much more expensive calls, where it might actually hurt performance, it is worth looking into. You could try wrapping those calls with NVTX markers (NVTX.jl, a Julia wrapper for the NVIDIA Tools Extension SDK: https://github.com/maleadt/NVTX.jl) and running under the NVIDIA profiler (nvprof) to see who's to blame.
It actually hurts the performance of my implementation, which is too long and complicated to post on Discourse. I was tracking down the problem myself and found that an analogous slowdown is what is slowing down my code: the gmres implementation has some nested function calls, and this currently causes a more than 100x slowdown.
I will try to run nvprof to see what’s happening inside!
Since NVTX is pretty rough around the edges, here’s a demo:
using CuArrays, CUDAdrv
using LinearAlgebra
using NVTX
# no way to query the CUDA profiler state,
# so manually disable NVTX
NVTX.stop()
function foo(a)
    NVTX.@range "first" norm(a)
    NVTX.@range "second" norm(a)
end

CuArrays.@sync foo(CuArray(randn(1000, 1000)))                                   # warm-up / compilation run
NVTX.@activate CUDAdrv.@profile CuArrays.@sync foo(CuArray(randn(1000, 1000)))   # profiled run
$ nvprof --profile-from-start off --unified-memory-profiling off julia wip.jl
==2515== NVPROF is profiling process 2515, command: julia wip.jl
==2515== Profiling application: julia wip.jl
==2515== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   79.75%  1.2797ms      3  426.55us     960ns  1.2777ms  [CUDA memcpy HtoD]
                   20.08%  322.28us      4  80.569us  34.080us  128.67us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.17%  2.6880us      2  1.3440us  1.1520us  1.5360us  [CUDA memcpy DtoH]
      API calls:   97.96%  85.572ms      4  21.393ms  6.8550us  85.543ms  cudaLaunchKernel
                    1.61%  1.4082ms      1  1.4082ms  1.4082ms  1.4082ms  cuMemcpyHtoD
                    0.39%  338.36us      4  84.589us  6.8030us  164.00us  cudaMemcpyAsync
                    0.01%  10.569us      2  5.2840us  3.5680us  7.0010us  cudaFuncGetAttributes
                    0.01%  5.4840us      2  2.7420us  2.1050us  3.3790us  cudaEventQuery
                    0.01%  5.3460us      2  2.6730us  1.6940us  3.6520us  cudaStreamSynchronize
                    0.00%  4.2530us      1  4.2530us  4.2530us  4.2530us  cuEventCreate
                    0.00%  3.2340us      2  1.6170us     933ns  2.3010us  cudaEventRecord
                    0.00%  2.6970us      8     337ns      88ns  1.0270us  cudaGetLastError
                    0.00%  1.2500us      1  1.2500us  1.2500us  1.2500us  cuEventSynchronize
                    0.00%     340ns      1     340ns     340ns     340ns  cuCtxGetCurrent
==2515== NVTX result:
==2515== Thread "<unnamed>" (id = 1847488)
==2515== Domain "<unnamed>"
==2515== Range "first"
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
          Range:  100.00%  85.762ms      1  85.762ms  85.762ms  85.762ms  first
 GPU activities:   98.49%  162.75us      2  81.377us  34.080us  128.67us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.93%  1.5360us      1  1.5360us  1.5360us  1.5360us  [CUDA memcpy DtoH]
                    0.58%     960ns      1     960ns     960ns     960ns  [CUDA memcpy HtoD]
      API calls:   99.80%  85.556ms      2  42.778ms  13.481us  85.543ms  cudaLaunchKernel
                    0.20%  167.55us      2  83.775us  16.726us  150.82us  cudaMemcpyAsync
==2515== Range "second"
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
          Range:  100.00%  201.21us      1  201.21us  201.21us  201.21us  second
 GPU activities:   98.69%  159.52us      2  79.761us  34.113us  125.41us  void nrm2_kernel<double, double, double, int=0, int=0, int=128, int=0>(cublasNrm2Params<double, double>)
                    0.71%  1.1520us      1  1.1520us  1.1520us  1.1520us  [CUDA memcpy DtoH]
                    0.59%     960ns      1     960ns     960ns     960ns  [CUDA memcpy HtoD]
      API calls:   91.48%  170.81us      2  85.403us  6.8030us  164.00us  cudaMemcpyAsync
                    8.52%  15.913us      2  7.9560us  6.8550us  9.0580us  cudaLaunchKernel
This gives you a pretty clear view of what happens behind the scenes. To debug a performance issue like this, you can sprinkle NVTX ranges and markers around to focus on the code that matters.
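For example, instrumenting a nested call could look roughly like this (a hypothetical inner/outer workload, purely for illustration, and assuming the ranges nest the way plain NVTX push/pop ranges do):

using CuArrays, CUDAdrv, LinearAlgebra, NVTX

NVTX.stop()   # keep markers disabled until the profiled run, as above

# hypothetical nested workload standing in for a solver iteration
function inner(a)
    NVTX.@range "mul"  a * a
    NVTX.@range "norm" norm(a)
end

function outer(a)
    NVTX.@range "iteration 1" inner(a)
    NVTX.@range "iteration 2" inner(a)
end

a = CuArray(randn(1000, 1000))
CuArrays.@sync outer(a)                                    # warm-up / compilation run
NVTX.@activate CUDAdrv.@profile CuArrays.@sync outer(a)    # profiled run

Each range then shows up as its own entry in the NVTX section of the nvprof output, which makes it easier to see which level of the call tree the expensive launches belong to.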