Tracking a function from the profiler back to the CUDA documentation

Below is a profile report for an experiment I’m trying. The most expensive device function is cusparse::csrmv_v3_transpose_kernel, so I tried to google the function name to read more, but couldn’t find anything. According to CUDA.versioninfo(), the CUSPARSE library being used is 12.5.4, so I searched the 12.5 docs (for “csrmv_v3_transpose_kernel” and also just “csrmv”), but couldn’t find it there either. My question is: how do the function names printed by the profiler relate to the documentation available, either for CUDA or for CUDA.jl?

Host-side activity: calling CUDA APIs took 55.25 ms (41.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                        │
├──────────┼────────────┼───────┼───────────────────────────────────────┼─────────────────────────────┤
│   23.04% │   30.77 ms │   204 │ 150.85 µs ± 15.6   ( 59.84 ‥ 169.28)  │ cudaMemcpyAsync             │
│    4.27% │     5.7 ms │  1109 │   5.14 µs ± 1.1    (  3.58 ‥ 31.71)   │ cuLaunchKernel              │
│    2.94% │    3.93 ms │  1016 │   3.87 µs ± 1.71   (  2.62 ‥ 33.14)   │ cudaLaunchKernel            │
│    2.87% │    3.83 ms │  1213 │   3.16 µs ± 0.91   (  1.43 ‥ 16.69)   │ cuMemAllocFromPoolAsync     │
│    1.60% │    2.14 ms │   202 │  10.59 µs ± 0.87   (  9.54 ‥ 16.93)   │ cuMemcpyDtoHAsync           │
│    0.49% │  657.32 µs │   204 │   3.22 µs ± 0.54   (  2.62 ‥ 9.78)    │ cudaFuncGetAttributes       │
│    0.42% │  555.28 µs │   202 │   2.75 µs ± 0.18   (  2.38 ‥ 3.81)    │ cuMemFreeAsync              │
│    0.41% │  541.93 µs │   204 │   2.66 µs ± 0.25   (  1.91 ‥ 4.53)    │ cudaStreamSynchronize       │
│    0.30% │  396.25 µs │   404 │ 980.82 ns ± 343.96 (476.84 ‥ 6437.3)  │ cuStreamSynchronize         │
│    0.23% │  312.81 µs │  1020 │ 306.67 ns ± 127.72 (   0.0 ‥ 1192.09) │ cudaStreamGetCaptureInfo_v2 │
│    0.16% │  217.91 µs │   204 │   1.07 µs ± 0.54   (  0.72 ‥ 7.87)    │ cudaEventRecord             │
│    0.08% │  101.09 µs │   818 │ 123.58 ns ± 138.13 (   0.0 ‥ 953.67)  │ cudaGetLastError            │
└──────────┴────────────┴───────┴───────────────────────────────────────┴─────────────────────────────┘

Device-side activity: GPU was busy for 44.64 ms (33.42% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution                     │ Name                                                                                                                                                                                         ⋯
├──────────┼────────────┼───────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│   17.99% │   24.03 ms │   202 │ 118.94 µs ± 1.31   (115.87 ‥ 122.79)  │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, d ⋯
│    8.47% │   11.31 ms │   404 │  27.99 µs ± 13.81  ( 14.07 ‥ 42.92)   │ void nrm2_kernel<double, double, double, 0, 0, 128>(cublasNrm2Params<int, double, double>)                                                                                                   ⋯
│    1.91% │    2.55 ms │   403 │   6.33 µs ± 1.31   (  4.53 ‥ 7.87)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│    1.46% │    1.95 ms │   202 │   9.67 µs ± 0.25   (  9.06 ‥ 10.49)   │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>, Val<true>, CuDeviceArray<Float64, 2, 1>, CuDeviceArra ⋯
│    0.98% │    1.31 ms │   202 │   6.48 µs ± 0.19   (  5.96 ‥ 6.91)    │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, int, int, int, int*)                                        ⋯
│    0.89% │    1.19 ms │   202 │   5.88 µs ± 0.28   (  4.53 ‥ 6.68)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│    0.69% │   928.4 µs │   202 │    4.6 µs ± 0.4    (  4.05 ‥ 6.68)    │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, Val<true>, CuDeviceArray< ⋯
│    0.48% │  634.91 µs │   100 │   6.35 µs ± 0.25   (  5.72 ‥ 6.68)    │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│    0.30% │  394.82 µs │   406 │ 972.47 ns ± 170.65 (715.26 ‥ 1668.93) │ [copy device to pageable memory]                                                                                                                                                             ⋯
│    0.21% │   276.8 µs │   202 │   1.37 µs ± 0.16   (  1.19 ‥ 1.67)    │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::AlignedVectorScalarMultiplyPolicy, int, double, double>(cusparse::KernelCoeff<double>, int, double*)                             ⋯
│    0.03% │   39.82 µs │     1 │                                       │ void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st), rng_config<curand ⋯
│    0.01% │    17.4 µs │     2 │    8.7 µs ± 0.17   (  8.58 ‥ 8.82)    │ void dot_kernel<double, 128, 0, cublasDotParams<cublasGemvTensor<double const>, cublasGemvTensorStridedBatched<double>>>(cublasDotParams<cublasGemvTensor<double const>, cublasGemvTensorStr ⋯
│    0.01% │    8.34 µs │     2 │   4.17 µs ± 0.51   (  3.81 ‥ 4.53)    │ void reduce_1Block_kernel<double, 128, 7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const*, double, cub ⋯
│    0.00% │    3.34 µs │     1 │                                       │ void scal_kernel_val<double, double>(cublasScalParamsVal<double, double>)                                                                                                                    ⋯
└──────────┴────────────┴───────┴───────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
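For reference, the report above is the output of CUDA.jl’s integrated profiler. It was produced with something along these lines, where experiment! and state are placeholders standing in for the actual workload, not real code from the experiment:

    using CUDA

    CUDA.versioninfo()                  # reports the toolkit and library versions, e.g. CUSPARSE 12.5.4
    CUDA.@profile experiment!(state)    # prints the host-side and device-side activity tables shown above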
                   

The names you’re seeing are the names of the kernels. In the case of native Julia kernels, you should be able to find them in the respective source code. Here, however, you’re calling into the binary CUSPARSE library and executing one of its kernels, whose name does not necessarily correspond to the API function that was called.
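As a concrete (hypothetical) illustration: if the experiment does a transposed sparse matrix-vector product, the Julia-level call is mul!, the CUSPARSE routine underneath is the generic SpMV API, and the kernel the library launches internally, e.g. csrmv_v3_transpose_kernel, is what shows up in the profile under its own internal name:

    using CUDA, CUDA.CUSPARSE, SparseArrays, LinearAlgebra

    A = CuSparseMatrixCSR(sprand(10_000, 10_000, 1e-3))  # Float64 CSR matrix on the GPU
    x = CUDA.rand(Float64, 10_000)
    y = CUDA.zeros(Float64, 10_000)

    # One high-level call; CUSPARSE decides which of its kernels to launch,
    # and it is that kernel's name that the profiler reports.
    mul!(y, transpose(A), x)

That internal kernel name is not part of the documented CUSPARSE API, which is why searching the docs for it comes up empty.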

To figure out where this call comes from, you can annotate your source code with NVTX ranges using NVTX.jl’s @annotate. That will group the kernels by the NVTX range they were launched from, allowing you to narrow down where the operations were submitted.
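For instance (a sketch; solve_step! and its contents are placeholders for whatever the experiment actually does):

    using CUDA, NVTX, LinearAlgebra

    function solve_step!(y, A, x)
        NVTX.@annotate "spmv" begin
            mul!(y, transpose(A), x)   # CUSPARSE kernels get attributed to the "spmv" range
        end
        NVTX.@annotate "residual" begin
            norm(y)                    # CUBLAS nrm2 kernels get attributed to "residual"
        end
    end

    CUDA.@profile solve_step!(y, A, x)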

There’s cusparseCsrmvEx in the documentation, though? We don’t call that kernel ourselves, so it’s probably launched internally by another CUSPARSE operation.
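If you want to check which CUSPARSE entry point a given Julia call resolves to, you can also follow the dispatch chain yourself (a sketch, assuming A, x and y as in the earlier example):

    using InteractiveUtils   # @which / @edit

    @which mul!(y, transpose(A), x)   # shows the CUDA.jl wrapper method handling this call
    # @edit on the same call opens that method's source, from which you can
    # follow it down to the underlying ccall into libcusparse.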
