Below is a profile report for an experiment I'm running. The most expensive device function is cusparse::csrmv_v3_transpose_kernel, so I tried googling the function name to read more about it, but couldn't find anything. According to CUDA.versioninfo(), the CUSPARSE library in use is 12.5.4, so I also searched the 12.5 documentation (for "csrmv_v3_transpose_kernel" and for just "csrmv"), but couldn't find it there either. My question is: how do the function names printed by the profiler relate to the documentation available for CUDA or for CUDA.jl?
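For context, this is roughly how I collect the profile and check the library version (a simplified sketch, not the actual experiment; the sum(x) call just stands in for the real per-iteration work):

```julia
using CUDA

CUDA.versioninfo()   # reports the library versions, including CUSPARSE (12.5.4 here)

x = CUDA.rand(Float64, 10_000)

# CUDA.jl's integrated profiler produces the host/device activity tables shown below
CUDA.@profile begin
    for _ in 1:100
        sum(x)   # stand-in for one iteration of my actual experiment
    end
end
```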
Host-side activity: calling CUDA APIs took 55.25 ms (41.36% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬─────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────────────────┼─────────────────────────────┤
│ 23.04% │ 30.77 ms │ 204 │ 150.85 µs ± 15.6 ( 59.84 ‥ 169.28) │ cudaMemcpyAsync │
│ 4.27% │ 5.7 ms │ 1109 │ 5.14 µs ± 1.1 ( 3.58 ‥ 31.71) │ cuLaunchKernel │
│ 2.94% │ 3.93 ms │ 1016 │ 3.87 µs ± 1.71 ( 2.62 ‥ 33.14) │ cudaLaunchKernel │
│ 2.87% │ 3.83 ms │ 1213 │ 3.16 µs ± 0.91 ( 1.43 ‥ 16.69) │ cuMemAllocFromPoolAsync │
│ 1.60% │ 2.14 ms │ 202 │ 10.59 µs ± 0.87 ( 9.54 ‥ 16.93) │ cuMemcpyDtoHAsync │
│ 0.49% │ 657.32 µs │ 204 │ 3.22 µs ± 0.54 ( 2.62 ‥ 9.78) │ cudaFuncGetAttributes │
│ 0.42% │ 555.28 µs │ 202 │ 2.75 µs ± 0.18 ( 2.38 ‥ 3.81) │ cuMemFreeAsync │
│ 0.41% │ 541.93 µs │ 204 │ 2.66 µs ± 0.25 ( 1.91 ‥ 4.53) │ cudaStreamSynchronize │
│ 0.30% │ 396.25 µs │ 404 │ 980.82 ns ± 343.96 (476.84 ‥ 6437.3) │ cuStreamSynchronize │
│ 0.23% │ 312.81 µs │ 1020 │ 306.67 ns ± 127.72 ( 0.0 ‥ 1192.09) │ cudaStreamGetCaptureInfo_v2 │
│ 0.16% │ 217.91 µs │ 204 │ 1.07 µs ± 0.54 ( 0.72 ‥ 7.87) │ cudaEventRecord │
│ 0.08% │ 101.09 µs │ 818 │ 123.58 ns ± 138.13 ( 0.0 ‥ 953.67) │ cudaGetLastError │
└──────────┴────────────┴───────┴───────────────────────────────────────┴─────────────────────────────┘
Device-side activity: GPU was busy for 44.64 ms (33.42% of the trace)
┌──────────┬────────────┬───────┬───────────────────────────────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Time (%) │ Total time │ Calls │ Time distribution │ Name ⋯
├──────────┼────────────┼───────┼───────────────────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ 17.99% │ 24.03 ms │ 202 │ 118.94 µs ± 1.31 (115.87 ‥ 122.79) │ void cusparse::csrmv_v3_transpose_kernel<int, int, double, double, double, double, void>(cusparse::KernelCoeffs<double>, int const*, int const*, int const*, double const*, int, int, int, d ⋯
│ 8.47% │ 11.31 ms │ 404 │ 27.99 µs ± 13.81 ( 14.07 ‥ 42.92) │ void nrm2_kernel<double, double, double, 0, 0, 128>(cublasNrm2Params<int, double, double>) ⋯
│ 1.91% │ 2.55 ms │ 403 │ 6.33 µs ± 1.31 ( 4.53 ‥ 7.87) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│ 1.46% │ 1.95 ms │ 202 │ 9.67 µs ± 0.25 ( 9.06 ‥ 10.49) │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<1, Tuple<OneTo<Int64>>>, CartesianIndices<1, Tuple<OneTo<Int64>>>, Val<true>, CuDeviceArray<Float64, 2, 1>, CuDeviceArra ⋯
│ 0.98% │ 1.31 ms │ 202 │ 6.48 µs ± 0.19 ( 5.96 ‥ 6.91) │ void cusparse::csrmv_v3_partition_kernel<std::integral_constant<bool, false>, 256, int, int, double, double, double>(int const*, int, int, int, int*) ⋯
│ 0.89% │ 1.19 ms │ 202 │ 5.88 µs ± 0.28 ( 4.53 ‥ 6.68) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│ 0.69% │ 928.4 µs │ 202 │ 4.6 µs ± 0.4 ( 4.05 ‥ 6.68) │ partial_mapreduce_grid(identity, add_sum, Float64, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, CartesianIndices<2, Tuple<OneTo<Int64>, OneTo<Int64>>>, Val<true>, CuDeviceArray< ⋯
│ 0.48% │ 634.91 µs │ 100 │ 6.35 µs ± 0.25 ( 5.72 ‥ 6.68) │ gpu_broadcast_kernel_linear(CompilerMetadata<DynamicSize, DynamicCheck, void, CartesianIndices<1, Tuple<OneTo<Int64>>>, NDRange<1, DynamicSize, DynamicSize, CartesianIndices<1, Tuple<OneTo ⋯
│ 0.30% │ 394.82 µs │ 406 │ 972.47 ns ± 170.65 (715.26 ‥ 1668.93) │ [copy device to pageable memory] ⋯
│ 0.21% │ 276.8 µs │ 202 │ 1.37 µs ± 0.16 ( 1.19 ‥ 1.67) │ void cusparse::vector_scalar_multiply_kernel<256, cusparse::AlignedVectorScalarMultiplyPolicy, int, double, double>(cusparse::KernelCoeff<double>, int, double*) ⋯
│ 0.03% │ 39.82 µs │ 1 │ │ void gen_sequenced<curandStateXORWOW, double2, normal_args_double_st, &double2 curand_normal_scaled2_double<curandStateXORWOW>(curandStateXORWOW*, normal_args_double_st), rng_config<curand ⋯
│ 0.01% │ 17.4 µs │ 2 │ 8.7 µs ± 0.17 ( 8.58 ‥ 8.82) │ void dot_kernel<double, 128, 0, cublasDotParams<cublasGemvTensor<double const>, cublasGemvTensorStridedBatched<double>>>(cublasDotParams<cublasGemvTensor<double const>, cublasGemvTensorStr ⋯
│ 0.01% │ 8.34 µs │ 2 │ 4.17 µs ± 0.51 ( 3.81 ‥ 4.53) │ void reduce_1Block_kernel<double, 128, 7, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>, cublasGemvTensorStridedBatched<double>>(double const*, double, cub ⋯
│ 0.00% │ 3.34 µs │ 1 │ │ void scal_kernel_val<double, double>(cublasScalParamsVal<double, double>) ⋯
└──────────┴────────────┴───────┴───────────────────────────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
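In case it matters: I assume the hot kernel comes from a transposed sparse matrix-vector product somewhere in my code. A minimal sketch of that kind of call (hypothetical sizes, not my actual code) would be:

```julia
using CUDA, CUDA.CUSPARSE, SparseArrays

# Hypothetical example of the kind of operation I believe ends up in
# cusparse::csrmv_v3_transpose_kernel: SpMV with a transposed CSR matrix.
A = CuSparseMatrixCSR(sprand(Float64, 10_000, 10_000, 1e-3))
x = CUDA.rand(Float64, 10_000)

y = transpose(A) * x   # dispatched to CUSPARSE's SpMV with the transpose operation
```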