Is it possible to merge multiple kernels in CUDAnative to minimize launch and execution overhead?

I defined several in-place operations on a vector, but the overhead is significant.

julia> using CuArrays, CUDAnative, LinearAlgebra, BenchmarkTools

julia> function ms1!(X::CuArray, a, b, c)
           function kernel(X, a, b, c)
               i = (blockIdx().x-1) * blockDim().x + threadIdx().x
               @inbounds X[i] *= a
               @inbounds X[i] *= b
               @inbounds X[i] *= c
               return
           end
           @cuda blocks=length(X)÷256 threads=256 kernel(X, a, b, c)
           X
       end;

julia> function ms2!(X::CuArray, s::Number)
           function kernel(X, s)
               i = (blockIdx().x-1) * blockDim().x + threadIdx().x
               @inbounds X[i] *= s
               return
           end
           @cuda blocks=length(X)÷256 threads=256 kernel(X, s)
           X
       end;

julia> cua = randn(1<<10) |> cu;

julia> @benchmark ms1!($cua, 1.001, 1.001, 1.001)
BenchmarkTools.Trial: 
  memory estimate:  704 bytes
  allocs estimate:  24
  --------------
  minimum time:     4.885 μs (0.00% GC)
  median time:      5.087 μs (0.00% GC)
  mean time:        6.417 μs (17.86% GC)
  maximum time:     8.653 ms (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark ms2!(ms2!(ms2!($cua, 1.001), 1.001), 1.001)
BenchmarkTools.Trial: 
  memory estimate:  1.59 KiB
  allocs estimate:  54
  --------------
  minimum time:     14.188 μs (0.00% GC)
  median time:      14.982 μs (0.00% GC)
  mean time:        15.555 μs (0.00% GC)
  maximum time:     90.658 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The launch and execution overheads are dominating here.
Is it possible to merge multiple kernels either automatically or manually at run time?

Note: this is just a demo; in my real application the operations are not element-wise, so merging broadcasts will not help.

No, kernel merging isn’t being considered right now. On CUDA 9 (or 10) and up, kernel launch overheads should have improved, though.

Also try profiling (with nvprof) to compare the time spent in API calls with the observed host time of these functions; if there’s a large disparity, there might be an issue with the host code generated by CUDAnative.

Try broadcasting over an array wrapper that implements the indexing semantics you want?
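For the element-wise demo above, note that a fused (dotted) broadcast already runs everything in a single kernel launch; the wrapper idea is for when broadcasting directly over X isn’t expressive enough. A minimal sketch:

using CuArrays

X = cu(randn(1 << 10))

# All three scalings fuse into a single broadcast kernel, i.e. one launch.
X .= X .* 1.001f0 .* 1.001f0 .* 1.001f0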

Thanks, these suggestions are very helpful.

I am using CUDA 8 now; I will switch to CUDA 10 and test my program again.

I have tried CUDA 10.0, but didn’t see any significant improvement.

I should probably refactor my code to generate more flexible kernels. :stuck_out_tongue:

Thank you all the same, @maleadt

I am trying to merge multiple kernels like this, but I get an error:

function kernel(blk::Union{ChainBlock, Sequential})
    ks = Tuple(kernel(bi) for bi in subblocks(blk))  # A tuple of simple functions
    function kf(state, inds)
        for kfi in ks
            kfi(state, inds)   # call each kernel one by one
            CuArrays.sync_threads()
        end
    end
end

# some test code gives the following error message

InvalidIRError: compiling simple_kernel(getfield(Main, Symbol("#kf#56")){Tuple{getfield(Main, Symbol("#kernel#19")){getfield(Yao.Intrinsics, Symbol("##22#23")){Int64,Int64},Int64},getfield(Main, Symbol("#kernel#13")){SArray{Tuple{2,2},Complex{Float64},2,4},SArray{Tuple{2},Int64,1,2},getfield(Yao.Intrinsics, Symbol("##22#23")){Int64,Int64}},getfield(Main, Symbol("#kernel#13")){SArray{Tuple{2,2},Complex{Float64},2,4},SArray{Tuple{2},Int64,1,2},getfield(Yao.Intrinsics, Symbol("##22#23")){Int64,Int64}}}}, CuDeviceArray{Complex{Float64},2,CUDAnative.AS.Global}) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_gc_pool_alloc)
Stacktrace:
 [1] kf at /home/leo/.julia/dev/CuYao/src/gcompile.jl:14

Will code generation help here?
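For instance, would something along these lines work? The idea is to unroll the tuple recursively so that every call is compiled against a concrete kernel type instead of looping over a heterogeneous tuple at run time (apply_all is just a name I made up, and this is an untested sketch):

@inline apply_all(::Tuple{}, state, inds) = nothing
@inline function apply_all(ks::Tuple, state, inds)
    first(ks)(state, inds)          # concrete call, no dynamic dispatch
    CUDAnative.sync_threads()
    apply_all(Base.tail(ks), state, inds)
end

function kernel(blk::Union{ChainBlock, Sequential})
    ks = Tuple(kernel(bi) for bi in subblocks(blk))
    (state, inds) -> apply_all(ks, state, inds)
end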

Some kernels swap elements at different positions; is it possible to express that with broadcasting?

Not sure what you’re trying to do with that example.

I guess it’s possible, but the interface isn’t going to be great. We used to have a map-like abstraction that passed the container and an index to the op (Remove mapidx. by maleadt · Pull Request #158 · JuliaGPU/GPUArrays.jl · GitHub), but we removed it since it was unused (in an effort to make GPUArrays/CuArrays more like the Base interface). It also wasn’t broadcast-based, so it wouldn’t fuse.

I have seen the removed code; this is exactly what I want. :stuck_out_tongue_winking_eye:

Maybe this mapidx is useful, although the interface is a bit non-intuitive.
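Something in that spirit can probably be sketched on top of gpu_call plus the index macros from GPUArrays (mapidx! is my own placeholder name, and I’m assuming @linearidx works the same way as the @cartesianidx used further down in this thread; untested):

using GPUArrays, CuArrays

# Pass the whole container plus a linear index to `f`, so `f` may read and
# write arbitrary positions (e.g. swap elements), not just X[i].
function mapidx!(f, X::CuArray, args)
    gpu_call(X, (X, args)) do state, X, args
        i = @linearidx X state
        f(X, i, args)
        return
    end
    X
end

# e.g. apply all three scalings in a single launch
mapidx!((X, i, fs) -> (@inbounds X[i] *= fs[1] * fs[2] * fs[3]),
        cu(randn(1 << 10)), (1.001f0, 1.001f0, 1.001f0))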

Although my CUDA program gets a 35× speedup with respect to the CPU, 99% of the time is spent launching kernels… The following is the nvprof result.

➜  modules git:(GPUdemo) ✗ nvprof julia QCBMS.jl
==22279== NVPROF is profiling process 22279, command: julia QCBMS.jl
==22279== Profiling application: julia QCBMS.jl
==22279== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   70.36%  77.0104s    810000  95.074us  74.113us  279.72us  ptxcall_simple_kernel_2
                   28.96%  31.6927s    720000  44.017us  32.896us  113.19us  ptxcall_simple_kernel_3
                    0.68%  748.96ms     10000  74.895us  72.801us  79.361us  ptxcall_anonymous23_1
                    0.00%  1.1371ms         4  284.27us  1.7600us  1.0389ms  [CUDA memcpy HtoD]
      API calls:   99.11%  90.5692s   1540000  58.811us  6.5610us  9.6723ms  cuLaunchKernel
                    0.43%  389.37ms   1540034     252ns     145ns  649.65us  cuCtxGetCurrent
                    0.23%  210.94ms         1  210.94ms  210.94ms  210.94ms  cuCtxCreate
                    0.14%  129.13ms         1  129.13ms  129.13ms  129.13ms  cuCtxDestroy
                    0.07%  65.987ms         3  21.996ms  47.171us  65.891ms  cuModuleUnload
                    0.01%  13.700ms        27  507.41us  439.26us  724.08us  cuMemAlloc
                    0.00%  2.5056ms         3  835.19us  348.68us  1.7719ms  cuModuleLoadDataEx
                    0.00%  1.4557ms         4  363.94us  43.000us  1.1706ms  cuMemcpyHtoD
                    0.00%  36.489us         8  4.5610us  3.6320us  8.1710us  cuDeviceGetPCIBusId
                    0.00%  15.972us        30     532ns     167ns  2.4170us  cuDeviceGetAttribute
                    0.00%  9.0610us         9  1.0060us     283ns  4.6000us  cuDeviceGet
                    0.00%  3.2120us         3  1.0700us  1.0430us  1.0890us  cuModuleGetFunction
                    0.00%  2.6260us         3     875ns     707ns  1.0060us  cuCtxGetDevice
                    0.00%  2.4400us         1  2.4400us  2.4400us  2.4400us  cuDriverGetVersion
                    0.00%  2.0020us         2  1.0010us     282ns  1.7200us  cuDeviceGetCount

I found this post about generic GPU kernels very useful:
https://mikeinnes.github.io/2017/08/24/cudanative.html

You shouldn’t be launching 1540000 kernels. Just write your own CUDAnative kernel at this point, if the broadcast abstraction doesn’t cut it.
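As a rough illustration of what I mean, reusing the scaling demo from the top of the thread (a sketch only; your real gate kernels will of course look different): a single hand-written kernel can apply a whole sequence of operations per launch, and a grid-stride loop drops the assumption that the length is a multiple of the block size.

using CuArrays, CUDAnative

# One launch applies every factor to every element.
function scale_all!(X::CuArray, factors::NTuple{N,T}) where {N,T}
    function kernel(X, factors)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        stride = blockDim().x * gridDim().x
        while i <= length(X)
            @inbounds for f in factors
                X[i] *= f
            end
            i += stride
        end
        return
    end
    @cuda blocks=cld(length(X), 256) threads=256 kernel(X, factors)
    X
end

scale_all!(cu(randn(1 << 10)), (1.001f0, 1.001f0, 1.001f0))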

I was trying to deploy my program to servers, but found that the following code breaks under CUDA 9.x; it only works with CUDA 10.0.
The versions of the CUDA-related Julia packages are pinned to be the same, and the generated device code is identical.

Is this a platform related issue? How can I fix it?

function measure!(reg::GPUReg{B, T}) where {B, T}
    regm = reg |> rank3   # returns a tensor of order 3
    pl = dropdims(mapreduce(abs2, +, regm, dims=2), dims=2)
    pl_cpu = pl |> Matrix
    res = CuArray(map(ib->_measure(view(pl_cpu, :, ib), 1)[], 1:B))
    gpu_call(regm, (regm, res, pl)) do state, regm, res, pl
        k,i,j = @cartesianidx regm
        @inbounds rind = res[j] + 1
        @inbounds regm[rind,i,j] = k==rind ? regm[rind,i,j]/sqrt(pl[rind, j]) : T(0)
        return
    end
    res
end

@device_code_warntype throws the following error

ERROR: GPU compilation of #11(CuArrays.CuKernelState, CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global}, CuDeviceArray{Int64,1,CUDAnative.AS.Global}, CuDeviceArray{Float64,2,CUDAnative.AS.Global}) failed
KernelError: recursion is currently not supported

Try inspecting the generated code with any of the @device_code_... macros.

Stacktrace:
 [1] #IOBuffer#300 at iobuffer.jl:112
 [2] print_to_string at strings/io.jl:112
 [3] #IOBuffer#299 at iobuffer.jl:91
 [4] #IOBuffer#300 at iobuffer.jl:112
 [5] print_to_string at strings/io.jl:112
 [6] throw_complex_domainerror at math.jl:31
 [7] #11 at /home/jgliu/.julia/dev/CuYao/src/GPUReg.jl:57
Stacktrace:
 [1] (::getfield(CUDAnative, Symbol("#hook_emit_function#58")){CUDAnative.CompilerContext,Array{Core.MethodInstance,1}})(::Core.MethodInstance, ::Core.CodeInfo, ::UInt64) at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/compiler/irgen.jl:97
 [2] irgen(::CUDAnative.CompilerContext) at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/compiler/irgen.jl:133
 [3] #compile#78(::Bool, ::Function, ::CUDAnative.CompilerContext) at ./logging.jl:308
 [4] compile at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/compiler/driver.jl:49 [inlined]
 [5] #compile#77(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::CUDAdrv.CuDevice, ::Any, ::Any) at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/compiler/driver.jl:28
 [6] compile at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/compiler/driver.jl:16 [inlined]
 [7] macro expansion at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/execution.jl:259 [inlined]
 [8] #cufunction#89(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(cufunction), ::getfield(CuYao, Symbol("##11#13")){Complex{Float64}}, ::Type{Tuple{CuArrays.CuKernelState,CuDeviceArray{Complex{Float64},3,CUDAnative.AS.Global},CuDeviceArray{Int64,1,CUDAnative.AS.Global},CuDeviceArray{Float64,2,CUDAnative.AS.Global}}}) at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/execution.jl:234
 [9] cufunction(::Function, ::Type) at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/execution.jl:234
 [10] macro expansion at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/execution.jl:202 [inlined]
 [11] macro expansion at ./gcutils.jl:87 [inlined]
 [12] macro expansion at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/execution.jl:199 [inlined]
 [13] _gpu_call(::CuArrays.CuArrayBackend, ::Function, ::CuArray{Complex{Float64},3}, ::Tuple{CuArray{Complex{Float64},3},CuArray{Int64,1},CuArray{Float64,2}}, ::Tuple{Tuple{Int64},Tuple{Int64}}) at /home/jgliu/.julia/dev/CuArrays/src/gpuarray_interface.jl:59
 [14] gpu_call(::Function, ::CuArray{Complex{Float64},3}, ::Tuple{CuArray{Complex{Float64},3},CuArray{Int64,1},CuArray{Float64,2}}, ::Int64) at /home/jgliu/.julia/dev/GPUArrays/src/abstract_gpu_interface.jl:151
 [15] gpu_call at /home/jgliu/.julia/dev/GPUArrays/src/abstract_gpu_interface.jl:128 [inlined]
 [16] measure!(::DefaultRegister{1000,Complex{Float64},CuArray{Complex{Float64},2}}) at /home/jgliu/.julia/dev/CuYao/src/GPUReg.jl:56
 [17] top-level scope at /home/jgliu/.julia/packages/CUDAnative/jRnXY/src/reflection.jl:153 [inlined]
 [18] top-level scope at ./none:0

This line is related to the error:

@inbounds regm[rind,i,j] = k==rind ? regm[rind,i,j]/sqrt(pl[rind, j]) : T(0)

Should I post an issue on github?

Finally, I narrowed the problem down to the sqrt function, which is strange.

Fixed in Julia 1.0.2. Julia 1.0.1 triggers recursion error in `throw_overflowerr_binaryop` · Issue #278 · JuliaGPU/CUDAnative.jl · GitHub

What exactly is the issue with sqrt? You might want to consider using CUDAnative.sqrt; not all Base math ops are GPU compatible.
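Concretely, that would mean changing only the sqrt call in the line you quoted:

@inbounds regm[rind,i,j] = k==rind ? regm[rind,i,j]/CUDAnative.sqrt(pl[rind, j]) : T(0)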

It works! Thanks~

I debugged in a naive way: I removed the sqrt, and there were no complaints.

As that issue mentions, this is related to the Julia version (1.0.1 vs 1.0.2) rather than the CUDA version. I should probably also upgrade my Julia version. :stuck_out_tongue: