Synchronizing CUDA kernels

I have two CUDA functions in my code:

@cuda blocks=3 threads=numberofsegments+1 divideLine(Segment,numberofsegments+1,SegmentsCalculated)
@cuda blocks=lenX,lenY,lenZ threads=numberofsegments,2,1 biotGPU(x,y,z,SegmentsCalculated,Bx,By,Bz, Current)

I don’t want to go into details, but I am almost sure the problem is with the number of threads. However, by my calculations I use 1021 threads in total, which is within the limit (the limit is 1024, and I use 341 + 340*2 = 1021 threads).
The numberofsegments variable depends on my input, so for small enough inputs it works fine, but when numberofsegments is around 1000 the program crashes.
And this is the error:

ERROR: LoadError: CuError(701, nothing)
Stacktrace:
 [1] (::getfield(CUDAdrv, Symbol("##25#26")){Bool,Int64,CuStream,CuFunction})(::Array{Ptr{Nothing},1}) at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\base.jl:145
 [2] macro expansion at .\gcutils.jl:87 [inlined]
 [3] macro expansion at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:61 [inlined]
 [4] pack_arguments(::getfield(CUDAdrv, Symbol("##25#26")){Bool,Int64,CuStream,CuFunction}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}) at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:40
 [5] #launch#24(::Tuple{Int64,Int64,Int64}, ::Tuple{Int64,Int64,Int64}, ::Bool, ::Int64, ::CuStream, ::Function, ::CuFunction, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:90
 [6] #launch at .\none:0 [inlined]
 [7] #30 at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:179 [inlined]
 [8] macro expansion at .\gcutils.jl:87 [inlined]
 [9] macro expansion at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:139 [inlined]
 [10] convert_arguments at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:123 [inlined]
 [11] #cudacall#29 at C:\Users\Wiktor\.julia\packages\CUDAdrv\ADRHQ\src\execution.jl:178 [inlined]
 [12] #cudacall at .\none:0 [inlined]
 [13] #cudacall#160 at C:\Users\Wiktor\.julia\packages\CUDAnative\nItlk\src\execution.jl:279 [inlined]
 [14] #cudacall at .\none:0 [inlined]
 [15] macro expansion at C:\Users\Wiktor\.julia\packages\CUDAnative\nItlk\src\execution.jl:260 [inlined]
 [16] #call#148(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Int64},Tuple{Symbol,Symbol},NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::typeof(CUDAnative.call), ::CUDAnative.HostKernel{ConvertCoordinates.biotGPU,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,2,CUDAnative.AS.Global}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}) at C:\Users\Wiktor\.julia\packages\CUDAnative\nItlk\src\execution.jl:237
 [17] (::getfield(CUDAnative, Symbol("#kw##call")))(::NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::typeof(CUDAnative.call), ::CUDAnative.HostKernel{ConvertCoordinates.biotGPU,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at .\none:0
 [18] #call#163(::Base.Iterators.Pairs{Symbol,Tuple{Int64,Int64,Int64},Tuple{Symbol,Symbol},NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}}, ::CUDAnative.HostKernel{ConvertCoordinates.biotGPU,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at C:\Users\Wiktor\.julia\packages\CUDAnative\nItlk\src\execution.jl:406
 [19] (::getfield(CUDAnative, Symbol("#kw#HostKernel")))(::NamedTuple{(:blocks, :threads),Tuple{Tuple{Int64,Int64,Int64},Tuple{Int64,Int64,Int64}}}, ::CUDAnative.HostKernel{ConvertCoordinates.biotGPU,Tuple{CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,1,CUDAnative.AS.Global}}}, ::CuDeviceArray{Float32,1,CUDAnative.AS.Global}, ::Vararg{Any,N} where N) at .\none:0
 [20] macro expansion at .\gcutils.jl:87 [inlined]
 [21] macro expansion at C:\Users\Wiktor\.julia\packages\CUDAnative\nItlk\src\execution.jl:171 [inlined]
 [22] PrepareArrangement(::Base.RefValue{Bool}) at c:\Users\Wiktor\MagneticField3DGPUVersionReal\src\generateMap.jl:102
 [23] MainMenu(::Base.RefValue{Bool}) at c:\Users\Wiktor\MagneticField3DGPUVersionReal\src\mainMenu.jl:42
 [24] top-level scope at c:\Users\Wiktor\MagneticField3DGPUVersionReal\src\MagneticField3D.jl:102
 [25] include_string(::Module, ::String, ::String) at .\loading.jl:1008
 [26] (::getfield(Main._vscodeserver, Symbol("##9#12")){String,Int64,Int64,String})() at c:\Users\Wiktor\.vscode\extensions\julialang.language-julia-0.12.2\scripts\terminalserver\terminalserver.jl:153
 [27] withpath(::getfield(Main._vscodeserver, Symbol("##9#12")){String,Int64,Int64,String}, ::String) at c:\Users\Wiktor\.vscode\extensions\julialang.language-julia-0.12.2\scripts\terminalserver\repl.jl:62
 [28] (::getfield(Main._vscodeserver, Symbol("##8#11")){String,Int64,Int64,String})() at c:\Users\Wiktor\.vscode\extensions\julialang.language-julia-0.12.2\scripts\terminalserver\terminalserver.jl:152
 [29] hideprompt(::getfield(Main._vscodeserver, Symbol("##8#11")){String,Int64,Int64,String}) at c:\Users\Wiktor\.vscode\extensions\julialang.language-julia-0.12.2\scripts\terminalserver\repl.jl:28
 [30] macro expansion at c:\Users\Wiktor\.vscode\extensions\julialang.language-julia-0.12.2\scripts\terminalserver\terminalserver.jl:148 [inlined]
 [31] (::getfield(Main._vscodeserver, Symbol("##7#10")))() at .\task.jl:259
in expression starting at c:\Users\Wiktor\MagneticField3DGPUVersionReal\src\MagneticField3D.jl:89

Line 102 referenced in stack frame [22] is the line with the second kernel call.

I checked everything and I am not sure it is a problem with the number of threads. For 961 threads (numberofsegments=320) it works perfectly. So the second function must be run only once all the threads used by the first function are free again, but to be honest I have a problem with achieving that. I tried

CuArrays.@sync begin
@cuda blocks=3 threads=numberofsegments+1 divideLine(Segment,numberofsegments+1,SegmentsCalculated)
@cuda blocks=lenX,lenY,lenZ threads=numberofsegments,1,1 biotGPU(x,y,z,SegmentsCalculated,Bx,By,Bz, Current)
end

but it didn’t work. I know this is probably something elementary, but I am stuck on it.
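(For reference, an explicit host-side barrier between the two launches would look like the sketch below: synchronize() from CUDAdrv blocks the host until all previously submitted device work has finished.)

@cuda blocks=3 threads=numberofsegments+1 divideLine(Segment,numberofsegments+1,SegmentsCalculated)
CUDAdrv.synchronize()  # wait until divideLine has completed before launching biotGPU
@cuda blocks=lenX,lenY,lenZ threads=numberofsegments,2,1 biotGPU(x,y,z,SegmentsCalculated,Bx,By,Bz,Current)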

Which version of CUDAdrv/Julia are you using? You should never see CuError(701, nothing), but ERROR_LAUNCH_OUT_OF_RESOURCES instead.

So yeah, you’re exhausting resources of the GPU (either in terms of registers, shared memory, or plainly exceeding the maximum number of threads or blocks you’re allowed to launch in each direction). You can use the APIs to query these limits.

Properties of the device:

julia> using CUDAdrv, CUDAnative

julia> dev = device()
CuDevice(0): GeForce GTX 970

julia> attribute(dev, CUDAdrv.MAX_BLOCK_DIM_X)
1024
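Other limits can presumably be queried the same way, assuming the remaining attribute names follow the same pattern as MAX_BLOCK_DIM_X:

# assumed attribute names, mirroring MAX_BLOCK_DIM_X above
attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)  # total threads per block
attribute(dev, CUDAdrv.MAX_BLOCK_DIM_Y)        # per-dimension block limits
attribute(dev, CUDAdrv.MAX_GRID_DIM_X)         # per-dimension grid limits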

Properties of a compiled kernel:

julia> function vadd(a, b, c)
           i = (blockIdx().x-1) * blockDim().x + threadIdx().x
           c[i] = a[i] + b[i]
           return
       end
vadd (generic function with 1 method)

julia> kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
[ Info: Building the CUDAnative run-time library for your sm_52 device, this might take a while...
CUDAnative.HostKernel{vadd,Tuple{CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global},CuDeviceArray{Float32,2,CUDAnative.AS.Global}}}(CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true), CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true)), CuFunction(Ptr{Nothing} @0x0000000004464aa0, CuModule(Ptr{Nothing} @0x00000000043ac7c0, CuContext(Ptr{Nothing} @0x0000000001aa00b0, false, true))))

julia> CUDAnative.registers(kernel)
22

julia> CUDAnative.memory(kernel)
(local = 104, shared = 0, constant = 0)

julia> CUDAnative.maxthreads(kernel)
1024
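Putting these together, a launch can be sanity-checked against the compiled kernel’s own limit before it is issued (a minimal sketch reusing the vadd kernel compiled above):

threads = 1021  # the block size the launch would request
if threads > CUDAnative.maxthreads(kernel)
    # register and local-memory usage can cap this below the device-wide limit
    error("vadd supports at most $(CUDAnative.maxthreads(kernel)) threads per block")
end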

Or simply use the occupancy API to have CUDA pick a number of threads, and update your indexing to be able to handle that:

# adjust the kernel to perform a bounds check
function vadd(a, b, c)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(c)
        c[i] = a[i] + b[i]
    end
    return
end

# generate data
...

# old hardcoded launch
len = prod(dims)
#@cuda threads=len vadd(d_a, d_b, d_c)

# instead use a callback to query the allowed & optimal number of threads
function get_config(kernel)
    fun = kernel.fun
    config = launch_configuration(fun)

    blocks = cld(len, config.threads)
    return (threads=config.threads, blocks=blocks)
end
@cuda config=get_config vadd(d_a, d_b, d_c)
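Assuming the same API, the callback can also be skipped by querying the configuration once and calling the compiled kernel object directly (the stack trace above shows that HostKernel objects accept blocks/threads keywords):

kernel = cufunction(vadd, NTuple{3,CuDeviceArray{Float32,2,AS.Global}})
config = launch_configuration(kernel.fun)
threads = min(len, config.threads)  # never request more threads than elements
blocks = cld(len, threads)          # round up so every element is covered
kernel(d_a, d_b, d_c; threads=threads, blocks=blocks)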

Sorry that I didn’t write back for so long. As far as the device properties go, I know all of them already. In case anybody is interested:

using CUDAdrv

println("Name of device: $(CuDevice(0))")
println("Total amount of memory on the device: $(totalmem(CuDevice(0)))")

# print every queryable device attribute
for i = 1:85
    println("$(CUDAdrv.CUdevice_attribute(i)): $(attribute(CuDevice(0), CUDAdrv.CUdevice_attribute(i)))")
end

As for my problem: the goal is to be able to use any number of threads (of course not all in one function). I want to run one function, and once it has finished its calculations and freed its threads, run the second function.
It must be possible, because I already run these two functions in a for loop, so it should work (see also the sketch after the pseudocode below).
Pseudocode:

a = [rand(4), rand(4), rand(4)]
c = [rand(4), rand(4), rand(4)]
Table = CuArray{Float32}(undef, lengthOfTable*length(a))
for i = 1:length(a)
    b = cu(a[i])
    d = cu(c[i])
    @cuda blocks=numberOfBlocks threads=numberOfThreads someFunction(Table, b)
    @cuda blocks=numberOfBlocks threads=numberOfThreads someSecondFunction(Table, d)
    # these (hypothetical) functions modify only the Table array,
    # but they need the b and d arrays in their calculations
end
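In case it is useful, here is a minimal runnable version of this pattern (with hypothetical kernels addone and twice standing in for someFunction/someSecondFunction): launches on the same stream execute in launch order, so the second kernel only starts once the first has finished; the host just needs an explicit synchronize() before reading the results back.

using CuArrays, CUDAnative, CUDAdrv

function addone(tab)  # hypothetical stand-in for someFunction
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(tab)
        tab[i] += 1f0
    end
    return
end

function twice(tab)  # hypothetical stand-in for someSecondFunction
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= length(tab)
        tab[i] *= 2f0
    end
    return
end

Table = cu(zeros(Float32, 1024))
@cuda blocks=4 threads=256 addone(Table)  # first kernel
@cuda blocks=4 threads=256 twice(Table)   # same stream: starts after addone is done
CUDAdrv.synchronize()                     # host-side barrier before the copy back
Array(Table)                              # every entry should be 2.0f0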

And it works for numberOfThreads=512 or less (max threads = 1024). In my project even 1020 threads (that is, 340*2 = 680 in the first function plus 340 in the second) didn’t work, but 960 threads did, and that worked even in the loop.

I found something interesting.
I tried to write a function, but it doesn’t work.

function bench_biot(x,y,z,Segment, Bx,By,Bz, Current,numberofsegments,SegmentsCalculated)
  CuArrays.@sync begin
    @cuda blocks=3 threads=numberofsegments+1 divideLine(Segment,numberofsegments+1,SegmentsCalculated)
    @cuda blocks=length(x),length(y),length(z) threads=numberofsegments,1,1 biotGPU(x,y,z,SegmentsCalculated,Bx,By,Bz, Current)
  end
end

However, I found that it works for numberofsegments=640, even though 641 + 640 = 1281 > 1024.

What is interesting about this? Read my comment again: there are two factors at play, device properties and kernel resources such as register usage. You can create a kernel that can only be launched with at most 1 thread if it uses enough registers. Use the occupancy API or kernel introspection to figure that out.
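For completeness, the kernel introspection from earlier applied to the failing kernel would look roughly like this (a sketch; the argument types for biotGPU are read off the stack trace above):

argtypes = Tuple{CuDeviceArray{Float32,1,AS.Global},CuDeviceArray{Float32,1,AS.Global},
                 CuDeviceArray{Float32,1,AS.Global},CuDeviceArray{Float32,1,AS.Global},
                 CuDeviceArray{Float32,2,AS.Global},CuDeviceArray{Float32,2,AS.Global},
                 CuDeviceArray{Float32,2,AS.Global},CuDeviceArray{Float32,1,AS.Global}}
kernel = cufunction(biotGPU, argtypes)
CUDAnative.registers(kernel)   # registers used per thread
CUDAnative.maxthreads(kernel)  # if this is below the requested block size, error 701 is expected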