Stack overflow on cuda

swethmandava · May 5, 2018, 7:38pm

I’m trying to write a radix sort in Julia using CUDA, but I get a stack overflow error. I’ve reduced it to a bare-minimum kernel and am not sure what’s wrong with it.

function increment_with_block_sums(d_predicateScan::CuDeviceMatrix{T}, d_blockSumScan::CuDeviceMatrix{T}, numElems::Integer) where {T}
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    if (id <= numElems)
        d_predicateScan[id, :] = d_predicateScan[id, :] + d_blockSumScan[blockIdx().x, :]
    end
    return
    
end
function gpu_exclusive_scan(input::CuArray{T}) where {T}

    numElems, num_features = size(input)
    gridSize = trunc(Int64, ceil(numElems/blockSize))
    block_sum = CuArray{T}(gridSize, num_features)
    d_total = CuArray{T}(1, num_features)

    # @cuda (gridSize, blockSize, blockSize * num_features * sizeof(T)) partial_exclusive_scan(input, block_sum, numElems)
    # @cuda (1, blockSize, blockSize * num_features * sizeof(T)) partial_exclusive_scan(block_sum, d_total, gridSize)
    @cuda (gridSize, blockSize) increment_with_block_sums(input, block_sum, numElems)
    return
end

rows = 5
cols = 4
a = rand(Int, rows, cols)
gpu_a = CuArray(a)
gpu_exclusive_scan(gpu_a)

Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

vchuravy · May 5, 2018, 7:51pm

What is your Pkg.status()? When I run your code I get the error UndefVarError: blockSize not defined

swethmandava · May 5, 2018, 8:06pm

Sorry, blockSize = 1024 is a global variable defined above.
I got it to work by changing the kernel to:

function increment_with_block_sums(d_predicateScan::CuDeviceMatrix{T}, d_blockSumScan::CuDeviceMatrix{T}, numElems::Integer, num_features::Integer) where {T}
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    if (id <= numElems)
        for i in 1:num_features
            d_predicateScan[id, i] += d_blockSumScan[blockIdx().x, i]
        end
    end
    return
    
end

Thanks! I didn’t realize vector additions aren’t supported.

vchuravy · May 5, 2018, 8:18pm

Ah yes, I suspected that was the case. The problem is that d_predicateScan[id, :] creates a copy and thereby allocates. In Julia 0.7 I think we support views on the GPU, so that should let you avoid spelling out the loop.
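
The copy-versus-view distinction described above can be seen on the CPU with ordinary arrays. The following is only a minimal sketch in plain Julia, not GPU code: the names a, b, row_copy, and row_view are made up for illustration, and it says nothing about which CUDAnative versions accept view inside a kernel.

```julia
# CPU-only illustration; all names here are hypothetical.
a = zeros(Int, 3, 2)
b = ones(Int, 1, 2)

# Indexing with a colon on the right-hand side allocates a new Vector.
row_copy = a[1, :]
row_copy[1] = 7          # changes the copy only; a[1, 1] is still 0

# A view aliases the parent array's memory instead of copying it.
row_view = view(a, 1, :)
row_view[1] = 99         # writes through to a[1, 1]

# The explicit loop from the working kernel touches one scalar at a time,
# so no temporary array is needed.
for i in 1:size(a, 2)
    a[2, i] += b[1, i]
end
```

The loop form is the one that worked in the kernel earlier in the thread, since it never builds a temporary row.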

swethmandava · May 5, 2018, 8:42pm

Thanks for clarifying. Now I have another strange error.

function partial_exclusive_scan(d_list::CuDeviceMatrix{T}, d_block_sums::CuDeviceMatrix{T}, numElems::Integer, num_features::Integer) where {T}
    

    tid = threadIdx().x
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    s_block_scan = @cuDynamicSharedMem(T, blockSize * num_features)

    if (id > numElems)
        for feature_id in 1:num_features
            s_block_scan[(tid - 1) * num_features + feature_id] = 0
        end
    else
        for feature_id in 1:num_features
            s_block_scan[(tid - 1) * num_features + feature_id] = d_list[id, feature_id]
        end
    end

    sync_threads()
    return
end

This produces multiple warnings about encountering incompatible LLVM IR, followed by this error: ERROR: LoadError: LLVM IR generated for partial_exclusive_scan(CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, Int64, Int64) at capability 5.0.0 is not compatible.

I’m used to CUDA in C++ and am working on a school project comparing Julia and C++ on a machine learning problem, so I’m fairly new to this language. Sorry if it’s trivial.

swethmandava · May 6, 2018, 12:56am

I had to replace blockSize with blockDim().x to make it work in the kernel.

vchuravy · May 6, 2018, 3:30pm

In general

LLVM IR generated for ... at capability 5.0.0 is not compatible.

is a sign that your kernel is using a language feature that is not supported on the GPU.
We try to tell you which language feature is being used, but we have little information about where in your code the usage comes from.

Was blockSize a non-const global?
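
For context on why the const question matters: a non-const global can be rebound to a value of any type, so Julia cannot infer a concrete type for code that reads it, and badly typed code is one way to end up with incompatible IR. The sketch below shows only the inference difference on the CPU. The names nonconst_size, const_size, uses_nonconst, and uses_const are hypothetical, and whether this was the cause in this thread is not established.

```julia
# CPU-only illustration; all names here are hypothetical.
nonconst_size = 1024          # plain global: its type may change later
const const_size = 1024       # const global: type is fixed as Int

uses_nonconst() = nonconst_size * 2
uses_const() = const_size * 2

# Ask the compiler what return type it infers for each zero-argument method.
inferred_nonconst = Base.return_types(uses_nonconst, ())[1]
inferred_const = Base.return_types(uses_const, ())[1]

println("non-const global: ", inferred_nonconst)   # Any
println("const global:     ", inferred_const)      # Int64 on a 64-bit system
```

Passing the value to the kernel as an argument, or reading blockDim().x as was done above, avoids relying on a global at all.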

maleadt · May 8, 2018, 7:14am

Adding to @vchuravy’s comment, getting an invalid IR error can be expected when you use badly typed or unsupported code. However, that should never lead to a stack overflow as you mention in your post title. If that does happen, please file an issue with a minimal working example (you should also include that in Discourse posts like this one, because it’s much harder to help without a working example + details on the actual error).

swethmandava · May 9, 2018, 6:06am

No, it was a global constant. Maybe it doesn’t work because it isn’t on the GPU (that is, not passed to the kernel explicitly)?

maleadt · May 9, 2018, 8:02am

That is false.

julia> using CUDAdrv, CUDAnative

julia> const foobar = 42
42

julia> const da = CuArray([1])
1-element CuArray{Int64,1}:
 1

julia> @cuda ((a)->a[1]=foobar)(da)

julia> da
1-element CuArray{Int64,1}:
 42

(note that the above example uses some features that are only available in CUDAnative on Julia 0.7)

Speculation about what might have gone wrong without actual code doesn’t seem productive at this point. But feel free to open an issue if you encounter the problem again.
