Stack overflow on cuda

question

#1

I’m trying to write a radix sort in Julia using CUDA, but I get a stack overflow error. I’ve reduced it to a bare-minimum kernel and am not sure what’s wrong with it.

function increment_with_block_sums(d_predicateScan::CuDeviceMatrix{T}, d_blockSumScan::CuDeviceMatrix{T}, numElems::Integer) where {T}
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    if (id <= numElems)
        d_predicateScan[id, :] = d_predicateScan[id, :] + d_blockSumScan[blockIdx().x, :]
    end
    return
    
end
function gpu_exclusive_scan(input::CuArray{T}) where {T}

    numElems, num_features = size(input)
    gridSize = trunc(Int64, ceil(numElems/blockSize))
    block_sum = CuArray{T}(gridSize, num_features)
    d_total = CuArray{T}(1, num_features)

    # @cuda (gridSize, blockSize, blockSize * num_features * sizeof(T)) partial_exclusive_scan(input, block_sum, numElems)
    # @cuda (1, blockSize, blockSize * num_features * sizeof(T)) partial_exclusive_scan(block_sum, d_total, gridSize)
    @cuda (gridSize, blockSize) increment_with_block_sums(input, block_sum, numElems)
    return
end

rows = 5
cols = 4
a = rand(Int, rows, cols)
gpu_a = CuArray(a)
gpu_exclusive_scan(gpu_a)

Julia Version 0.6.2
Commit d386e40* (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel® Core™ i5-7200U CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT NO_AFFINITY NEHALEM)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)


#2

What is your Pkg.status()? When I run your code I get the error UndefVarError: blockSize not defined


#3

Sorry, blockSize = 1024 is a global variable defined above.
I got it to work by changing the kernel to

function increment_with_block_sums(d_predicateScan::CuDeviceMatrix{T}, d_blockSumScan::CuDeviceMatrix{T}, numElems::Integer, num_features::Integer) where {T}
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    if (id <= numElems)
        for i in 1:num_features
            d_predicateScan[id, i] += d_blockSumScan[blockIdx().x, i]
        end
    end
    return
    
end

Thanks! I didn’t realize vector additions aren’t supported.


#4

Ah yes, I suspected that was the case. The problem is that d_predicateScan[id, :] creates a copy and therefore allocates. In Julia 0.7 I think we will support views on the GPU, which should let you avoid spelling out the loop.
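
To see the copy semantics without a GPU, here is a minimal CPU-only sketch in plain Julia. The matrix `A` and its values are made up for illustration. Indexing with a slice returns a new array, whereas `@view` returns a wrapper that aliases the original data:

```julia
A = [1 2 3; 4 5 6]         # illustrative 2x3 matrix, not from the original post

row_copy = A[1, :]         # slicing allocates a new Vector holding a copy of row 1
row_copy[1] = 100          # mutating the copy ...
copy_left_A_alone = (A[1, 1] == 1)     # ... leaves A untouched

row_view = @view A[1, :]   # a view aliases A's memory instead of copying
row_view[1] = 100          # mutating the view ...
view_changed_A = (A[1, 1] == 100)      # ... writes through to A
```

The allocation in the first form is the part a kernel cannot do, which is why the explicit loop over columns avoids the problem.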


#5

Thanks for clarifying. Now I have another strange error.

function partial_exclusive_scan(d_list::CuDeviceMatrix{T}, d_block_sums::CuDeviceMatrix{T}, numElems::Integer, num_features::Integer) where {T}
    

    tid = threadIdx().x
    id = blockDim().x * (blockIdx().x - 1) + threadIdx().x

    s_block_scan = @cuDynamicSharedMem(T, blockSize * num_features)

    if (id > numElems)
        for feature_id in 1:num_features
            s_block_scan[(tid - 1) * num_features + feature_id] = 0
        end
    else
        for feature_id in 1:num_features
            s_block_scan[(tid - 1) * num_features + feature_id] = d_list[id, feature_id]
        end
    end

    sync_threads()
    return
end

This produces multiple warnings about encountering incompatible LLVM IR, followed by this error: ERROR: LoadError: LLVM IR generated for partial_exclusive_scan(CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Int64,2,CUDAnative.AS.Global}, Int64, Int64) at capability 5.0.0 is not compatible.

I’m used to CUDA in C++ and am working on a school project comparing Julia and C++ on a machine learning problem, so I’m fairly new to this language. Sorry if it’s trivial :slight_smile:


#6

I had to replace blockSize with blockDim().x to make it work in the kernel :slight_smile:


#7

In general

LLVM IR generated for ... at capability 5.0.0 is not compatible.

is a sign that your kernel is using a language feature that is not supported on the GPU.
We try to tell you which language feature is being used, but we have little information about where in your code the usage comes from.

Was blockSize a non-const global?
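
The reason for asking: a non-const global can be rebound to a value of any type, so the compiler cannot infer a concrete type for code that reads it, and that kind of badly typed code tends to show up as invalid IR on the GPU. A CPU-only sketch of the difference, where the names `block_size_var`, `BLOCK_SIZE`, `uses_global`, and `uses_const` are made up for illustration:

```julia
block_size_var = 1024        # non-const global: may later be rebound to any type
const BLOCK_SIZE = 1024      # const global: its type is known to the compiler

uses_global() = block_size_var * 2
uses_const()  = BLOCK_SIZE * 2

# Ask the compiler what return type it infers for a zero-argument call.
global_ret = Base.return_types(uses_global, Tuple{})[1]   # not concrete
const_ret  = Base.return_types(uses_const, Tuple{})[1]    # concrete integer type
```

Passing the value as a kernel argument, or reading it from `blockDim().x` inside the kernel, sidesteps the question entirely.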


#8

Adding to @vchuravy’s comment, getting an invalid IR error can be expected when you use badly typed or unsupported code. However, that should never lead to a stack overflow as you mention in your post title. If that does happen, please file an issue with a minimal working example (you should also include that in Discourse posts like this one, because it’s much harder to help without a working example + details on the actual error).


#9

No, it was a global constant. Maybe it doesn’t work because it isn’t on the GPU (i.e. not passed to the kernel explicitly)?


#10

That is false.

julia> using CUDAdrv, CUDAnative

julia> const foobar = 42
42

julia> const da = CuArray([1])
1-element CuArray{Int64,1}:
 1

julia> @cuda ((a)->a[1]=foobar)(da)

julia> da
1-element CuArray{Int64,1}:
 42

(note that the above example uses some features that are only available in CUDAnative on Julia 0.7)

Speculation about what might have gone wrong without actual code doesn’t seem productive at this point. But feel free to open an issue if you encounter the issue again :slightly_smiling_face: