Compiling kernel resulted in invalid LLVM IR Reason: unsupported dynamic function invocation

So I am trying to write an extreme value / order statistics package for GPUs. The idea is that taking the maximum of a large number of random variables many times, in order to obtain the distribution of the maximum, should lend itself well to parallelization.

But to allow for arbitrary distributions, I want my function to accept a pseudoinverse. So I attempted this:

import CUDA
import CUDA: @cuda

function sample_extreme_values(sampleSize, superSampleSize, pseudoInverse)::Float32
    numblocks = ceil(Int, superSampleSize/256)
    gpu_res = CUDA.CuArray{Float32}(undef, superSampleSize)
    cpu_res = Array{Float32}(undef, superSampleSize)
    @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, pseudoInverse, sampleSize)
    copyto!(cpu_res, gpu_res)
    return cpu_res
end

function gpu_parallel!(results, pseudoinverse, sampleSize)
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread  = index:stride:length(results)
        generator = RVGenerator(pseudoinverse, sampleSize)
        results[thread] = largest(generator)
    end
end


function largest(generator)
    largest = Float32(0)
    for rv = generator
        largest = max(largest, rv)
    end
    return largest
end


# Implementation of Widynski, Bernard (2020). 
# "Squares: A Fast Counter-Based RNG". arXiv:2004.06278v2

# key (seed) taken from keys.h (line 2193) in the distribution of Squares
# see https://squaresrng.wixsite.com/rand

# this distribution also includes a generator for these keys - eventually
# this hardcoded key should be replaced with such a generator 
# (one key generates 2^64 random numbers)
key = 0x86d47f132b79acfd

@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy+1)
    # >> arithmetic rightshift, >>> logical rightshift 
    # (most C Impl.: >> arithm on signed, logical on unsigned)
    # << logical/arithmetic leftshift
    xx = (xx >>> 32) | (xx << 32) 
    xx = xx*xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx*xx + yy) >> 32)
end

struct RVGenerator
    pseudoInverse
    stop::UInt64
end

function Base.iterate(rvg::RVGenerator, state::UInt64=UInt64(0))
    if rvg.stop >= state
        return (rvg.pseudoInverse(Float32(squares_rng(state, key))/typemax(UInt32)), state+1)
    else
        return nothing
    end
end

which I called like this:

sample_extreme_values(100,1000, x->x)

But this results in a giant “dynamic function invocation” stacktrace:

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#31#32", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to squares_rng)
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
 [1] largest at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:28
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13
Reason: unsupported dynamic function invocation (call to max)
Stacktrace:
 [1] largest at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:29
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex! at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\device\array.jl:101
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13

I am a bit surprised about this, as I thought that Julia would compile Julia functions into __device__ functions which can then be called within the kernel.

Is that wrong? If so, what CAN Julia do? What can I expect to work? What can I work with?

Hello @FelixBenning, there are technical limitations on dynamic function execution on the GPU,
so the GPU compiler errors out when it sees code with dynamic type instabilities. Looking at your
code, it seems that pseudoInverse has no type declaration.

A good way to figure out these issues is to use CUDA.@device_code_typed, and if you install Cthulhu, you can use the interactive mode to find your instabilities.
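For reference, a hedged sketch of how such a diagnostic macro might be invoked, assuming the gpu_parallel! definition and argument names from the code above (untested here, since it needs a CUDA-capable device):

```julia
using CUDA

gpu_res = CUDA.CuArray{Float32}(undef, 1000)
# Wrapping the launch prints the typed device-side code for the kernel,
# so type instabilities (shown as ::Any) can be spotted:
CUDA.@device_code_warntype @cuda threads=256 gpu_parallel!(gpu_res, x -> x, 100)
```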

How can I give a type declaration to a function? I cannot find a Callable{(Float32),Float32} type (i.e. a callable which accepts a Float32 and returns a Float32).

Make your struct parametric.


@maleadt this?

struct RVGenerator{T}
    pseudoInverse::T
    stop::UInt64
end

That reduced the list of red reasons to this:

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#39#40", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:12
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\validation.jl:123
 [2] macro expansion at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:241 [inlined]
 [3] macro expansion at C:\Users\felix\.julia\packages\TimerOutputs\dVnaw\src\TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:239
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:39
 [6] compile at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:35 [inlined]
 [7] _cufunction(::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:310
 [8] _cufunction at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:304 [inlined]
 [9] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:24
 [10] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:9 [inlined]
 [11] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:0
 [12] cached_compilation at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:40 [inlined]
 [13] cufunction(::typeof(Main.OrderStatistics.gpu_parallel!), ::Type{Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:298
 [14] cufunction(::typeof(Main.OrderStatistics.gpu_parallel!), ::Type{Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:293
 [15] macro expansion at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:109 [inlined]
 [16] sample_extreme_values(::Int64, ::Int64, ::Function) at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:21
 [17] top-level scope at REPL[54]:1

I also tried

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)

as a guess at what to do with @vchuravy's suggestion, but that did not change anything.

@vchuravy

CUDA.@device_code_typed @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, x->x, sampleSize)

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#43#44", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)

I am not sure how to use this macro…

In combination with Tim’s suggestion, you also need to make sure that the kernel specializes on the function:


function gpu_parallel!(results, pseudoinverse::F, sampleSize) where F
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread  = index:stride:length(results)
        generator = RVGenerator(pseudoinverse, sampleSize)
        results[thread] = largest(generator)
    end
end

@vchuravy this did not change anything. I am still getting

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)
ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#3#4", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:12

I think I will reduce the complexity a bit in a branch and see if I can get it to work with less functional programming. I guess I will throw out the iterator first - the idea there was that you could eventually pass a function other than largest, which would accept a generator and return the values you want at the end. That way you could, for example, write a function which picks out the n-th largest and the largest element.

So if it is possible, I would still love to get it to work. But I just don’t have enough intuition about Julia yet to make progress on my own. (Maybe this wasn’t the best idea for a first project :sweat_smile:)

This simplified version:

module OrderStatistics

import CUDA
import CUDA: @cuda

# Implementation of Widynski, Bernard (2020). 
# "Squares: A Fast Counter-Based RNG". arXiv:2004.06278v2

# key (seed) taken from keys.h (line 2193) in the distribution of Squares
# see https://squaresrng.wixsite.com/rand

# this distribution also includes a generator for these keys - eventually
# this hardcoded key should be replaced with such a generator 
# (one key generates 2^64 random numbers)
key = 0x86d47f132b79acfd
@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy+1)
    # >> arithmetic rightshift, >>> logical rightshift 
    # (most C Impl.: >> arithm on signed, logical on unsigned)
    # << logical/arithmetic leftshift
    xx = (xx >>> 32) | (xx << 32) 
    xx = xx*xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx*xx + yy) >> 32)
end

function gpu_parallel!(results, pseudoinverse::F, sampleSize) where F
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread in index:stride:length(results)
        result = 0
        for idx in 1:sampleSize
            result = max(result, pseudoinverse(Float32(squares_rng(idx, key))/typemax(UInt32)))
        end
        results[thread] = result 
    end
end

function sample_extreme_values(sampleSize, superSampleSize, pseudoInverse)::Float32
    numblocks = ceil(Int, superSampleSize/256)
    gpu_res = CUDA.CuArray{Float32}(undef, superSampleSize)
    cpu_res = Array{Float32}(undef, superSampleSize)
    @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, pseudoInverse, sampleSize)
    copyto!(cpu_res, gpu_res)
    return cpu_res
end

end # module

still does not work:

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)
ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#15#16", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to squares_rng)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:36

I also tried removing the @inline macro, but that did not change the error message. Given that squares_rng is fully typed and only uses basic operations, it should really be compiled as a __device__ function. So I am scratching my head here…

(oh, and I know that every thread currently produces the same random variables - I will have to fix that eventually with offsets but I didn’t want to deal with that right now)

EDIT: I removed the pseudoinverse for now, no effect

You should follow @vchuravy’s advice:

Doing so, it’s immediately clear that the call to squares_rng is not inferred:

julia> @device_code_warntype interactive=true main()
   call #squares_rng(::Int64,::Any)::Union{}

Note the ::Any: you’re using an untyped global variable (see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables). Furthermore, your first argument idx is of type Int64, while you explicitly define squares_rng to accept only UInt64.

Fixing both then leads to the issue that your sample_extreme_values function is declared to return ::Float32 while you actually return an Array{Float32}.
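To make those fixes concrete, here is a hedged CPU-side sketch, reusing the squares_rng definition and the key constant posted earlier in the thread (the helper max_uniform is hypothetical, just to show the corrected call):

```julia
# Fix 1: an untyped global infers as ::Any; `const` gives it a concrete type.
const key = 0x86d47f132b79acfd  # UInt64 literal

# squares_rng as posted earlier in the thread
@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy + 1)
    xx = (xx >>> 32) | (xx << 32)
    xx = xx * xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx * xx + yy) >> 32)
end

# Fix 2: the loop counter is Int64, so convert it before calling the
# method that only accepts UInt64.
function max_uniform(sampleSize)
    result = 0.0f0  # Float32 literal avoids an Int/Float32 instability
    for idx in 1:sampleSize
        u = Float32(squares_rng(UInt64(idx), key)) / typemax(UInt32)
        result = max(result, u)
    end
    return result
end
```

(The third issue, the ::Float32 annotation on sample_extreme_values, is fixed by changing it to ::Array{Float32} or simply dropping it.)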

If your code is functional and well-typed, then that’s exactly what CUDA.jl does :slightly_smiling_face:


@maleadt thank you! The global variable was the missing piece, and the wrong return type produced a nice error message. I am still unsure how to use CUDA.@device_code_typed, though. I googled it, but the first result about it is this thread, and I cannot find any documentation on it. Prepending it to the call with the @cuda annotation did not change the error message, nor did it provide any other useful output.

I did not have time to try out Cthulhu, though; I’ll definitely try it out sometime. I am guessing that @device_code_warntype is part of the Cthulhu package? Also, what is main()?

@device_code_typed is like @code_typed, as the docstring suggests, so you should be getting the necessary info from there. Same with @device_code_warntype/@code_warntype; please consult the regular Julia docs for those. main() is just the function I put your top-level code in so I could Revise it.
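The CPU-side analogue can be tried without a GPU; here is a minimal hypothetical example of the untyped-global problem that @code_warntype (and hence @device_code_warntype) flags, with made-up names:

```julia
# An untyped global can be reassigned to any type, so code using it
# cannot be inferred to a concrete type:
untyped_key = 0x86d47f132b79acfd
f_untyped() = untyped_key * UInt64(2)

# A const global has a fixed concrete type, so the same code infers as UInt64:
const typed_key = 0x86d47f132b79acfd
f_typed() = typed_key * UInt64(2)

# @code_warntype f_untyped()  # highlights the ::Any
# @code_warntype f_typed()    # fully inferred
```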