Compiling kernel resulted in invalid LLVM IR Reason: unsupported dynamic function invocation

So I am trying to write an extreme value / order statistics package for GPUs. The idea is that taking the maximum of a large number of random variables many times, in order to obtain the distribution of the maximum, should lend itself well to parallelization.

But to allow for arbitrary distributions, I want my function to accept a pseudoinverse. So I attempted this:

import CUDA
import CUDA: @cuda

function sample_extreme_values(sampleSize, superSampleSize, pseudoInverse)::Float32
    numblocks = ceil(Int, superSampleSize/256)
    gpu_res = CUDA.CuArray{Float32}(undef, superSampleSize)
    cpu_res = Array{Float32}(undef, superSampleSize)
    @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, pseudoInverse, sampleSize)
    copyto!(cpu_res, gpu_res)
    return cpu_res
end

function gpu_parallel!(results, pseudoinverse, sampleSize)
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread  = index:stride:length(results)
        generator = RVGenerator(pseudoinverse, sampleSize)
        results[thread] = largest(generator)
    end
end


function largest(generator)
    largest = Float32(0)
    for rv = generator
        largest = max(largest, rv)
    end
    return largest
end


# Implementation of Widynski, Bernard (2020). 
# "Squares: A Fast Counter-Based RNG". arXiv:2004.06278v2

# key (seed) taken from keys.h (line 2193) in the distribution of Squares
# see https://squaresrng.wixsite.com/rand

# this distribution also includes a generator for these keys - eventually
# this hardcoded key should be replaced with such a generator 
# (one key generates 2^64 random numbers)
key = 0x86d47f132b79acfd

@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy+1)
    # >> arithmetic rightshift, >>> logical rightshift 
    # (most C Impl.: >> arithm on signed, logical on unsigned)
    # << logical/arithmetic leftshift
    xx = (xx >>> 32) | (xx << 32) 
    xx = xx*xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx*xx + yy) >> 32)
end

struct RVGenerator
    pseudoInverse
    stop::UInt64
end

function Base.iterate(rvg::RVGenerator, state::UInt64=UInt64(0))
    if rvg.stop >= state
        return (rvg.pseudoInverse(Float32(squares_rng(state, key))/typemax(UInt32)), state+1)
    else
        return nothing
    end
end

which I called like this:

sample_extreme_values(100,1000, x->x)

But this results in a giant “dynamic function invocation” stacktrace:

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#31#32", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to squares_rng)
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported dynamic function invocation
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported call to the Julia runtime (call to jl_f_tuple)
Stacktrace:
 [1] iterate at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\rngSquares.jl:45
 [2] multiple call sites at unknown:0
Reason: unsupported call to the Julia runtime (call to jl_f_getfield)
Stacktrace:
 [1] largest at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:28
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13
Reason: unsupported dynamic function invocation (call to max)
Stacktrace:
 [1] largest at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:29
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13
Reason: unsupported dynamic function invocation (call to convert)
Stacktrace:
 [1] setindex! at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\device\array.jl:101
 [2] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:13

I am a bit surprised about this, as I thought that Julia would compile Julia functions into __device__ functions which can then be called within the kernel.

Is that wrong? If so, what CAN Julia do? What can I expect to work? What can I work with?

Hello @FelixBenning, there are technical limitations on dynamic function execution on the GPU,
so the GPU compiler errors out when it sees code with dynamic type instabilities. Looking at your
code, it seems that pseudoInverse has no type declaration.

A good way to figure out these issues is to use CUDA.@device_code_typed, and if you install Cthulhu, you can use the interactive mode to find your instabilities.
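For reference, a hedged sketch of how such a diagnostic macro might be invoked, assuming the gpu_parallel! definition and argument names from the code above (untested here, since it needs a CUDA-capable device):

```julia
using CUDA

gpu_res = CUDA.CuArray{Float32}(undef, 1000)
# Wrapping the launch prints the typed device-side code for the kernel,
# so type instabilities (shown as ::Any) can be spotted:
CUDA.@device_code_warntype @cuda threads=256 gpu_parallel!(gpu_res, x -> x, 100)
```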

How can I give a type declaration to a function? I cannot find a Callable{(Float32),Float32} type (i.e. a callable which accepts a Float32 and returns a Float32).

Make your struct parametric.


@maleadt this?

struct RVGenerator{T}
    pseudoInverse::T
    stop::UInt64
end

That reduced the list of red reasons to this:

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#39#40", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:12
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\validation.jl:123
 [2] macro expansion at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:241 [inlined]
 [3] macro expansion at C:\Users\felix\.julia\packages\TimerOutputs\dVnaw\src\TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:239
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:39
 [6] compile at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\driver.jl:35 [inlined]
 [7] _cufunction(::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:310
 [8] _cufunction at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:304 [inlined]
 [9] check_cache(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:24
 [10] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:9 [inlined]
 [11] cached_compilation(::typeof(CUDA._cufunction), ::GPUCompiler.FunctionSpec{typeof(Main.OrderStatistics.gpu_parallel!),Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:0
 [12] cached_compilation at C:\Users\felix\.julia\packages\GPUCompiler\GKp4B\src\cache.jl:40 [inlined]
 [13] cufunction(::typeof(Main.OrderStatistics.gpu_parallel!), ::Type{Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:298
 [14] cufunction(::typeof(Main.OrderStatistics.gpu_parallel!), ::Type{Tuple{CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global},var"#39#40",Int64}}) at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:293
 [15] macro expansion at C:\Users\felix\.julia\packages\CUDA\dZvbp\src\compiler\execution.jl:109 [inlined]
 [16] sample_extreme_values(::Int64, ::Int64, ::Function) at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:21
 [17] top-level scope at REPL[54]:1

I also tried

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)

as a guess at what to do with @vchuravy's suggestion, but that did not change anything.

@vchuravy

CUDA.@device_code_typed @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, x->x, sampleSize)

ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#43#44", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)

I am not sure how to use this macro…

In combination with Tim’s suggestion, you also need to make sure that the kernel specializes on the function:


function gpu_parallel!(results, pseudoinverse::F, sampleSize) where F
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread  = index:stride:length(results)
        generator = RVGenerator(pseudoinverse, sampleSize)
        results[thread] = largest(generator)
    end
end

@vchuravy this did not change anything. I am still getting

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)
ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#3#4", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to Main.OrderStatistics.RVGenerator)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:12

I think I will reduce the complexity a bit in a branch and see if I can get it to work with less functional programming. I guess I will throw out the iterator first - the idea there was that you could eventually pass a function other than largest, which would accept a generator and return the values you want at the end. That way you could, for example, write a function which picks out the n-th largest and the largest element.

So if it is possible, I would still love to get it to work. But I just don’t have enough intuition about Julia yet to make progress on my own. (Maybe this wasn’t the best idea for a first project :sweat_smile:)

This simplified version:

module OrderStatistics

import CUDA
import CUDA: @cuda

# Implementation of Widynski, Bernard (2020). 
# "Squares: A Fast Counter-Based RNG". arXiv:2004.06278v2

# key (seed) taken from keys.h (line 2193) in the distribution of Squares
# see https://squaresrng.wixsite.com/rand

# this distribution also includes a generator for these keys - eventually
# this hardcoded key should be replaced with such a generator 
# (one key generates 2^64 random numbers)
key = 0x86d47f132b79acfd
@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy+1)
    # >> arithmetic rightshift, >>> logical rightshift 
    # (most C Impl.: >> arithm on signed, logical on unsigned)
    # << logical/arithmetic leftshift
    xx = (xx >>> 32) | (xx << 32) 
    xx = xx*xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx*xx + yy) >> 32)
end

function gpu_parallel!(results, pseudoinverse::F, sampleSize) where F
    index = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + CUDA.threadIdx().x
    stride = CUDA.blockDim().x * CUDA.gridDim().x
    for thread in index:stride:length(results)
        result = 0
        for idx in 1:sampleSize
            result = max(result, pseudoinverse(Float32(squares_rng(idx, key))/typemax(UInt32)))
        end
        results[thread] = result 
    end
end

function sample_extreme_values(sampleSize, superSampleSize, pseudoInverse)::Float32
    numblocks = ceil(Int, superSampleSize/256)
    gpu_res = CUDA.CuArray{Float32}(undef, superSampleSize)
    cpu_res = Array{Float32}(undef, superSampleSize)
    @cuda threads=256 blocks=numblocks gpu_parallel!(gpu_res, pseudoInverse, sampleSize)
    copyto!(cpu_res, gpu_res)
    return cpu_res
end

end # module

still does not work:

OrderStatistics.sample_extreme_values(100,1000, x::Float32->x::Float32)
ERROR: InvalidIRError: compiling kernel gpu_parallel!(CUDA.CuDeviceArray{Float32,1,CUDA.AS.Global}, var"#15#16", Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to squares_rng)
Stacktrace:
 [1] gpu_parallel! at D:\Google Drive\CodingPlayground\julia\OrderStatistics\src\OrderStatistics.jl:36

I also tried removing the @inline macro, but that did not change the error message. Given that squares_rng is fully typed and only uses basic operations, it should really be compiled as a __device__ function. So I am scratching my head here…

(oh, and I know that every thread currently produces the same random variables - I will have to fix that eventually with offsets but I didn’t want to deal with that right now)

EDIT: I removed the pseudoinverse for now, no effect

You should follow @vchuravy’s advice:

Doing so, it’s immediately clear that the call to squares_rng is not inferred:

julia> @device_code_warntype interactive=true main()
   call #squares_rng(::Int64,::Any)::Union{}

Note the ::Any: you’re using an untyped global variable (see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables). Furthermore, your first argument idx is of type Int64, while you explicitly define squares_rng to accept only UInt64.

Fixing both then leads to the issue that your sample_extreme_values function is declared to return ::Float32 while you actually return an Array{Float32}.
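To make those fixes concrete, here is a hedged CPU-side sketch, reusing the squares_rng definition and the key constant posted earlier in the thread (the helper max_uniform is hypothetical, just to show the corrected call):

```julia
# Fix 1: an untyped global infers as ::Any; `const` gives it a concrete type.
const key = 0x86d47f132b79acfd  # UInt64 literal

# squares_rng as posted earlier in the thread
@inline function squares_rng(counter::UInt64, seed::UInt64)::UInt32
    yy = counter * seed
    z = yy + seed
    xx = yy * (yy + 1)
    xx = (xx >>> 32) | (xx << 32)
    xx = xx * xx + z
    xx = (xx >>> 32) | (xx << 32)
    return UInt32((xx * xx + yy) >> 32)
end

# Fix 2: the loop counter is Int64, so convert it before calling the
# method that only accepts UInt64.
function max_uniform(sampleSize)
    result = 0.0f0  # Float32 literal avoids an Int/Float32 instability
    for idx in 1:sampleSize
        u = Float32(squares_rng(UInt64(idx), key)) / typemax(UInt32)
        result = max(result, u)
    end
    return result
end
```

(The third issue, the ::Float32 annotation on sample_extreme_values, is fixed by changing it to ::Array{Float32} or simply dropping it.)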

If your code is functional and well-typed, then that’s exactly what CUDA.jl does :slightly_smiling_face:


@maleadt thank you! The global variable was the missing piece, and the wrong return type produced a nice error message. I am still unsure how to use CUDA.@device_code_typed, though. I googled it, but the first result about it is this thread, and I cannot find any documentation on it. Prepending it to the call with the @cuda annotation did not change the error message, nor did it provide any other useful output.

I did not have time to try out Cthulhu, though; I’ll definitely try it out sometime. I am guessing that @device_code_warntype is part of the Cthulhu package? Also, what is main()?

@device_code_typed is like @code_typed, as the docstring suggests, so you should be getting the necessary info from there. Same with @device_code_warntype/@code_warntype; please consult the regular Julia docs for those. main() is just the function I put your top-level code in so I could Revise it.
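The CPU-side analogue can be tried without a GPU; here is a minimal hypothetical example of the untyped-global problem that @code_warntype (and hence @device_code_warntype) flags, with made-up names:

```julia
# An untyped global can be reassigned to any type, so code using it
# cannot be inferred to a concrete type:
untyped_key = 0x86d47f132b79acfd
f_untyped() = untyped_key * UInt64(2)

# A const global has a fixed concrete type, so the same code infers as UInt64:
const typed_key = 0x86d47f132b79acfd
f_typed() = typed_key * UInt64(2)

# @code_warntype f_untyped()  # highlights the ::Any
# @code_warntype f_typed()    # fully inferred
```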