Can't compile a GPU kernel function for >=

I’m trying to persuade CUDA.jl to compile a kernel function. After reducing the issue, I have arrived at this sample code:

using CUDA

x = Float32(2.0)
knots = CuArray{Float32,1}([1.2, 2.0, 3.0])
mapOp(v::Float32)::Int = v >= x ? 1 : 0
mapreduce(mapOp, +, knots)

When I run this, I get:

ERROR: InvalidIRError: compiling kernel partial_mapreduce_grid(typeof(identity), typeof(+), Int64, CartesianIndices{1,Tuple{Base.OneTo{Int64}}}, CartesianIndices{1,Tuple{Base.OneTo{Int64}}}, Val{true}, CuDeviceArray{Int64,2,1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(mapOp),Tuple{CuDeviceArray{Float32,1,1}}}) resulted in invalid LLVM IR        
Reason: unsupported dynamic function invocation (call to >=)
Stacktrace:
 [1] mapOp at D:\steph\Google Drive\src\julia\bspline\broke2.jl:4

So it seems it can’t compile the >= call because it can’t determine the types that go into it. How do I address this?

x is a non-constant global, which prevents its type from being concretely inferred inside mapOp. You need to pass it as a function argument, a la

julia> count_geq(arr, val) = count(>=(val), arr)
count_geq (generic function with 1 method)

julia> count_geq(cu([1.2, 2.0, 3.0]), 2.0f0)
2

Edit: it’s almost as fast to just use count(knots .>= 2):

julia> using BenchmarkTools

julia> knots = 3*CUDA.rand(2^14);

julia> @btime CUDA.@sync count_geq($knots, 2.0f0)
  53.600 μs (77 allocations: 2.06 KiB)
5437

julia> @btime CUDA.@sync count($knots .>= 2)
  55.299 μs (100 allocations: 2.69 KiB)
5437
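
If you would rather keep the mapreduce shape of the original code, the same idea applies: pass the threshold in as an argument so the mapping function closes over a concretely typed value. Here is a minimal sketch; count_geq_mapreduce is a made-up name, and it runs on a plain Array so it can be tried without a GPU. I'd expect a CuArray to behave the same way, but treat that as an assumption.

```julia
# Hypothetical helper: `val` is a function argument, so the anonymous
# function captures a concretely typed value instead of reading a global.
count_geq_mapreduce(arr, val) = mapreduce(v -> v >= val ? 1 : 0, +, arr)

# Plain CPU Array for illustration; swapping in a CuArray is assumed to work the same way.
knots_cpu = Float32[1.2, 2.0, 3.0]
result = count_geq_mapreduce(knots_cpu, 2.0f0)
println(result)
```

The only change from the failing code is where the threshold comes from. The comparison itself is untouched.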

If x had been declared const, the original code would have worked too, since the compiler can then rely on its type.
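
You can see the difference on the CPU by asking the compiler what it infers. This is a rough sketch with made-up names (x_plain, x_const, geq_plain, geq_const). It only looks at inference and does not compile anything for the GPU.

```julia
# Hypothetical names; runs on the CPU, no GPU needed.
x_plain = 2.0f0          # non-constant global: its value and type may change later
const x_const = 2.0f0    # constant global: the compiler can rely on its type

geq_plain(v::Float32) = v >= x_plain
geq_const(v::Float32) = v >= x_const

# Ask the compiler which return type it infers for each method.
plain_types = Base.return_types(geq_plain, (Float32,))
const_types = Base.return_types(geq_const, (Float32,))

println("non-const global: ", plain_types)
println("const global:     ", const_types)
```

The const version should infer Bool, while the non-const one is not inferred concretely. That non-concrete inference is what surfaces as the dynamic call to >= in the kernel.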
