Can't compile large expression on CUDA.jl

Yang-yang · August 31, 2022, 11:57am

I’m trying to broadcast a large function on a GPU Array, but it fails.

using CUDA
v1 = CUDA.rand(10)
function fun1(x)
    cos(x) * sin(x) +
    cos(2 * x) * sin(x) +
    cos(3 * x) * sin(x) +
    cos(4 * x) * sin(x) +
    cos(5 * x) * sin(x) +
    cos(6 * x) * sin(x) +
    cos(7 * x) * sin(x) +
    cos(8 * x) * sin(x) +
    cos(9 * x) * sin(x) +
    cos(10 * x) * sin(x) +
    cos(11 * x) * sin(x) +
    cos(12 * x) * sin(x) +
    cos(13 * x) * sin(x) +
    cos(14 * x) * sin(x) +
    cos(15 * x) * sin(x) +
    cos(16 * x) * sin(x) +
    cos(17 * x) * sin(x) +
    cos(18 * x) * sin(x) +
    cos(19 * x) * sin(x) +
    cos(20 * x) * sin(x) +
    cos(21 * x) * sin(x) +
    cos(22 * x) * sin(x) +
    cos(23 * x) * sin(x) +
    cos(24 * x) * sin(x) +
    cos(25 * x) * sin(x) +
    cos(26 * x) * sin(x) +
    cos(27 * x) * sin(x) +
    cos(28 * x) * sin(x) +
    cos(29 * x) * sin(x) +
    cos(30 * x) * sin(x) +
    cos(31 * x) * sin(x) +
    cos(32 * x) * sin(x) +
    cos(33 * x) * sin(x) +
    cos(34 * x) * sin(x) +
    cos(35 * x) * sin(x) +
    cos(36 * x) * sin(x)
end
function fun2(x)
    cos(x) * sin(x) +
    cos(2 * x) * sin(x) +
    cos(3 * x) * sin(x) +
    cos(4 * x) * sin(x) +
    cos(5 * x) * sin(x) +
    cos(6 * x) * sin(x) +
    cos(7 * x) * sin(x) +
    cos(8 * x) * sin(x) +
    cos(9 * x) * sin(x) +
    cos(10 * x) * sin(x) +
    cos(11 * x) * sin(x) +
    cos(12 * x) * sin(x) +
    cos(13 * x) * sin(x) +
    cos(14 * x) * sin(x) +
    cos(15 * x) * sin(x) +
    cos(16 * x) * sin(x) +
    cos(17 * x) * sin(x) +
    cos(18 * x) * sin(x) +
    cos(19 * x) * sin(x) +
    cos(20 * x) * sin(x) +
    cos(21 * x) * sin(x) +
    cos(22 * x) * sin(x) +
    cos(23 * x) * sin(x) +
    cos(24 * x) * sin(x) +
    cos(25 * x) * sin(x) +
    cos(26 * x) * sin(x) +
    cos(27 * x) * sin(x) +
    cos(28 * x) * sin(x) +
    cos(29 * x) * sin(x) +
    cos(30 * x) * sin(x) +
    cos(31 * x) * sin(x) +
    cos(32 * x) * sin(x) +
    cos(33 * x) * sin(x) +
    cos(34 * x) * sin(x) +
    cos(35 * x) * sin(x)
end
fun1.(v1)
fun2.(v1)

Here ‘fun1’ fails, while ‘fun2’ works. ‘fun2’ is a bit shorter which I only drop the cos(36 * x) * sin(x) term in the last line.

The error shows:

ERROR: InvalidIRError: compiling kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(fun1), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f__apply_iterate)
Stacktrace:
 [1] +
   @ ./operators.jl:591
 [2] fun1
   @ ~/Downloads/test/fun.jl:4
 [3] _broadcast_getindex_evalf
   @ ./broadcast.jl:670
 [4] _broadcast_getindex
   @ ./broadcast.jl:643
 [5] getindex
   @ ./broadcast.jl:597
 [6] broadcast_kernel
   @ ~/.julia/packages/GPUArrays/fqD8z/src/host/broadcast.jl:57
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl

jling · August 31, 2022, 1:08pm

my guess is as expression gets too hard Julia gave up inlining and fall back to the apply_iterate call which CUDA doesn’t handle as expected

Yang-yang · September 3, 2022, 5:59am

Is there a way to fix this bug

torrance · September 3, 2022, 6:46am

I don’t know the cause of this issue, but have you tried rewriting the sum as a loop? This might save on the number of registers used in the kernel.

jmair · September 3, 2022, 8:46am

Try this:

function fun3(x; n=36)
    result = zero(typeof(x))
    for i in 1:n
        result += cos(i*x)
    end
    return result * sin(x)
end

You can change n to be the number of terms.

Also, the original expression not compiling is probably a bug, which should probably be reported. I am not sure if CUDA.jl is the right package to report this on or not, but I am sure someone else can point you in the right direction.

EDIT: Changed mapreduce to a normal for loop as mapreduce would not compile. Current code works.

Yang-yang · September 3, 2022, 5:10pm

Rewriting the sum as a loop will work. But this is just a minimum example. I want to figure out the largest size of expression which CUDA.jl could handle.

Actually, we want to broadcast a expression without any regular patterns which is about 10000 lines. We have tried this on C++ code with OpenAcc and It fails to compile. But, for a 800 lines of expresion OpenAcc works well.

Of course, we can sepreate this large expression into many small pieces with maybe some metaprogamming tricks. If CUDA.jl can do this automatically that will be very nice.

Yang-yang · September 3, 2022, 5:26pm

Thanks for your nice code. It works and very efficient. It only takes about 167μs when broadcasting on a 1000*1000 random matrix for n=35. While, it will takes about 4.6ms for the ‘original expression’.

I will try to report this issue in CUDA.jl.

torrance · September 5, 2022, 1:26am

The difference in timing there is almost certainly because the compiled kernel is smaller and many more threads can run in parallel.

The issue you’re seeing both here and on openacc is likely a hardware limit on the maximum size of the kernel that your hardware will accept. And even if they can run, as you see above, they’re not going to take much advantage of the parallel capabilities of cuda hardware.

I would recommended breaking your ‘10000 variable’ kernels down to a sequence of much smaller kernels that are small enough to take advantage of GPU parallelism and simply run them in sequence.

jling · September 5, 2022, 3:00am

well you shouldn’t expect ANY library to be able to “just” compile arbitrary function vectorized onto GPUs, it’s ~more or less a matter of optimization and heuristics of the compiler and front end that determines how much you can get away with without thinking, but eventually you will have to think

Topic		Replies	Views
Trignometric functions on GPU GPU	4	1083	November 21, 2019
Broadcasting performance Performance gpu , multithreading , broadcasting	13	542	January 6, 2025
Performance of kernel function GPU	3	449	November 28, 2019
Problem with GPU programming GPU cudanative , cuda	4	1055	September 13, 2019
Support for Complex-valued CuArray GPU question	2	714	October 12, 2018

Can't compile large expression on CUDA.jl

Related topics