# Can't compile large expression on CUDA.jl

I’m trying to broadcast a large function on a GPU Array, but it fails.

```julia
using CUDA
v1 = CUDA.rand(10)
function fun1(x)
cos(x) * sin(x) +
cos(2 * x) * sin(x) +
cos(3 * x) * sin(x) +
cos(4 * x) * sin(x) +
cos(5 * x) * sin(x) +
cos(6 * x) * sin(x) +
cos(7 * x) * sin(x) +
cos(8 * x) * sin(x) +
cos(9 * x) * sin(x) +
cos(10 * x) * sin(x) +
cos(11 * x) * sin(x) +
cos(12 * x) * sin(x) +
cos(13 * x) * sin(x) +
cos(14 * x) * sin(x) +
cos(15 * x) * sin(x) +
cos(16 * x) * sin(x) +
cos(17 * x) * sin(x) +
cos(18 * x) * sin(x) +
cos(19 * x) * sin(x) +
cos(20 * x) * sin(x) +
cos(21 * x) * sin(x) +
cos(22 * x) * sin(x) +
cos(23 * x) * sin(x) +
cos(24 * x) * sin(x) +
cos(25 * x) * sin(x) +
cos(26 * x) * sin(x) +
cos(27 * x) * sin(x) +
cos(28 * x) * sin(x) +
cos(29 * x) * sin(x) +
cos(30 * x) * sin(x) +
cos(31 * x) * sin(x) +
cos(32 * x) * sin(x) +
cos(33 * x) * sin(x) +
cos(34 * x) * sin(x) +
cos(35 * x) * sin(x) +
cos(36 * x) * sin(x)
end
function fun2(x)
cos(x) * sin(x) +
cos(2 * x) * sin(x) +
cos(3 * x) * sin(x) +
cos(4 * x) * sin(x) +
cos(5 * x) * sin(x) +
cos(6 * x) * sin(x) +
cos(7 * x) * sin(x) +
cos(8 * x) * sin(x) +
cos(9 * x) * sin(x) +
cos(10 * x) * sin(x) +
cos(11 * x) * sin(x) +
cos(12 * x) * sin(x) +
cos(13 * x) * sin(x) +
cos(14 * x) * sin(x) +
cos(15 * x) * sin(x) +
cos(16 * x) * sin(x) +
cos(17 * x) * sin(x) +
cos(18 * x) * sin(x) +
cos(19 * x) * sin(x) +
cos(20 * x) * sin(x) +
cos(21 * x) * sin(x) +
cos(22 * x) * sin(x) +
cos(23 * x) * sin(x) +
cos(24 * x) * sin(x) +
cos(25 * x) * sin(x) +
cos(26 * x) * sin(x) +
cos(27 * x) * sin(x) +
cos(28 * x) * sin(x) +
cos(29 * x) * sin(x) +
cos(30 * x) * sin(x) +
cos(31 * x) * sin(x) +
cos(32 * x) * sin(x) +
cos(33 * x) * sin(x) +
cos(34 * x) * sin(x) +
cos(35 * x) * sin(x)
end
fun1.(v1)
fun2.(v1)
```

Here `fun1` fails, while `fun2` works. `fun2` is slightly shorter: I only dropped the `cos(36 * x) * sin(x)` term from the last line.

The error shows:

```text
ERROR: InvalidIRError: compiling kernel #broadcast_kernel#17(CUDA.CuKernelContext, CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{1}, Tuple{Base.OneTo{Int64}}, typeof(fun1), Tuple{Base.Broadcast.Extruded{CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to jl_f__apply_iterate)
Stacktrace:
[1] +
@ ./operators.jl:591
[2] fun1
[5] getindex
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
```

My guess is that as the expression gets too large, Julia gives up on inlining and falls back to the `jl_f__apply_iterate` call, which CUDA.jl doesn't handle as expected.


Is there a way to fix this bug?

I don’t know the cause of this issue, but have you tried rewriting the sum as a loop? This might save on the number of registers used in the kernel.


Try this:

```julia
function fun3(x; n = 36)
    result = zero(typeof(x))
    for i in 1:n
        result += cos(i * x)
    end
    return result * sin(x)
end
```

You can change `n` to be the number of terms.
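For what it's worth, here is a quick CPU check that the loop is equivalent to the original unrolled expression. The `reference` function is just a comprehension-based comparison I wrote for this snippet; `fun3` is repeated so it is self-contained:

```julia
# Same fun3 as above, repeated so this snippet runs on its own.
function fun3(x; n = 36)
    result = zero(typeof(x))
    for i in 1:n
        result += cos(i * x)
    end
    return result * sin(x)
end

# Comprehension-based reference, written only for comparison.
reference(x, n) = sum(cos(i * x) for i in 1:n) * sin(x)

x = 0.7
@assert isapprox(fun3(x; n = 36), reference(x, 36))
```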

Also, the original expression failing to compile is probably a bug that should be reported. I am not sure whether `CUDA.jl` is the right package to report it against, but I am sure someone else can point you in the right direction.

EDIT: Changed `mapreduce` to a normal `for` loop as `mapreduce` would not compile. Current code works.


Rewriting the sum as a loop works, but this is just a minimal example. I want to figure out the largest expression that CUDA.jl can handle.

Actually, we want to broadcast an expression with no regular pattern that is about 10,000 lines long. We tried this in C++ with OpenACC and it fails to compile, while an 800-line expression compiles fine with OpenACC.

Of course, we can separate this large expression into many small pieces, perhaps with some metaprogramming tricks. It would be very nice if CUDA.jl could do this automatically.
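For the manual splitting, here is a rough metaprogramming sketch for the regular sum from the minimal example (names like `chunk_1` and `fun_chunked` are made up, and I haven't checked how much this helps on the GPU). It `@eval`s one small function per chunk of terms, then a top-level function that sums the chunks:

```julia
const NTERMS = 36   # total number of cos(i*x) terms
const CHUNK  = 8    # terms per generated helper function

# Generate chunk_1, chunk_2, ..., each summing a small slice of terms.
for (k, lo) in enumerate(1:CHUNK:NTERMS)
    hi = min(lo + CHUNK - 1, NTERMS)
    body = Expr(:call, :+, [:(cos($i * x)) for i in lo:hi]...)
    @eval $(Symbol(:chunk_, k))(x) = $body
end

# Top-level function: sum the chunks, then apply the shared sin(x) factor.
nchunks = cld(NTERMS, CHUNK)
total = Expr(:call, :+, [:($(Symbol(:chunk_, k))(x)) for k in 1:nchunks]...)
@eval fun_chunked(x) = $total * sin(x)
```

For an irregular 10,000-line expression you would build the chunk bodies from your own term list instead of the `cos(i*x)` pattern, but the shape of the trick is the same.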

Thanks for your nice code. It works and is very efficient: it takes only about 167 μs when broadcast over a 1000×1000 random matrix with n = 35, while the ‘original expression’ takes about 4.6 ms.

I will try to report this issue in CUDA.jl.


The difference in timing there is almost certainly because the compiled kernel is smaller and many more threads can run in parallel.

The issue you’re seeing both here and with OpenACC is likely a hardware limit on the maximum kernel size your hardware will accept. And even if such kernels do run, as you saw above, they won’t take much advantage of the parallel capabilities of CUDA hardware.

I would recommend breaking your ‘10000 variable’ kernels down into a sequence of much smaller kernels, each small enough to take advantage of GPU parallelism, and simply running them in sequence.
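A minimal sketch of that idea using only broadcasting, so the same code runs on `Array` or `CuArray` (`partial` and `chunked_broadcast!` are hypothetical names, and the chunk size would need tuning for real kernels):

```julia
# One small, fixed-size piece of the sum; each tiny kernel computes this.
function partial(x, lo::Int, hi::Int)
    acc = zero(typeof(x))
    for i in lo:hi
        acc += cos(i * x)
    end
    return acc
end

# Launch the small pieces in sequence, accumulating into `out`.
function chunked_broadcast!(out, v; nterms = 36, chunk = 8)
    fill!(out, zero(eltype(out)))
    for lo in 1:chunk:nterms
        hi = min(lo + chunk - 1, nterms)
        out .+= partial.(v, lo, hi)
    end
    out .*= sin.(v)
    return out
end
```

On the GPU, each `out .+= partial.(v, lo, hi)` launches its own small kernel, so no single kernel body ever gets large.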


Well, you shouldn’t expect any library to be able to “just” compile an arbitrary function vectorized onto GPUs. It’s more or less the optimizations and heuristics of the compiler and front end that determine how much you can get away with without thinking, but eventually you will have to think.