Julia compiler vs. Julia GPU compiler

I’m currently developing a package to provide CUDA.jl support for solving hyperbolic PDEs on GPUs. I recently ran some coarse timing tests on the core accelerated parts of the solvers, using a script similar to this example.jl, and found that on the first run the GPU code generally runs faster than the CPU code.

For example, see the timings below for completing the same tasks on GPU and CPU:

[ Info: Time for prolong2mortars! on GPU
  0.007657 seconds (3.26 k CPU allocations: 228.750 KiB)
[ Info: Time for prolong2mortars! on CPU
  5.049841 seconds (10.14 M allocations: 599.057 MiB, 2.38% gc time, 100.00% compilation time)
[ Info: Time for mortar_flux! on GPU
  0.004028 seconds (1.47 k CPU allocations: 103.203 KiB)
[ Info: Time for mortar_flux! on CPU
 22.798600 seconds (15.53 M allocations: 725.548 MiB, 1.44% gc time, 100.00% compilation time)
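For reference, here is a minimal sketch of how such timings can be collected with CUDA.@time and @time; the solver function and its arguments below (prolong2mortars!, du_gpu, cache_gpu, ...) are placeholders rather than the actual package API.

using CUDA

@info "Time for prolong2mortars! on GPU"
CUDA.@time prolong2mortars!(du_gpu, u_gpu, mesh_gpu, cache_gpu)  # reports CPU and GPU allocations

@info "Time for prolong2mortars! on CPU"
@time prolong2mortars!(du_cpu, u_cpu, mesh_cpu, cache_cpu)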

Since the first run mostly consists of compilation time, I’m curious: Is the Julia GPU compiler specifically optimized to compile faster than the Julia compiler? Or is it just accidental that GPU code compiles faster than CPU code?

I checked Tim’s paper (essentially the prototype and early version of the current CUDA.jl), but it seems to focus more on how the Julia GPU compiler is designed for easier development and better runtime performance. It compares the Julia GPU compiler with nvcc, not with the Julia CPU compiler - would such a comparison even be meaningful? (I’m not sure.)

For the second run, I found that the GPU code has more CPU allocations than the CPU code. For example:

[ Info: Time for surface_integral! on GPU
  0.001121 seconds (35 CPU allocations: 2.031 KiB) (1 GPU allocation: 8 bytes, 1.36% memmgmt time)
[ Info: Time for surface_integral! on CPU
  0.000054 seconds (1 allocation: 144 bytes)
[ Info: Time for jacobian! on GPU
  0.001140 seconds (25 CPU allocations: 1.406 KiB)
[ Info: Time for jacobian! on CPU
  0.000018 seconds

This is similar to what others have observed here: GPU code has a high amount of CPU allocations?. I tried profiling the kernels with Nsight Compute and found that one issue might be that the launch sizes are a bit too small to fully utilize the SMs (this relates to the PDE problem settings, so I will resolve it myself). Beyond that, though, I am somewhat concerned about the large number of CPU allocations in the GPU code.
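As an aside, the usual CUDA.jl pattern for sizing a launch via the occupancy API looks roughly like the sketch below (the kernel name and arguments are made up for illustration, and of course this doesn’t help when the problem itself is simply small):

using CUDA

kernel = @cuda launch=false mortar_flux_kernel!(flux_values, u_upper, u_lower)
config = launch_configuration(kernel.fun)            # occupancy-based suggestion
threads = min(length(flux_values), config.threads)
blocks = cld(length(flux_values), threads)
kernel(flux_values, u_upper, u_lower; threads, blocks)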

I suspect there might be some type instability issues within CUDA.jl, but I am not sure whether fixing them would significantly improve performance. Has anyone already worked on identifying type instabilities in CUDA.jl? If it’s considered worthwhile, I’d be happy to help.
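For what it’s worth, here is a sketch of how one could start looking, both on the host-side launch path and in the device code; the wrapper and kernel names are hypothetical.

using CUDA, InteractiveUtils

function launch_surface_integral!(du, u)
    @cuda threads=256 blocks=cld(length(du), 256) surface_integral_kernel!(du, u)
    return nothing
end

@code_warntype launch_surface_integral!(du_gpu, u_gpu)                    # host-side launch path
CUDA.@device_code_warntype @cuda surface_integral_kernel!(du_gpu, u_gpu)  # device code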

One more question: It seems that the original CUDAnative.jl has been split into CUDA.jl and GPUCompiler.jl, and CUDAdrv.jl has also been merged into CUDA.jl. Why is that?

I would appreciate it if anyone could answer any of my questions. Thanks!


Not specifically, no. It reuses most of the Julia compiler, so compilation performance is expected to be similar. But GPU code is often simpler, so maybe that’s why you’re observing a difference.

These allocations often don’t matter. They’re typically tiny, and Julia’s allocator is really fast for such objects, or at least much faster than the typical latencies you’d observe when working with a GPU (launch overhead, API call overhead, etc).

Furthermore, some allocations are impossible to avoid. For example, every kernel launch requires allocating boxes to store arguments in. Some of the allocations may be avoided by improving type stability of the code, but the performance-sensitive parts (e.g., kernel launches) should have been properly optimized already.
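As a rough illustration, even a trivial kernel should show a handful of small host allocations per launch once compilation is out of the way; a sketch along these lines reproduces it:

using CUDA

trivial_kernel!(x) = (x[1] = 1f0; nothing)

x = CUDA.zeros(Float32, 1)
k = @cuda launch=false trivial_kernel!(x)
k(x)             # warm up: compile the kernel and the launch path
CUDA.@time k(x)  # still reports a few CPU allocations for argument handling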

Most importantly, by splitting the compiler out into GPUCompiler.jl we made it possible for other GPU back-ends to reuse that code. Another goal was to simplify things: users didn’t care about there being several packages; they just wanted to use CUDA in Julia, so we merged everything into a single back-end package. Since then, the CUDA API wrappers (previously CUDAdrv.jl) have become intertwined with the CUDA runtime-like functionality in CUDA.jl (e.g. nonblocking synchronization, implicit task-bound state, etc.), so splitting them out again wouldn’t make sense anymore right now.


Thanks! It is clear and helpful.

I’m also curious: As a developer of CUDA.jl, how do you approach benchmarking a language that relies on a JIT compiler, like Julia? For instance, in your paper (Figure 4), you analyze first-time performance and subsequent performance separately. And in Figure 6, you mention fitting execution times with a lognormal distribution - does that fit also include the first execution (i.e., compilation time), or is it excluded?

In short, how much weight do you think compilation time (i.e., the first execution) should carry in an overall performance measurement for CUDA.jl? Or should that depend on the specific use case?

IMO compilation time is mostly irrelevant, especially for GPU applications which are unlikely to use JIT compilation once the application has reached a steady state. We only added those measurements because reviewers (with experience in ahead-of-time compiled static languages) asked for them. That said, if we were to write that paper today, most of the compilation time would be gone since we’re able to precompile native code now (package images), as well as include GPU code in those caches (see recent PRs in GPUCompiler.jl, and one outstanding one wrt. disk caches).
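For example, a package can already put its host-side native code into the package image with a precompile workload; here is a rough sketch using PrecompileTools.jl, where solve_small_demo is a hypothetical representative problem.

module MyGPUSolvers

using PrecompileTools

@setup_workload begin
    @compile_workload begin
        solve_small_demo()   # compile representative code paths ahead of time
    end
end

end # module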


Thanks - that is really helpful!