I’m currently developing a package that provides CUDA.jl support for solving hyperbolic PDEs on GPUs. I recently ran some coarse timing tests on the core accelerated parts of the solvers, using a script similar to this example.jl. I found that on the first run, the GPU code generally finishes faster than the CPU code.
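For reference, the numbers below were collected with a pattern roughly like the following (a minimal, self-contained sketch with a toy kernel; `add_one_kernel!` and `add_one_cpu!` only stand in for the solver functions, and the actual example.jl is more involved):

```julia
using CUDA

# Toy stand-ins for the solver functions, only to show the measurement pattern
function add_one_kernel!(y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] += 1.0f0
    end
    return nothing
end
add_one_cpu!(y) = (y .+= 1.0f0; nothing)

y_gpu = CUDA.zeros(Float32, 2^20)
y_cpu = zeros(Float32, 2^20)

@info "Time for add_one! on GPU"
# CUDA.@time synchronizes the device and reports both CPU-side and GPU-side allocations
CUDA.@time @cuda threads=256 blocks=cld(length(y_gpu), 256) add_one_kernel!(y_gpu)

@info "Time for add_one! on CPU"
@time add_one_cpu!(y_cpu)
```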
For example, here are the timings for completing the same tasks on the GPU and on the CPU:
[ Info: Time for prolong2mortars! on GPU
0.007657 seconds (3.26 k CPU allocations: 228.750 KiB)
[ Info: Time for prolong2mortars! on CPU
5.049841 seconds (10.14 M allocations: 599.057 MiB, 2.38% gc time, 100.00% compilation time)
[ Info: Time for mortar_flux! on GPU
0.004028 seconds (1.47 k CPU allocations: 103.203 KiB)
[ Info: Time for mortar_flux! on CPU
22.798600 seconds (15.53 M allocations: 725.548 MiB, 1.44% gc time, 100.00% compilation time)
Since the first run consists mostly of compilation time, I’m curious: is the Julia GPU compiler specifically optimized to compile faster than the regular Julia compiler, or is it just coincidental that the GPU code compiles faster than the CPU code here?
I checked Tim’s paper (essentially the prototype and an early version of the current CUDA.jl), but it seems to focus more on how the Julia GPU compiler is designed for ease of development and runtime performance. It compares the Julia GPU compiler with nvcc, not with the Julia CPU compiler. Would it even be meaningful to compare compilation times between the two? (I’m not sure about that.)
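If such a comparison does make sense, I guess one could at least try to isolate the compilation itself on both sides, something like this (a sketch with a toy kernel; `@cuda launch=false` compiles the kernel for the given argument types without launching it):

```julia
using CUDA

function scale_kernel!(y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] *= 2.0f0
    end
    return nothing
end
scale_cpu!(y) = (y .*= 2.0f0; nothing)

y_gpu = CUDA.ones(Float32, 1024)
y_cpu = ones(Float32, 1024)

# GPU side: compile the kernel for these argument types without running it
# (this still includes some host-side Julia compilation of the launch machinery)
@time @cuda launch=false scale_kernel!(y_gpu)

# CPU side: the first call is dominated by compilation, and @time reports its share
@time scale_cpu!(y_cpu)
```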
On the second run, I found that the GPU code has many more CPU allocations than the CPU code. For example:
[ Info: Time for surface_integral! on GPU
0.001121 seconds (35 CPU allocations: 2.031 KiB) (1 GPU allocation: 8 bytes, 1.36% memmgmt time)
[ Info: Time for surface_integral! on CPU
0.000054 seconds (1 allocation: 144 bytes)
[ Info: Time for jacobian! on GPU
0.001140 seconds (25 CPU allocations: 1.406 KiB)
[ Info: Time for jacobian! on CPU
0.000018 seconds
This is similar to what others have observed here: GPU code has a high amount of CPU allocations?. I profiled the kernels with Nsight Compute, and one issue might be that the launch sizes are a bit too small to fully utilize the SMs (this comes from the PDE problem settings, so I will resolve it myself, roughly along the lines of the occupancy check sketched below). Beyond that, though, I am somewhat concerned about the large number of CPU allocations in the GPU code.
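For reference, that check looks roughly like this (a minimal sketch; the toy kernel stands in for the real solver kernels, and `launch_configuration` is CUDA.jl’s occupancy API):

```julia
using CUDA

function kernel!(y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] += 1.0f0
    end
    return nothing
end

y = CUDA.zeros(Float32, 10_000)

# Compile without launching, then ask the occupancy API for a good configuration
k = @cuda launch=false kernel!(y)
config = launch_configuration(k.fun)
threads = min(length(y), config.threads)
blocks = cld(length(y), threads)
@show config.threads config.blocks threads blocks

# Launch with the suggested configuration
k(y; threads, blocks)
```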
I suspect there might be some type instability issues within CUDA.jl, but I am not sure if solving these would significantly improve performance. Has anyone already worked on identifying type instabilities in CUDA.jl? If it’s considered worthwhile, I’d be happy to help.
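To make that concrete, what I had in mind is checking the host-side launch path with `@code_warntype` and the device side with `CUDA.@device_code_warntype`, roughly like this (a sketch; `launch!` is just a hypothetical wrapper around the kernel launch):

```julia
using CUDA
using InteractiveUtils  # for @code_warntype outside the REPL

function kernel!(y)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(y)
        @inbounds y[i] += 1.0f0
    end
    return nothing
end

# Hypothetical wrapper, mimicking how a solver would launch the kernel
launch!(y) = @cuda threads=256 blocks=cld(length(y), 256) kernel!(y)

y = CUDA.zeros(Float32, 4096)

# Host side: look for Any/Union types in the code that sets up the launch
@code_warntype launch!(y)

# Device side: inspect the typed IR of the kernel itself
CUDA.@device_code_warntype @cuda threads=256 blocks=cld(length(y), 256) kernel!(y)
```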
One more question: It seems that the original CUDAnative.jl has been split into CUDA.jl and GPUCompiler.jl, and CUDAdrv.jl has also been merged into CUDA.jl. Why is that?
I would appreciate it if anyone could answer any of my questions. Thanks!