How to precompile a CUDA kernel itself?

Hi all.
I am having problems with my CUDA kernels being slow on their first launch. In my application the first call takes about 60 seconds in the worst case.
We tried PrecompileTools, but did not find it very effective. Are we doing something wrong, or is there a better way?

module Test
using CUDA

export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)

    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        nothing
    end

    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end
end # module Test

module Startup
using CUDA
using Test
using PrecompileTools

@compile_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)

    CUDA.@sync mycopy!(A, B)
end
end # module Startup

Hi,

For your example on my machine, PrecompileTools.jl reduced the time for the first run by some 80%, from around 5 s to around 1.25 s. That's still more than I'd like, but precompiling clearly helps quite a bit for me.

'Startup' code

Essentially your code, with minor modifications, following (my interpretation of) the local Startup packages tutorial.
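
For reference, a rough sketch of the package setup (my reconstruction of the tutorial steps, not a verbatim copy; dev ./Startup is what makes `using Startup` work from the default environment):

(@v1.10) pkg> generate Startup
(@v1.10) pkg> dev ./Startup
(@v1.10) pkg> activate Startup
(Startup) pkg> add CUDA PrecompileTools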
(…)/Startup/src/Test.jl (exactly the same as your Test module):

module Test

using CUDA

export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)

    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        nothing
    end

    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end

end

(…)/Startup/src/Startup.jl:

module Startup

include("./Test.jl")

using PrecompileTools
using CUDA
using .Test

export mycopy!

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end # module Startup

REPL, without precompilation:

(@v1.10) pkg> activate Startup
  Activating project at `(...)\Startup`

(Startup) pkg> ^C

julia> include("Startup/src/Test.jl")
Main.Test

julia> using CUDA, .Test

julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);

julia> @time CUDA.@sync mycopy!(A, B)
  4.850064 seconds (4.68 M allocations: 321.383 MiB, 1.56% gc time, 98.85% compilation time: 5% of which was recompilation)
CUDA.HostKernel for kernel()

julia> @time CUDA.@sync mycopy!(A, B)
  0.020466 seconds (63 allocations: 4.391 KiB)
CUDA.HostKernel for kernel()

REPL, after precompilation:

(@v1.10) pkg> activate Startup
  Activating project at `(...)\Startup`

(Startup) pkg> ^C

julia> using Startup

julia> using CUDA

julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);

julia> @time CUDA.@sync mycopy!(A, B)
  1.263689 seconds (1.30 M allocations: 90.532 MiB, 1.16% gc time, 95.33% compilation time)
CUDA.HostKernel for kernel()

julia> @time CUDA.@sync mycopy!(A, B)
  0.000207 seconds (25 allocations: 1.172 KiB)
CUDA.HostKernel for kernel()

'Package' code

Following (my interpretation of) the forcing precompilation with workloads tutorial.

(…)/Test/src/Test.jl:

module Test

using PrecompileTools: @setup_workload, @compile_workload
using CUDA

export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)

    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        nothing
    end

    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end


@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end

REPL, after precompilation:

(@v1.10) pkg> activate Test
  Activating project at `(...)\Test`

(Test) pkg> ^C

julia> using Test

julia> using CUDA

julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);

julia> @time CUDA.@sync mycopy!(A, B)
  1.223230 seconds (1.30 M allocations: 90.564 MiB, 1.14% gc time, 94.89% compilation time)
CUDA.HostKernel for kernel()

julia> @time CUDA.@sync mycopy!(A, B)
  0.000208 seconds (25 allocations: 1.172 KiB)
CUDA.HostKernel for kernel()

Version info
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  JULIA_NUM_THREADS = auto


(Test) pkg> st
Project Test v0.1.0
Status `(...)\Test\Project.toml`
  [052768ef] CUDA v5.5.2
  [aea7be01] PrecompileTools v1.2.1

CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+560.94

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 3070 (sm_86, 6.424 GiB / 8.000 GiB available)

Thank you, but I don't think that is enough. Many applications will call more than one kernel; if one call takes a second, twenty calls can take as long as twenty seconds.
To measure performance with MPI, I run julia ***.jl many times, changing the number of processes each time. Each run takes so long to compile that it looks as if it has frozen.

This suggests that something is NOT being precompiled. This issue was only resolved last year, but if there was any deep dive into precompilation, it isn't discussed there: Cannot precompile GPU code with PrecompileTools · Issue #2006 · JuliaGPU/CUDA.jl

Yes. That is exactly why I suspect the GPU kernel is not precompiled.

Alternatively, this is also pretty strongly suggested by the fact that the second call runs orders of magnitude faster :)

Here’s a way to use a cached precompiled kernel. It assumes you have ptxas in your PATH.
In all likelihood there exist more appropriate methods in CUDA.jl. For example, I’m pretty sure cudacall should be preferred over my HostKernel approach, but I couldn’t get it to work (a sketch of that attempt is included below). Check out

and

for some more information.
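
For completeness, here is a minimal, untested sketch of how I imagine the cudacall route would look, reusing cubin_path, func_name, threads and blocks from mycopy! below. The hidden KernelState argument that Julia-compiled kernels receive (cf. the HostKernel construction) may be exactly what I failed to handle:

mdl = CuModule(read(cubin_path))    # load the cached cubin
func = CuFunction(mdl, func_name)   # look up the kernel by name
args = map(CUDA.cudaconvert, (A, B))
CUDA.cudacall(func, Tuple{map(Core.Typeof, args)...}, args...;
              threads=threads, blocks=blocks)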


(…)/Test/src/Test.jl:

module Test

using PrecompileTools: @setup_workload, @compile_workload
using CUDA
using CUDA: i32

export mycopy!

function kernel!(A, B)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1i32)
    if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
        A[i] = B[i]
    end
    nothing
end

function get_func_name_from_ptx(ptx_path)
    # The PTX contains a comment "// -- Begin function <func_name>" which ends with a newline;
    # extract <func_name> from it. (This could be written more efficiently.)
    ptx_code = read(ptx_path, String)
    r = findfirst(Regex("// -- Begin function .*\n"), ptx_code)
    return ptx_code[r[begin + length("// -- Begin function "):end - 1]]
end

function get_compiled_kernel(cubin_path, name, kernel_tt)
    mdl = CuModule(read(cubin_path))
    func = CuFunction(mdl, name)
    return CUDA.HostKernel{typeof(kernel!), kernel_tt}(kernel!, func, CUDA.KernelState(CUDA.create_exceptions!(mdl), UInt32(0)))
end

function mycopy!(A, B, cache_dir="E:/Temp/kernel/")  # Adjust default
    len = length(A)
    @assert len === length(B)
    threads = 512
    blocks = cld(len, threads)
	
    ptx_path = joinpath(cache_dir, "kernel.ptx")
    cubin_path = joinpath(cache_dir, "kernel.cubin")
    func_name_path = joinpath(cache_dir, "name.txt")
    if !ispath(ptx_path)
        # Compile to disk
        mkpath(cache_dir)
        open(ptx_path, "w") do io
            @device_code_ptx io @cuda threads=threads blocks=blocks kernel!(A, B)
        end
        func_name = get_func_name_from_ptx(ptx_path)
        open(func_name_path, "w") do io
            write(io, func_name)
        end
        sm = CUDA.capability(CUDA.CuDevice(0))
        run(`ptxas -arch=sm_$(sm.major)$(sm.minor) --output-file $cubin_path $ptx_path`)  # compile ptx to cubin
    end
    # kernel.ptx, kernel.cubin and name.txt should now exist
    func_name = read(func_name_path, String)
    kernel_args = map(CUDA.cudaconvert, (A, B))  # (cf. @cuda macro)
    kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
    kern = get_compiled_kernel(cubin_path, func_name, kernel_tt)
    kern(A, B, threads=threads, blocks=blocks)
end

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end

REPL output:

(@v1.10) pkg> activate Test
  Activating project at `(...)\Test`

(Test) pkg> ^C

julia> using Test
Precompiling Test
  1 dependency successfully precompiled in 11 seconds. 69 already precompiled.
[ Info: Precompiling Test [98d22206-c062-46eb-91c7-b4da6428f19f]

julia> using CUDA

julia> A = CUDA.zeros(100, 100, 100); B = CUDA.ones(100, 100, 100);

julia> @time CUDA.@sync mycopy!(A, B)
  0.001473 seconds (115 allocations: 1.007 MiB)

julia> @time CUDA.@sync mycopy!(A, B)
  0.000652 seconds (115 allocations: 1.007 MiB)

julia> CUDA.@allowscalar A[1]
1.0f0

You could make this a bit more efficient in subsequent runs by keeping mdl and func_name in memory rather than rereading them from disk every time. But in the grand scheme of things, I assume this will be negligible.
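
For example, a minimal sketch of such a cache (the dictionary and function name are mine, not part of the module above):

const _KERNEL_CACHE = Dict{Tuple{String, String}, Any}()

function get_compiled_kernel_cached(cubin_path, name, kernel_tt)
    # Load the CuModule/CuFunction once per (path, name) pair and reuse it afterwards
    get!(_KERNEL_CACHE, (cubin_path, name)) do
        get_compiled_kernel(cubin_path, name, kernel_tt)
    end
end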


Which version of Julia are you using? What is your versioninfo()?
Also, can you post a complete MWE? E.g. how are you measuring the time?

There are several stages here. Firstly, there is the time it takes to infer the mechanisms for CUDA kernel compilation. For that, I need to revisit Use PrecompileTools to warmup CUDA.jl by vchuravy · Pull Request #2325 · JuliaGPU/CUDA.jl · GitHub

After that, there is the time it takes to infer your own kernels. From Julia 1.11 that can be cached, but it is still a bit finicky, as seen in the PR to enable precompilation in CUDA.jl.

Lastly, there is the native PTX code; once we can cache your inference results, we can cache the PTX code as well: Add disk cache infrastructure for Julia 1.11 by vchuravy · Pull Request #557 · JuliaGPU/GPUCompiler.jl · GitHub

TLDR: There is hope, but there are still many sharp corners.


I will show the complete example, together with versioninfo() and CUDA.versioninfo().
I observed the same trend as @eldee: the first run is certainly faster when precompiled, but still much slower than the second.

TLDR: There is hope, but there are still many sharp corners.

Great!! I hope these improvements come true soon.

#] generate Startup
#] develop ./Startup

module Startup
using CUDA
using PrecompileTools

export CUDA
export copy_with_precompile!

function copy_with_precompile!(A, B)
    len = length(A)
    @assert len === length(B)

    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        nothing
    end

    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync copy_with_precompile!(A, B)
    end
end
end

julia> using Startup

julia> A = CUDA.zeros(100, 100, 100);

julia> B = CUDA.zeros(100, 100, 100);

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  1.411800 seconds (880.65 k CPU allocations: 46.825 MiB, 8.04% gc time)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000173 seconds (22 CPU allocations: 960 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000167 seconds (22 CPU allocations: 960 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000177 seconds (22 CPU allocations: 960 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000178 seconds (22 CPU allocations: 960 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000566 seconds (47 CPU allocations: 2.844 KiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000436 seconds (45 CPU allocations: 2.750 KiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_with_precompile!(A, B)
  0.000997 seconds (47 CPU allocations: 2.875 KiB)
CUDA.HostKernel for kernel()

julia> using CUDA

julia> function copy_without_precompile!(A, B)
           len = length(A)
           @assert len === length(B)

           function kernel()
               i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
               if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
                   A[i] = B[i]
               end
               nothing
           end

           threads = 512
           blocks = cld(len, threads)
           @cuda threads = threads blocks = blocks kernel()
       end
copy_without_precompile! (generic function with 1 method)

julia> A = CUDA.zeros(100, 100, 100);

julia> B = CUDA.zeros(100, 100, 100);

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  6.742919 seconds (9.52 M CPU allocations: 480.393 MiB, 0.92% gc time)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.036825 seconds (62.46 k CPU allocations: 3.523 MiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.000669 seconds (48 CPU allocations: 3.125 KiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.001013 seconds (57 CPU allocations: 3.156 KiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.001010 seconds (47 CPU allocations: 2.875 KiB)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.001066 seconds (23 CPU allocations: 976 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.001106 seconds (23 CPU allocations: 976 bytes)
CUDA.HostKernel for kernel()

julia> CUDA.@time CUDA.@sync copy_without_precompile!(A, B)
  0.001036 seconds (23 CPU allocations: 976 bytes)
CUDA.HostKernel for kernel()

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × 13th Gen Intel(R) Core(TM) i7-13700K
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 24 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 

julia> CUDA.versioninfo()
CUDA runtime 12.6, artifact installation
CUDA driver 12.7
NVIDIA driver 566.3.0

CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+565.57.2

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce RTX 4070 Ti (sm_89, 8.262 GiB / 11.994 GiB available)