Hi all.
The first launch of my CUDA kernels is very slow. In my application, the first launch takes about 60 seconds in the worst case.
I tried PrecompileTools, but it did not help much. Am I doing something wrong, or is there a better way?
module Test

using CUDA
export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)
    # Closure kernel: each thread copies one element, guarded by a bounds check.
    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        return nothing # GPU kernels must not return a value
    end
    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end

end # module Test
module Startup

using CUDA
using Test
using PrecompileTools

@compile_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    CUDA.@sync mycopy!(A, B)
end

end # module Startup
For your example on my machine, PrecompileTools.jl reduced the first-run time by roughly 75%, from around 5 s to around 1.25 s. It’s still more than I’d like, but precompiling clearly helps quite a bit for me.
'Startup' code
Essentially your code, with minor modifications. Following (my interpretation of) the local Startup packages tutorial.
(…)/Startup/src/Test.jl (exactly the same as your Test module):
module Test

using CUDA
export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)
    # Closure kernel: each thread copies one element, guarded by a bounds check.
    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        return nothing # GPU kernels must not return a value
    end
    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end

end # module Test
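(…)/Startup/src/Startup.jl:
module Startup

using PrecompileTools
using CUDA

include("Test.jl") # Test.jl sits next to this file in src/; needed for `using .Test`
using .Test
export mycopy!

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end # module Startup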
REPL, without precompilation:
(@v1.10) pkg> activate Startup
Activating project at `(...)\Startup`
(Startup) pkg> ^C
julia> include("Startup/src/Test.jl")
julia> using CUDA, .Test
julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);
julia> @time CUDA.@sync mycopy!(A, B)
4.850064 seconds (4.68 M allocations: 321.383 MiB, 1.56% gc time, 98.85% compilation time: 5% of which was recompilation)
CUDA.HostKernel for kernel()
julia> @time CUDA.@sync mycopy!(A, B)
0.020466 seconds (63 allocations: 4.391 KiB)
CUDA.HostKernel for kernel()
REPL, after precompilation:
(@v1.10) pkg> activate Startup
Activating project at `(...)\Startup`
(Startup) pkg> ^C
julia> using Startup
julia> using CUDA
julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);
julia> @time CUDA.@sync mycopy!(A, B)
1.263689 seconds (1.30 M allocations: 90.532 MiB, 1.16% gc time, 95.33% compilation time)
CUDA.HostKernel for kernel()
julia> @time CUDA.@sync mycopy!(A, B)
0.000207 seconds (25 allocations: 1.172 KiB)
CUDA.HostKernel for kernel()
The same workload placed directly inside a Test package:
module Test

using PrecompileTools: @setup_workload, @compile_workload
using CUDA
export mycopy!

function mycopy!(A, B)
    len = length(A)
    @assert len === length(B)
    # Closure kernel: each thread copies one element, guarded by a bounds check.
    function kernel()
        i = threadIdx().x + blockDim().x * (blockIdx().x - 1)
        if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
            A[i] = B[i]
        end
        return nothing # GPU kernels must not return a value
    end
    threads = 512
    blocks = cld(len, threads)
    @cuda threads = threads blocks = blocks kernel()
end

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end # module Test
REPL, after precompilation:
(@v1.10) pkg> activate Test
Activating project at `(...)\Test`
(Test) pkg> ^C
julia> using Test
julia> using CUDA
julia> A = CUDA.zeros(100, 100, 100); B = CUDA.zeros(100, 100, 100);
julia> @time CUDA.@sync mycopy!(A, B)
1.223230 seconds (1.30 M allocations: 90.564 MiB, 1.14% gc time, 94.89% compilation time)
CUDA.HostKernel for kernel()
julia> @time CUDA.@sync mycopy!(A, B)
0.000208 seconds (25 allocations: 1.172 KiB)
CUDA.HostKernel for kernel()
Version info
julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd4843 (2024-06-04 10:41 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
(Test) pkg> st
Project Test v0.1.0
Status `(...)\Test\Project.toml`
[052768ef] CUDA v5.5.2
[aea7be01] PrecompileTools v1.2.1
CUDA libraries:
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+560.94
Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0
- Julia: 1.10.4
- LLVM: 15.0.7
1 device:
0: NVIDIA GeForce RTX 3070 (sm_86, 6.424 GiB / 8.000 GiB available)
Thank you, but I don't think that is enough. Most applications call more than one kernel: if each first call takes about one second, 20 kernels add up to about 20 seconds.
To measure performance with MPI, I run julia ***.jl many times, changing the number of processes each time. Every run spends so long compiling that it looks frozen.
Here’s a way to use a cached precompiled kernel. It assumes you have ptxas in your PATH.
In all likelihood there exist more appropriate methods in CUDA.jl. For example, I’m pretty sure cudacall should be preferred over my HostKernel approach, but I couldn’t get it to work. Check out
for some more information.
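Since the approach shells out to ptxas, it may be worth failing early with a readable message when the tool is missing. The following is only a minimal sketch of such a check; `require_tool` is a hypothetical helper name, not something provided by CUDA.jl, and it relies solely on Julia's built-in `Sys.which`.

```julia
# Hypothetical helper (not part of CUDA.jl): look up an external program on PATH
# and raise a descriptive error if it cannot be found.
function require_tool(name::AbstractString)
    path = Sys.which(name)  # returns `nothing` when the program is not on PATH
    path === nothing && error("`$name` was not found on PATH")
    return path             # absolute path to the executable
end

# Intended use before compiling PTX to a cubin:
#     ptxas_path = require_tool("ptxas")
```

The returned path could also be interpolated into the `run` command instead of the bare name, which avoids a second PATH lookup.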
module Test

using PrecompileTools: @setup_workload, @compile_workload
using CUDA
using CUDA: i32

export mycopy!

function kernel!(A, B)
    i = threadIdx().x + blockDim().x * (blockIdx().x - 1i32)
    if checkbounds(Bool, A, i) && checkbounds(Bool, B, i)
        A[i] = B[i]
    end
    return nothing # GPU kernels must not return a value
end

function get_func_name_from_ptx(ptx_path)
    # (Can be written more efficiently)
    # There is a comment "// -- Begin function <func_name>" which ends with a newline
    ptx_code = read(ptx_path, String)
    return ptx_code[findfirst(Regex("// -- Begin function .*\n"), ptx_code)[begin + length("// -- Begin function "):end-1]]
end

function get_compiled_kernel(cubin_path, name, kernel_tt)
    mdl = CuModule(read(cubin_path))
    func = CuFunction(mdl, name)
    return CUDA.HostKernel{typeof(kernel!), kernel_tt}(kernel!, func, CUDA.KernelState(CUDA.create_exceptions!(mdl), UInt32(0)))
end

function mycopy!(A, B, cache_dir="E:/Temp/kernel/") # Adjust default
    len = length(A)
    @assert len === length(B)
    threads = 512
    blocks = cld(len, threads)

    ptx_path = joinpath(cache_dir, "kernel.ptx")
    cubin_path = joinpath(cache_dir, "kernel.cubin")
    func_name_path = joinpath(cache_dir, "name.txt")

    if !ispath(ptx_path)
        # Compile to disk
        open(ptx_path, "w") do io
            @device_code_ptx io @cuda threads=threads blocks=blocks kernel!(A, B)
        end
        func_name = get_func_name_from_ptx(ptx_path)
        open(func_name_path, "w") do io
            write(io, func_name)
        end
        sm = CUDA.capability(CUDA.CuDevice(0))
        run(`ptxas -arch=sm_$(sm.major)$(sm.minor) --output-file $cubin_path $ptx_path`) # compile ptx to cubin
    end
    # kernel.ptx, kernel.cubin and name.txt should now exist

    func_name = read(func_name_path, String)
    kernel_args = map(CUDA.cudaconvert, (A, B)) # (cf. @cuda macro)
    kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
    kern = get_compiled_kernel(cubin_path, func_name, kernel_tt)
    kern(A, B, threads=threads, blocks=blocks)
end

@setup_workload begin
    A = CUDA.zeros(100, 100, 100)
    B = CUDA.zeros(100, 100, 100)
    @compile_workload begin
        CUDA.@sync mycopy!(A, B)
    end
end

end # module Test
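The name-extraction helper in the module above is marked as something that can be written better. One possible tidier variant, sketched here under the assumption that the PTX contains the `// -- Begin function <func_name>` marker mentioned in the code comment, uses a capture group instead of manual index arithmetic. `func_name_from_ptx_text` is a hypothetical name, and the sample string is made up purely to exercise the parser.

```julia
# Hypothetical alternative to `get_func_name_from_ptx`: operates on the PTX text
# and captures the first whitespace-free token after the marker.
function func_name_from_ptx_text(ptx_code::AbstractString)
    m = match(r"// -- Begin function (\S+)", ptx_code)
    m === nothing && error("no 'Begin function' marker found in PTX")
    return String(m.captures[1])
end

# Made-up fragment in the assumed marker format (not real compiler output).
sample = "// .globl demo_kernel // -- Begin function demo_kernel\n.visible .entry demo_kernel()\n"
name = func_name_from_ptx_text(sample)
```

Because `\S+` stops at any whitespace, this variant also tolerates `\r\n` line endings. The timings that follow were obtained with the original helper.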
REPL output:
(@v1.10) pkg> activate Test
Activating project at `(...)\Test`
(Test) pkg> ^C
julia> using Test
Precompiling Test
1 dependency successfully precompiled in 11 seconds. 69 already precompiled.
[ Info: Precompiling Test [98d22206-c062-46eb-91c7-b4da6428f19f]
julia> using CUDA
julia> A = CUDA.zeros(100, 100, 100); B = CUDA.ones(100, 100, 100);
julia> @time CUDA.@sync mycopy!(A, B)
0.001473 seconds (115 allocations: 1.007 MiB)
julia> @time CUDA.@sync mycopy!(A, B)
0.000652 seconds (115 allocations: 1.007 MiB)
julia> CUDA.@allowscalar A[1]
You could make this a bit more efficient on subsequent calls by keeping mdl and func_name in memory instead of rereading them from disk every time. But in the grand scheme of things, I assume the difference will be negligible.
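For illustration, keeping loaded objects in memory could look roughly like the sketch below. It is deliberately GPU-free: `KERNEL_CACHE` and `cached_load` are hypothetical names, and the zero-argument `load` function stands in for the expensive step (reading the cubin and building the kernel object).

```julia
# Hypothetical in-memory cache, keyed by the cubin path (or any other string key).
const KERNEL_CACHE = Dict{String,Any}()

# Return the cached value for `key`; call `load()` only on the first request.
function cached_load(load::Function, key::AbstractString)
    return get!(load, KERNEL_CACHE, String(key))
end

# Demonstration with a stand-in loader that counts how often it runs.
calls = Ref(0)
loader() = (calls[] += 1; "kernel-object")
first_result = cached_load(loader, "kernel.cubin")
second_result = cached_load(loader, "kernel.cubin")  # served from the cache
```

In `mycopy!`, the loader would presumably be a closure around `get_compiled_kernel(cubin_path, func_name, kernel_tt)`. A plain `Dict` is not thread-safe, and a loaded module is tied to the CUDA context it was created in, so a real implementation may need a lock and a key that includes the device or context.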