I am looking for best practices for loading the CUDA.jl package and for the first call to methods such as CUDA.randn from this package.
I am running CUDA.jl on a cluster node, and I am not sure I am using the package correctly. When I run the following code, I see a long recompilation time, even though the package was precompiled when it was installed.
```julia
@time using CUDA
# 3.247396 seconds (9.32 M allocations: 628.931 MiB, 3.77% gc time, 14.07% compilation time: 59% of which was recompilation)

for i in 1:5
    @time x = CUDA.randn(1000, 10);
end
# 14.033213 seconds (26.59 M allocations: 1.350 GiB, 3.78% gc time, 38.45% compilation time)
#  0.000127 seconds (57 allocations: 2.578 KiB)
#  0.000069 seconds (57 allocations: 2.578 KiB)
#  0.000027 seconds (57 allocations: 2.578 KiB)
#  0.000021 seconds (57 allocations: 2.578 KiB)
```