CUDA.jl kernel is half as fast as c++ Kernel

nsajko · September 26, 2022, 2:14am

Yes. I've now looked into what CUDA.jl is doing, and it seems they actually call libdevice, NVIDIA’s LLVM bitcode library. They do this for all of sincos, sin and cos. Libdevice does its own range reduction and everything else, so most of what I said doesn’t apply.
This, however, doesn’t explain the “references to double in the IR that aren’t there when using sin and cos”. That’s weird and interesting.

NB: this is the Base.sincos(::Float32) code for CUDA.jl:

https://github.com/JuliaGPU/CUDA.jl/blob/79d84687a7ecb70a65b04e0bfb452a7bdf0360cd/src/device/intrinsics/math.jl#L37-L42

    @device_override function Base.sincos(x::Float32)
        s = Ref{Cfloat}()
        c = Ref{Cfloat}()
        ccall("extern __nv_sincosf", llvmcall, Cvoid, (Cfloat, Ptr{Cfloat}, Ptr{Cfloat}), x, s, c)
        return (s[], c[])
    end
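
If anyone wants to check the "references to double" part themselves, something like the following might help. It is only a rough sketch: the two kernels and their names are made up for illustration (they are not from the original post), and it assumes a working CUDA GPU with CUDA.jl installed. It compiles each kernel without launching it and prints the device LLVM IR, so the two dumps can be searched for `double`.

```julia
using CUDA

# Hypothetical kernel (illustrative name) that uses the fused sincos call.
function kernel_sincos!(out_s, out_c, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        @inbounds begin
            s, c = sincos(xs[i])
            out_s[i] = s
            out_c[i] = c
        end
    end
    return nothing
end

# Hypothetical kernel (illustrative name) that calls sin and cos separately.
function kernel_sin_cos!(out_s, out_c, xs)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(xs)
        @inbounds begin
            out_s[i] = sin(xs[i])
            out_c[i] = cos(xs[i])
        end
    end
    return nothing
end

# Only attempt compilation when a usable GPU is present.
if CUDA.functional()
    xs = CUDA.rand(Float32, 1024)
    out_s = similar(xs)
    out_c = similar(xs)

    # launch=false compiles the kernel without running it;
    # @device_code_llvm prints the LLVM IR generated during that compilation.
    println("---- sincos ----")
    CUDA.@device_code_llvm @cuda launch=false kernel_sincos!(out_s, out_c, xs)

    println("---- sin + cos ----")
    CUDA.@device_code_llvm @cuda launch=false kernel_sin_cos!(out_s, out_c, xs)
end
```

As far as I know, passing `dump_module=true` to `@device_code_llvm` prints the whole module rather than just the kernel function, which may be needed to see whatever libdevice code ended up in there. I haven't verified that this reproduces the difference reported above.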
