Why do functions from the SpecialFunctions package work on CUDA arrays?

Hi. I am confused about the behavior of functions from the SpecialFunctions package on CUDA arrays. Specifically, I don't understand why the code below works and doesn't throw an error at the last line.


using CUDA
using SpecialFunctions
x = rand(Float32, 64, 64, 25)
x_gpu = cu(x)
z = loggamma.(x_gpu)

EDIT: I realized after writing this post that CUDA comes with its own loggamma implementation, so what I wrote below is inaccurate: the implementation details of SpecialFunctions.loggamma don’t matter. See the next post for more details. I’m leaving the remainder of this post unchanged, so just imagine that we’re discussing some generic Julia function, and then read the next post to learn how this simplifies in the case of loggamma.


The first thing to realize is that many functions in SpecialFunctions.jl, including loggamma, are implemented in Julia and do not call out to a compiled C library. You can see the guts of loggamma(::Float32) here: SpecialFunctions.jl/src/logabsgamma/e_lgammaf_r.jl at v2.5.1 · JuliaMath/SpecialFunctions.jl · GitHub.

Another important detail is that loggamma does not allocate any intermediate arrays. It’s just a sequence of arithmetic and binary operations applied to plain numbers.
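If you want to convince yourself of this, a quick check at the REPL looks something like the following (just a sketch; the exact method that @which reports depends on your SpecialFunctions version):

using SpecialFunctions
using InteractiveUtils  # for @which (loaded automatically in the REPL)

loggamma(3.5f0)             # an ordinary scalar call on a plain Float32
@which loggamma(3.5f0)      # points at a Julia method, not a ccall wrapper
@allocated loggamma(3.5f0)  # expect 0 after the first call (compilation itself allocates)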

So your code works for exactly the same reason that the following works:

using CUDA
x = rand(Float32, 64, 64, 25)
x_gpu = cu(x)
f(x) = log(2cosh(x))
z = f.(x_gpu)

You’re just using a slightly more complicated f.


The obvious follow-up question is, well, so why does that work? I can give you a rough idea, but will have to defer to the experts for details.

  • Think about what broadcasting does on a regular CPU array. Essentially, z_cpu = loggamma.(x) would be translated into a loop like the following:
    function loggamma_broadcast!(output, input)
        for i in eachindex(output, input)
            output[i] = loggamma(input[i])
        end
    end
    z_cpu = similar(x)
    loggamma_broadcast!(z_cpu, x)
    z_cpu
    
  • GPU arrays have their own implementation of the broadcasting machinery, which writes GPU kernels to execute the broadcasted operations. Your example z = loggamma.(x_gpu) is essentially translated into something like the following (don’t worry about the details here; the point is that this also looks like a regular Julia function that calls loggamma on elements of input):
    function loggamma_broadcast_cuda!(output, input)
        index = threadIdx().x
        stride = blockDim().x
        for i in index:stride:length(output)
            @inbounds output[i] = loggamma(input[i])
        end
        return nothing
    end
    z = similar(x_gpu)
    @cuda threads=256 loggamma_broadcast_cuda!(z, x_gpu)
    z
    
    To read more about how to write and launch CUDA kernels in Julia, see Introduction · CUDA.jl
  • The magic of CUDA.jl (with the help of GPUCompiler.jl) is that regular Julia code such as this, which calls other regular Julia functions such as loggamma, can be compiled to CUDA kernels that run on the GPU. Someone else will have to explain the details of how that works. (I’ve seen terms like method overlay tables thrown around, and I know that LLVM has a backend for emitting Nvidia PTX assembly: https://llvm.org/docs/NVPTXUsage.html.)
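If you're curious what CUDA.jl actually compiles here, you can dump the device code with its reflection macros. A minimal sketch using @device_code_llvm and @device_code_ptx, which print the code of every kernel compiled while the wrapped expression runs:

using CUDA, SpecialFunctions

x_gpu = CUDA.rand(Float32, 64, 64, 25)

CUDA.@device_code_llvm loggamma.(x_gpu)  # LLVM IR for the broadcast kernel
CUDA.@device_code_ptx loggamma.(x_gpu)   # the corresponding PTX assembly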

P.S.: In reality, GPU array broadcasting uses KernelAbstractions.jl to produce device-agnostic kernels that work on many different GPUs, not just Nvidia ones. KernelAbstractions.jl then produces CUDA.jl kernel code like the above. I skipped this extra layer of magic to show more directly how loggamma is used in the code that CUDA.jl compiles to the GPU. However, the KernelAbstractions.jl version is much simpler; it looks something like this:

using KernelAbstractions  # provides @kernel and @index
using SpecialFunctions

@kernel function loggamma_broadcast_KA!(output, input)
    I = @index(Global)
    @inbounds output[I] = loggamma(input[I])
end
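For completeness, launching that kernel by hand would look roughly like this (a sketch assuming the KernelAbstractions.jl v0.9-style API; broadcasting does the equivalent for you under the hood):

using CUDA  # KernelAbstractions and SpecialFunctions are imported above

x_gpu = CUDA.rand(Float32, 64, 64, 25)
z = similar(x_gpu)

backend = get_backend(z)                    # CUDABackend() for a CuArray
kernel! = loggamma_broadcast_KA!(backend)   # instantiate the kernel for this backend
kernel!(z, x_gpu; ndrange = length(z))      # one work-item per element
KernelAbstractions.synchronize(backend)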

Oops! In the case of loggamma, what I said here is actually beside the point. When compiling for the GPU, CUDA.jl obviously needs to replace Julia functions with GPU-native functions at some point, if not before then at least at the level of hardware instructions like + and *. But Nvidia provides its own library of math functions for their GPUs, including things like sin and exp and, you guessed it, loggamma. So when CUDA.jl is compiling loggamma_broadcast_cuda! and encounters SpecialFunctions.loggamma, it simply replaces that call with the GPU-native loggamma. It never looks at the implementation of SpecialFunctions.loggamma.

This is an example of the method overlay table in action. You can see how the override for SpecialFunctions.loggamma is registered here: CUDA.jl/ext/SpecialFunctionsExt.jl at v5.7.3 · JuliaGPU/CUDA.jl · GitHub
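The registration follows the same pattern CUDA.jl uses for its other libdevice math overrides. Roughly (a hand-written sketch, not a verbatim copy of the linked file):

using CUDA: @device_override
using SpecialFunctions: SpecialFunctions

# Only applies when compiling device code: calls to SpecialFunctions.loggamma
# are redirected to libdevice's __nv_lgammaf/__nv_lgamma via the overlay table.
@device_override SpecialFunctions.loggamma(x::Float32) =
    ccall("extern __nv_lgammaf", llvmcall, Cfloat, (Cfloat,), x)
@device_override SpecialFunctions.loggamma(x::Float64) =
    ccall("extern __nv_lgamma", llvmcall, Cdouble, (Cdouble,), x)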

In the example I provided for comparison, with f(x) = log(2cosh(x)), there are overrides for log and cosh, but obviously not for the function f that I defined myself, so this is actually a better example of CUDA.jl being able to compile GPU kernels from Julia functions that call other Julia functions.


Halfway through the first comment, I was about to say: wasn't this an issue when SpecialFunctions deprecated lgamma in favor of loggamma and CUDA.jl didn't know to use the CUDA C intrinsics for it, so loggamma failed to compile? But looking at it more closely, it failed because lgamma/loggamma was ccall-ing compiled routines for Float32/Float64/BigFloat and converting other Reals to and from those types. Currently, loggamma for Float32/Float64 uses pure-Julia ports of OpenLibm's algorithms, so I wonder whether CUDA.jl could compile those if it weren't using the CUDA C intrinsics.


Indeed, it can:

julia> using CUDA

julia> using IrrationalConstants:
           twoπ,
           halfπ,
           sqrtπ,
           sqrt2π,
           invπ,
           inv2π,
           invsqrt2,
           invsqrt2π,
           logtwo,
           logπ,
           log2π

julia> function logabsgamma(x::Float32)
           # [...]
           # implementation copied from https://github.com/JuliaMath/SpecialFunctions.jl/blob/v2.5.1/src/logabsgamma/e_lgammaf_r.jl
           # [...]
       end
logabsgamma (generic function with 1 method)

julia> function loggamma(x::Float32)
           # just skipping the f(x::Real) = f(float(x)) wrappers
           # otherwise identical to SpecialFunctions.loggamma
           (y, s) = logabsgamma(x)
           s < 0 && throw(DomainError(x, "`gamma(x)` must be non-negative"))
           return y
       end
loggamma (generic function with 1 method)

julia> x = rand(Float32, 64, 64, 25);

julia> x_gpu = cu(x);

julia> z = loggamma.(x_gpu)  # success!
64×64×25 CuArray{Float32, 3, CUDA.DeviceMemory}:
[...]

More surprisingly, it’s only about 10% slower than the intrinsic version (tested on an NVIDIA A100):

julia> using BenchmarkTools

julia> @btime CUDA.@sync loggamma.($x_gpu);
  40.833 μs (98 allocations: 3.41 KiB)

julia> import SpecialFunctions

julia> @btime CUDA.@sync SpecialFunctions.loggamma.($x_gpu);  # overridden in CUDA.jl
  36.625 μs (98 allocations: 3.41 KiB)

The SpecialFunctions implementation doesn’t look very GPU-friendly with all its branches, but maybe it isn’t possible to do all that much better.


Those branches are fairly necessary; you can’t cover the domain with a single implementation.

CUDA C’s lgamma and lgammaf have maximum errors of 4 and 6 ULPs respectively (actually more inside some intervals), while SpecialFunctions’ loggamma was tested to usually max out at about 1.5 ULPs (2.5 for a minuscule fraction of positive inputs, though the underlying logabsgamma can get up to 3.6e6 ULPs for a small fraction of negative inputs). However, I’m not confident that lgamma and lgammaf are the same as CUDA’s libdevice functions __nv_lgamma and __nv_lgammaf, which CUDA.jl uses in SpecialFunctionsExt.
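For reference, ULP figures like these come from comparing against a higher-precision reference. A rough sketch of such a check (my own helper name and a cruder methodology than the tests mentioned above):

using SpecialFunctions

# Error of a Float32 result measured in units of the Float32 spacing (ULPs)
# at the Float64 reference value.
ulps(y32::Float32, yref::Float64) = abs(Float64(y32) - yref) / eps(Float32(yref))

xs = 2f0 .+ 10f0 .* rand(Float32, 100_000)  # positive inputs away from the pole at 0
maximum(ulps(loggamma(x), loggamma(Float64(x))) for x in xs)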


Yes, those are from libdevice.

If the additional precision is important, I’d be happy to take a PR that switches from calling libdevice to a native Julia implementation. We generally strive to be compatible with Julia code on the CPU, so that can even come at a slight performance cost (we could always keep the libdevice calls as a _fast version for use with @fastmath, if SpecialFunctions.jl supports that).
