A minimal CUDA-compatible and differentiable GP implementation

Hi folks,

I was really frustrated last week because there didn’t seem to be a way to implement Gaussian processes with both CUDA acceleration and automatic differentiation. Because of this, I had to roll my own CUDA kernels that do the job. Here is a link to the GitHub repo.

The code is minimal enough to be self-contained in a single file, and I think people will find it easy to extend and modify for their use case until the mainstream libraries add GPU support.

Here is a simple benchmark:

[benchmark plot]

The error bars are the 80% empirical quantiles while N is the number of data points.

The repo basically contains the routines needed to compute the log-likelihood of a GP:

using CUDA
using LinearAlgebra

function gp_likelihood(
    X_dev::CUDA.CuArray{<:Real,2},
    y_dev::CUDA.CuArray{<:Real,1},
    σ²::Real,
    ϵ²::Real,
    ℓ²_dev::CUDA.CuArray{<:Real,1},
)
    n_data = size(X_dev, 2)
    # Pairwise distances with ARD lengthscales, then the Matern 5/2 kernel applied elementwise
    R      = distance_matrix_gpu(X_dev, X_dev, ℓ²_dev)
    K      = matern52_gpu(R)
    K_ϵ    = eltype(K)(σ²) * K + eltype(K)(ϵ²) * I
    K_chol = cholesky(K_ϵ; check = false)

    if issuccess(K_chol)
        # log N(y; 0, K_ϵ) computed from the Cholesky factor
        L⁻¹y = K_chol.L \ y_dev
        yᵀΣ⁻¹y = dot(L⁻¹y, L⁻¹y)
        logdet = 2 * sum(log.(Array(diag(K_chol.U))))
        (yᵀΣ⁻¹y + logdet + n_data * log(2 * π)) / -2
    else
        # Cholesky failed: the covariance is not positive definite
        -Inf
    end
end
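
Something like the following should work as a usage sketch (illustrative only, not code from the repo): the data is laid out d × n to match the size(X_dev, 2) convention above, and the gradient call assumes the custom Zygote rules handle the GPU Cholesky.

using Zygote   # for gradients of the hyperparameters

X_dev  = CUDA.rand(Float32, 5, 1024)   # 5 features × 1024 data points
y_dev  = CUDA.rand(Float32, 1024)
ℓ²_dev = CUDA.ones(Float32, 5)         # ARD squared lengthscales

logp = gp_likelihood(X_dev, y_dev, 1.0f0, 1.0f-4, ℓ²_dev)

# Gradients with respect to the scalar hyperparameters
∇σ², ∇ϵ² = Zygote.gradient(
    (σ², ϵ²) -> gp_likelihood(X_dev, y_dev, σ², ϵ², ℓ²_dev), 1.0f0, 1.0f-4)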

For the covariance kernels, it contains implementations of two types:

$$
\begin{aligned}
k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 \, k_{\text{SE ARD}}\left(\mathbf{x}, \mathbf{y}\right) + \epsilon^2, \\
k\left(\mathbf{x}, \mathbf{y}\right) &= \sigma^2 \, k_{\text{Matern 5/2 ARD}}\left(\mathbf{x}, \mathbf{y}\right) + \epsilon^2,
\end{aligned}
$$

the squared-exponential and Matern 5/2 kernels with automatic relevance determination (ARD).
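
For reference, the scalar forms these correspond to are the standard ones; as a sketch (not necessarily the exact code in the repo), with r the ARD-scaled Euclidean distance between x and y:

# Standard scalar kernel functions, applied elementwise to the distance r
k_se(r)       = exp(-r^2 / 2)
k_matern52(r) = (1 + sqrt(5) * r + 5 * r^2 / 3) * exp(-sqrt(5) * r)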

I put almost zero effort into optimizing the CUDA kernels, so feel free to suggest possible optimizations; they don’t even use shared memory. So far, though, I’m quite impressed that such a zero-effort attempt achieves quite acceptable performance. At least on my hardware, this is pretty close to the best performance I can expect.

Hope someone finds this useful!

14 Likes

This is cool! Thanks for working through this example. I think everyone involved in the JuliaGPs org recognises this as a problem; we’ve just not had the time to sort it out yet. I created a MWE JuliaGPs version of this a while ago, but never had the bandwidth to do anything more with it: https://gist.github.com/willtebbutt/cc268e614d2b73156f5bc24e7408613d

Fortunately, it looks like we need very little extra code in order to make this work. I think you’re correct that the main thing we’re missing is GPU-compatible, Zygote-differentiable kernels. AbstractGPs needs a couple of small tweaks, but they would be really easy to do. If you’re keen to see this kind of thing happen, it would be really quite straightforward to extend KernelFunctions + AbstractGPs to support GPU computation – we would be very open to you opening an issue on AbstractGPs.jl to discuss how to move forward.

Re your blog post: the whole should-we-use-Tullio discussion is not central here really, and I think it’s a bit of a distraction – it would have been nice if it had worked out and solved all of our problems, but it doesn’t look like it will on its own. All we really need to do is ensure that we have GPU-friendly distance calculations (we define our own pairwise function, which usually dispatches to Distances.pairwise but doesn’t have to, so we have the freedom to provide our own GPU-friendly code), and fix up any other code we’ve written that doesn’t play nicely with CUDA.
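
For example, a GPU-friendly pairwise squared Euclidean distance can be written with plain linear algebra and broadcasting so no scalar indexing hits the CuArrays (just a sketch of the general idea, not the actual pairwise implementation):

using CUDA, LinearAlgebra

# X is d × n and Y is d × m (columns are data points); returns an n × m matrix
function pairwise_sqeuclidean(X::CuMatrix, Y::CuMatrix)
    xx = sum(abs2, X; dims = 1)          # 1 × n row of squared norms
    yy = sum(abs2, Y; dims = 1)          # 1 × m row of squared norms
    max.(xx' .+ yy .- 2 .* (X' * Y), 0)  # clamp small negatives from round-off
end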

edit: the other roadblock is CI. I don’t think anyone in the JuliaGPs org presently has experience getting GPU-enabled CI up and running. Again, very doable, just needs someone to do it.

2 Likes

See GitHub - JuliaGPU/buildkite and contact a Buildkite admin on Slack to set things up.

1 Like

I went through the buildkite setup process for OptimalTransport and TuringTutorials :wink:

1 Like

I stand corrected!

Hi, thanks for your interest. I didn’t know about your MWE. It’s pretty cool, especially since it’s really minimal!

Would specializing pairwise support all of the kernel combinations offered by KernelFunctions.jl though? If that’s the case, I would be happy to contribute to AbstractGPs.jl.

I believe it would get us most of the way. For example, all of our subtypes of SimpleKernel have the same basic structure for the kernelmatrix function:

  1. compute the distance between all pairs of inputs,
  2. apply a Real -> Real function to each element of the distance matrix.

The first step is the bit that requires specialisation, while the second ought to be fairly straightforward to handle because it’s just broadcasting a scalar function. Our various composite kernels (sums of kernels, scalings of kernels, pre-composition by another function, etc.) should then be relatively straightforward to get working.

I’m sure there will be a couple of rough edges, but I believe the above to be broadly true.
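
To make that concrete, a rough sketch of the two-step structure (not the actual kernelmatrix code), assuming some GPU-friendly pairwise distance function like the one sketched above:

# Step 1 produces a matrix D of pairwise (squared) distances between inputs;
# step 2 is an elementwise Real -> Real map, which broadcasts fine on CuArrays.
k_se(d²) = exp(-d² / 2)                # scalar squared-exponential kernel
kernelmatrix_sketch(κ, D) = κ.(D)

# e.g. K = kernelmatrix_sketch(k_se, pairwise_sqeuclidean(X, Y))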

One way that we could get started would be to get a single kernel working on the GPU + get CI set up in KernelFunctions, then we can set up GPU CI in AbstractGPs. Once we’ve got this initial pipeline working, it should be easy to build towards complete GPU coverage.

2 Likes

That does indeed seem feasible. But do you envision the overloads going into AbstractGPs.jl? I wonder whether it would make more sense to have them in KernelFunctions.jl.

I would want to put them wherever the current functionality exists. So, for example, the distances stuff would go in KernelFunctions, while anything that needed to be changed in logpdf or rand would live in AbstractGPs.

Updates on the issues around GPs:

  • The performance bugs with CUDA.jl’s Triangular matrices have been handled by this PR.
  • The compiler error when differentiating through the Cholesky factorization with CuArrays is currently being worked on; for Zygote see this PR, and for ChainRules.jl see this issue.
2 Likes

I have been meaning to comment on this thread. Really nice work, and I also like the intro about the state of the art in your blog post. If we manage to make the changes to integrate this functionality into the JuliaGPs ecosystem, that would be a large leap forward.

Have you done any experiments with variational inference/non-Gaussian likelihoods?

1 Like

Hi, yes, VI and non-conjugate likelihoods were actually the use cases I had in mind (I was writing a VI paper using the same code). However, the more advanced stuff like sparse variational GPs is a different story. But it seems the ApproximateGPs folks are currently working on GPU support, so hopefully we’ll get that running on GPUs too.