Hi folks,

I was really frustrated last week because it seems there isn’t a way to implement Gaussian processes with both CUDA acceleration and differentiation. Because of this, I had to roll my own CUDA kernels that did the job. Here is a link to the github repo.

The code is minimal enough to be self-contained in a single file and I think people would find it easy to extend and modify for their use-case until we get the mainstream libraries to do something about GPUs.

Here is a simple benchmark:

The error bars are the 80% empirical quantiles while `N`

is the number of data points.

The repo basically contains the routines needed to compute the likelihood of GPs as:

```
function gp_likelihood(
X_dev::CUDA.CuArray{<:Real,2},
y_dev::CUDA.CuArray{<:Real,1},
σ²::Real,
ϵ²::Real,
ℓ²_dev::CUDA.CuArray{<:Real,1},
)
n_data = size(X_dev, 2)
R = distance_matrix_gpu(X_dev, X_dev, ℓ²_dev)
K = matern52_gpu(R)
K_ϵ = eltype(K)(σ²) * K + eltype(K)(ϵ²) * I
K_chol = cholesky(K_ϵ; check = false)
if issuccess(K_chol)
L⁻¹y = K_chol.L \ y_dev
yᵀΣ⁻¹y = dot(L⁻¹y, L⁻¹y)
logdet = 2 * sum(log.(Array(diag(K_chol.U))))
(yᵀΣ⁻¹y + logdet + n_data * log(2 * π)) / -2
else
-Inf
end
end
```

For the covariance kernels, it contains the implementation of two types:

the squared-exponential and Matern 5/2 kernels with automatic relevance determination (ARD).

I literally put almost zero effort into optimizing the CUDA kernels, so feel free to suggest possible optimizations. The kernels don’t even use shared memory. But, so far, I’m quite impressed that such a zero effort attempt achieves quite acceptable performance. At least, on my hardware, this is pretty close to the best performance I can expect.

Hope someone finds this useful!