Most efficient way to find the Cholesky decomposition of slices of a 3D array in KernelAbstractions

I am using KernelAbstractions.jl to write a machine learning package that is device-agnostic. At one point in my algorithm, I end up with an N×N×n tensor, let’s call it T, on the device. I need to create a new vector, let’s call it L, where L[i] holds the lower-triangular factor of the Cholesky decomposition of slice i: L[i] = cholesky(T[:,:,i]).L.data. Is there a way to implement this in KernelAbstractions.jl that doesn’t involve writing my own Cholesky kernel? For clarity, here is a sequential example of the full problem:

using LinearAlgebra

# T is the N×N×n tensor and y a length-N vector (both given)
logp = Vector{Float32}(undef, n)
for k in 1:n
    L = cholesky(T[:, :, k])      # factorize the k-th N×N slice
    a = L.U \ (L.L \ y)           # two triangular solves against the factors
    logp[k] = -0.5 * y' * a - tr(L.L) - N / 2 * log(2pi)
end
return logp

Thank you!

We have a dense cholesky method for NVIDIA and AMD arrays.
It should work if you just rely on multiple dispatch of cholesky and ldiv!.
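
For reference, here is a minimal sketch of that dispatch-based approach for a single slice, assuming CUDA.jl (the NVIDIA case; AMDGPU.jl should look analogous). The names Tk and yk and the test data are made up for illustration; the backslash solves lower to ldiv! under the hood:

using CUDA, LinearAlgebra

N  = 100
A  = rand(Float32, N, N)
Tk = CuArray(A * A' + N * I)     # one positive-definite N×N slice on the device
yk = CuArray(rand(Float32, N))

F = cholesky(Tk)                 # dispatches to the dense GPU Cholesky
a = F.U \ (F.L \ yk)             # triangular solves also dispatch to GPU methods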

Btw, last time I checked, L.L creates a copy. You can avoid this with UpperTriangular(L.factors)' (if L.uplo == 'U') or LowerTriangular(L.factors) (if L.uplo == 'L').
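
A small sketch of that trick (again assuming CUDA.jl; Tk, yk, and Lt are made-up names as in the sketch above), wrapping L.factors instead of materializing L.L and L.U:

using CUDA, LinearAlgebra

A  = rand(Float32, 100, 100)
Tk = CuArray(A * A' + 100 * I)   # positive-definite device matrix, as before
yk = CuArray(rand(Float32, 100))

L  = cholesky(Tk)
Lt = L.uplo == 'U' ? UpperTriangular(L.factors)' : LowerTriangular(L.factors)
a  = Lt' \ (Lt \ yk)             # same solves as L.U \ (L.L \ yk), without the copies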


Sorry, I should have made my question a little clearer. I have already been running my code on the GPU using multiple dispatch; I’m just concerned about the latency from launching n separate kernels. In my case, n is on the order of 10,000 while N is only on the order of 100, so I’m worried that the kernel launches are creating a lot of overhead. I guess the better question would have been this:

Is there some way to fuse the kernels for all n cholesky calls? The KernelAbstractions.jl docs have only an empty section on fusing kernels, so I was wondering whether there is an undocumented way to go about this.
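
To make the launch-overhead concern concrete, here is a sketch of the per-slice loop as it currently stands, assuming CUDA.jl; T_d, y_d, and the test-data construction are placeholders built from the sizes mentioned above. Every iteration issues its own factorization and solve launches, so with n around 10,000 the launch count adds up quickly:

using CUDA, LinearAlgebra

N, n = 100, 10_000                     # sizes from the discussion above
T_h = Array{Float32}(undef, N, N, n)
for k in 1:n                           # build positive-definite slices on the host
    A = rand(Float32, N, N)
    T_h[:, :, k] = A * A' + N * I
end
T_d = CuArray(T_h)                     # device-resident N×N×n tensor
y_d = CUDA.rand(Float32, N)

CUDA.@time for k in 1:n                # each iteration launches its own small kernels
    F = cholesky(T_d[:, :, k])
    a = F.U \ (F.L \ y_d)
    # ... accumulate logp[k] from a and F as in the sequential example above
end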

Thanks for your help!

This is a good suggestion, thanks! I’ll check the performance improvement from this.