# Most efficient way to find the Cholesky decomposition of slices of a 3D array in KernelAbstractions

I am using KernelAbstractions.jl to write a machine learning package that is device-agnostic. At one point in my algorithm, I end up with an N×N×n tensor, let's call it `T`, on the device. I need to create a new vector, let's call it `L`, whose element `i` contains the lower-triangular factor of the Cholesky decomposition of slice `i`: `L[i] = cholesky(T[:,:,i]).L.data`. Is there a way to implement this in KernelAbstractions.jl that doesn't involve implementing my own Cholesky kernel? For clarity, here is a sequential example of the full problem:

```julia
using LinearAlgebra

logp = Vector{Float32}(undef, n)
for k in 1:n
    L = cholesky(T[:, :, k])
    a = L.U \ (L.L \ y)
    # -0.5 * logdet(T[:,:,k]) == -sum(log, diag(L.L))
    logp[k] = -0.5 * y' * a - sum(log, diag(L.L)) - N / 2 * log(2pi)
end
return logp
```

Thank you!

We have a dense `cholesky` method for NVIDIA and AMD arrays.
It should just work if you rely on multiple dispatch for `cholesky` and `ldiv!`.
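To illustrate the dispatch approach: the loop can be written once against generic arrays, and the same code then runs on a `CuArray`/`ROCArray` because `cholesky` and the triangular solves dispatch on the array type. A minimal CPU-runnable sketch (the function name `slice_logp` is my own, and the log-likelihood terms follow the usual Gaussian formula):

```julia
using LinearAlgebra

# Device-agnostic: `cholesky` and `\` dispatch on the concrete array type,
# so passing GPU slices routes to the vendor's dense solver.
function slice_logp(T::AbstractArray{F,3}, y::AbstractVector{F}) where {F}
    N, _, n = size(T)
    logp = Vector{F}(undef, n)
    for k in 1:n
        C = cholesky(Symmetric(T[:, :, k]))
        a = C.U \ (C.L \ y)
        # -sum(log, diag(C.L)) is -0.5 * logdet of the slice
        logp[k] = -F(0.5) * dot(y, a) - sum(log, diag(C.L)) - N / 2 * log(F(2pi))
    end
    return logp
end
```

Note this still launches a handful of kernels per slice on a GPU; it only avoids writing any factorization code by hand.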

Btw, last time I checked, `L.L` creates a copy. You can avoid this with `UpperTriangular(L.factors)'` (if `L.uplo == 'U'`) or `LowerTriangular(L.factors)` (if `L.uplo == 'L'`).
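A quick CPU check of this (the same trick applies to device arrays): wrapping `F.factors` reinterprets the existing buffer, whereas `F.L` copies when the stored factor is `'U'`.

```julia
using LinearAlgebra

A = [4.0 2.0; 2.0 3.0]
F = cholesky(A)   # dense cholesky stores the 'U' factor by default

# No-copy access to the lower factor: wrap the existing buffer directly.
L_view = F.uplo == 'L' ? LowerTriangular(F.factors) : UpperTriangular(F.factors)'

Matrix(L_view) == Matrix(F.L)   # same values, but L_view aliases F.factors
```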


Sorry, I should have made my question a little more clear. I have already been running my code on GPU using multiple dispatch, I’m just concerned about the latency from spinning up n kernels. In my case, n is on the order of 10,000, while N is only on the order of 100, so I’m worried that launching the kernels is creating a lot of overhead. I guess the better question would have been this:

Is there some way to fuse the kernels for all n Cholesky calls? The KernelAbstractions.jl docs just have an empty section on fusing kernels, so I was wondering if there was some undocumented way to go about this.
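For slices this small (N ≈ 100) relative to the batch count (n ≈ 10,000), one way to fuse everything into a single launch is one work-item per slice, with each work-item running an unblocked factorization and the two triangular solves sequentially. I can't vouch for the exact KernelAbstractions boilerplate here, so below is a plain-Julia sketch of the per-slice body each work-item would execute (inside a `@kernel` you would take `k = @index(Global)` and index into the 3D array; the name `logp_slice!` is hypothetical, and the `similar` scratch buffers would need to become private/scratch memory in a real kernel):

```julia
using LinearAlgebra

# Per-slice body: unblocked Cholesky-Crout overwriting the lower triangle
# of A with L, then forward/back substitution to get a = A⁻¹ y and the
# Gaussian log-likelihood terms, all without library calls.
function logp_slice!(A::AbstractMatrix, y::AbstractVector)
    N = size(A, 1)
    for j in 1:N
        for p in 1:j-1
            A[j, j] -= A[j, p]^2
        end
        A[j, j] = sqrt(A[j, j])
        for i in j+1:N
            for p in 1:j-1
                A[i, j] -= A[i, p] * A[j, p]
            end
            A[i, j] /= A[j, j]
        end
    end
    z = similar(y)                      # forward solve: L z = y
    for i in 1:N
        s = y[i]
        for p in 1:i-1
            s -= A[i, p] * z[p]
        end
        z[i] = s / A[i, i]
    end
    a = similar(y)                      # back solve: L' a = z
    quad = zero(eltype(y))
    logdetL = zero(eltype(y))
    for i in N:-1:1
        s = z[i]
        for p in i+1:N
            s -= A[p, i] * a[p]
        end
        a[i] = s / A[i, i]
        quad += y[i] * a[i]
        logdetL += log(A[i, i])
    end
    return -0.5 * quad - logdetL - N / 2 * log(2pi)
end
```

It's also worth checking whether your GPU backend exposes a batched dense Cholesky (CUSOLVER ships a `potrfBatched` routine), which would avoid hand-writing a kernel entirely.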

Thanks for your help!

This is a good suggestion, thanks! I’ll check the performance improvement on this