I am using KernelAbstractions.jl to write a machine learning package that is device-agnostic. At one point in my algorithm I end up with an N×N×n tensor on the device, let's call it `T`. I need to create a new vector, let's call it `L`, whose element `i` contains the lower-triangular factor of the Cholesky decomposition of slice `i`: `L[i] = cholesky(T[:,:,i]).L.data`. Is there a way to implement this in `KernelAbstractions.jl` that doesn't involve implementing my own Cholesky kernel? For clarity, here is a sequential example of the full problem:

```
using LinearAlgebra

logp = Vector{Float32}(undef, n)
for k in 1:n
    # Factor the k-th N×N slice
    L = cholesky(T[:, :, k])
    # Two triangular solves: a = (L.L * L.U) \ y
    a = L.U \ (L.L \ y)
    logp[k] = -0.5 * y' * a - tr(L.L) - N / 2 * log(2pi)
end
return logp
```

Thank you!

We have a dense method `cholesky` for NVIDIA and AMD arrays. It should work if you just rely on the multiple dispatch of `cholesky` and `ldiv!`.

Btw, last time I checked `L.L` creates a copy. You can avoid this with `UpperTriangular(L.factors)'` (if `L.uplo == 'U'`) or `LowerTriangular(L.factors)` (if `L.uplo == 'L'`).
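A minimal CPU sketch of that copy-free access pattern, using only `LinearAlgebra` (nothing GPU-specific assumed):

```julia
using LinearAlgebra

A = [4.0 2.0; 2.0 3.0]   # a small SPD matrix
F = cholesky(A)          # stores the factor in F.factors; uplo == 'U' by default

# Wrap the existing buffer instead of accessing F.L, which may allocate a copy:
L = F.uplo == 'U' ? UpperTriangular(F.factors)' : LowerTriangular(F.factors)

L * L' ≈ A               # the wrapped lower factor reconstructs A
```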


Sorry, I should have made my question a little more clear. I have already been running my code on GPU using multiple dispatch, I’m just concerned about the latency from spinning up n kernels. In my case, n is on the order of 10,000, while N is only on the order of 100, so I’m worried that launching the kernels is creating a lot of overhead. I guess the better question would have been this:

Is there some way to fuse the kernels for all n Cholesky calls? The KernelAbstractions.jl docs have an empty section on fusing kernels, so I was wondering if there is an undocumented way to go about this.
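For reference, this is the kind of single-launch batched call I'm hoping for. This sketch assumes CUDA.jl's cuSOLVER wrappers expose `potrfBatched!` (name and signature from memory, untested here; please verify against the CUDA.jl docs):

```julia
using CUDA, LinearAlgebra

# T is the N×N×n Float32 tensor already on the device.
# Build one view per slice; potrfBatched! then factors all slices
# in one batched cuSOLVER call instead of n separate kernel launches.
As = [view(T, :, :, k) for k in 1:n]
CUDA.CUSOLVER.potrfBatched!('L', As)
# In-place: the lower triangle of each T[:,:,k] now holds its factor.
```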

Thanks for your help!

This is a good suggestion, thanks! I’ll check the performance improvement on this