I am using KernelAbstractions.jl to write a machine learning package that is device-agnostic. At one point in my algorithm, I end up with an NxNxn tensor on the device, let's call it T. I need to create a new vector, let's call it L, that contains the lower-triangular component of the Cholesky decomposition of each slice i: L[i] = cholesky(T[:,:,i]).L.data. Is there a way to implement this in KernelAbstractions.jl that doesn't involve me implementing my own Cholesky kernel? For clarity, here is a sequential example of the full problem:
using LinearAlgebra   # for cholesky and tr

logp = Vector{Float32}(undef, n)
for k in 1:n
    L = cholesky(T[:, :, k])
    a = L.U \ (L.L \ y)    # solve T[:,:,k] * a = y via the two triangular factors
    logp[k] = -0.5 * y' * a - tr(L.L) - N/2 * log(2pi)
end
return logp
Thank you!
We have a dense cholesky method for NVIDIA and AMD arrays. It should work if you just rely on the multiple dispatch of cholesky and ldiv!.
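For example, here is a minimal sketch of what the dispatch-based approach could look like with CUDA.jl; T_d and y_d are placeholders for device-resident copies of the T and y from the question, not library names, and the per-slice arithmetic is left as a comment:

using CUDA, LinearAlgebra

logp = Vector{Float32}(undef, n)
for k in 1:n
    A = T_d[:, :, k]              # slicing a CuArray yields a new device array
    F = cholesky(Hermitian(A))    # should dispatch to the CUSOLVER-backed factorization
    a = F \ y_d                   # the triangular solves also run on the device
    # ... accumulate logp[k] from y_d and a as in the sequential example
end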
Btw, last time I checked, L.L creates a copy. You can avoid this with UpperTriangular(L.factors)' (if L.uplo == 'U') or LowerTriangular(L.factors) (if L.uplo == 'L').
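For instance, a sketch reusing the F from the snippet above:

F = cholesky(Hermitian(A))
# F.L allocates a copy of the factor; these wrappers reuse F.factors in place:
L_view = F.uplo == 'U' ? UpperTriangular(F.factors)' : LowerTriangular(F.factors)

Both branches give a lower-triangular view of the buffer that cholesky already allocated, so no extra array is copied.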
Sorry, I should have made my question a little more clear. I have already been running my code on the GPU using multiple dispatch; I'm just concerned about the latency of spinning up n kernels. In my case, n is on the order of 10,000, while N is only on the order of 100, so I'm worried that launching the kernels is creating a lot of overhead. I guess the better question would have been this:
Is there some way to fuse the kernels for all n cholesky calls? The KernelAbstractions.jl docs just have an empty section on fusing kernels, so I was wondering if there is an undocumented way to go about this.
Thanks for your help!
This is a good suggestion, thanks! I'll check the performance improvement from this.