I’m trying to work with a dataset that has missing data using CUDA.jl. I’d like to skip the missing values in some statistics (mean, covariance, etc.). Is there a general solution to this that performs well on a GPU? I have written custom kernels to do this, but I think I may be missing something and there could be a cleaner way to do it.
I’d like to be able to write something short like:
using CUDA
using Statistics

a = rand(1000, 1000)               # random data
a[a .> 0.5] .= NaN                 # simulate missing data with NaN, since there is no missing support
a = cu(a)
mapslices(x -> mean(filter(!isnan, x)), a, dims=1)  # average of the non-missing data in each column
This works but is extremely slow on the GPU. Everything I have tried is either very slow or won’t compile for the GPU. Is there a reasonably fast way to do this without custom kernels?
I don’t have a GPU to test this right now, but reduce and mapreduce should be fast on CUDA arrays, so things like
reduce(a, dims=1, init=0f0) do acc, val   # a custom operator needs an explicit neutral element (init) on the GPU
    isnan(val) ? acc : val + acc
end
should be an efficient way to filter NaNs out of a sum. For more complex statistics, it’s worth checking whether they’re easy to get using, say, Transducers or OnlineStats. In principle both packages should work with reduce and thus be GPU compatible (I haven’t checked, though).
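For the mean in your original example, here is a sketch of the same idea (again untested on a GPU here; the variable names are just illustrative): take a NaN-filtered sum and divide by the count of non-NaN entries per column, both computed with mapreduce.

using CUDA

a = CUDA.rand(1000, 1000)
a = ifelse.(a .> 0.5f0, NaN32, a)   # simulate missing data without scalar indexing

nansum = mapreduce(x -> isnan(x) ? 0f0 : x, +, a, dims=1)    # per-column sum, ignoring NaN
nancnt = mapreduce(x -> isnan(x) ? 0f0 : 1f0, +, a, dims=1)  # per-column count of non-NaN values
colmean = nansum ./ nancnt          # NaN-ignoring column means (NaN where a column is all-NaN)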
You should set CUDA.allowscalar(false), which will probably reveal that mapslices has no GPU implementation for CuArray and is falling back to slow scalar indexing. Generally, missing isn’t supported either, since CuArray doesn’t support Union element types.
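Roughly like this (just to illustrate the effect):

using CUDA
CUDA.allowscalar(false)   # turn silent element-by-element (scalar) fallbacks into errors

a = CUDA.rand(1000, 1000)
a[1, 1]                   # now throws a scalar-indexing error instead of running slowly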