I’m trying to work with a dataset with missing data using CUDA.jl. I’d like to skip the missing data when computing some statistics (mean, covariance, etc.). Is there any general solution to this that performs well on a GPU? I have written custom kernels to do this, but I think I may be missing something and there could be a cleaner way to do it.
I’d like to be able to write something short like:
a = rand(1000,1000) #random data
a[a .> 0.5] .= NaN # simulate missing data with NaN, since missing isn't supported
a = cu(a)
mapslices(x -> mean(filter(!isnan, x)), a, dims=1) # average non-missing data in each column
This works but is extremely slow on the GPU. Everything I have tried is either very slow or won’t compile for the GPU. Is there a reasonably fast way to do this without custom kernels?
I don’t have a GPU to test this right now, but
mapreduce should be fast on CUDA arrays, so things like
reduce(a, dims=1) do acc, val
    isnan(val) ? acc : val + acc
end
should be an efficient way to filter NaNs out in a sum. For more complex statistics, it’s worth checking if it’s easy to get them using say Transducers or OnlineStats. In principle both packages should work with
reduce and thus be GPU compatible (haven’t checked though).
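One caveat worth flagging: GPU reductions assume the operator is associative, and the do-block form only filters a NaN when it appears as the second argument, so results can depend on how partial reductions are combined. A safer pattern (a sketch, assuming CUDA.jl is installed; the variable names here are mine) is to map each element to a neutral value first and then reduce with plain `+`:

```julia
using CUDA

a = cu(rand(Float32, 1000, 1000))
a .= ifelse.(a .> 0.5, NaN32, a)   # simulate missing data with NaN

# Sum the non-NaN entries in each column by mapping NaN to zero,
# and count the non-NaN entries the same way; both reductions use
# the associative operator `+`, which GPU reductions require.
s = mapreduce(x -> isnan(x) ? 0f0 : x,  +, a; dims=1)
n = mapreduce(x -> isnan(x) ? 0f0 : 1f0, +, a; dims=1)

colmeans = s ./ n   # NaN-skipping mean of each column
```

Other statistics follow the same shape, e.g. a NaN-skipping variance from the per-column sums of `x` and `x^2`.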
Try running with CUDA.allowscalar(false), which will probably reveal that
mapslices isn’t available for CuArray. Generally,
missing isn’t supported either since
CuArray doesn’t support Union eltypes.
What @piever mentions should work though.
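To make those slow fallbacks visible, you can disable scalar indexing up front; a minimal sketch, assuming a recent CUDA.jl:

```julia
using CUDA
CUDA.allowscalar(false)        # turn silent scalar fallbacks into errors

a = cu(rand(Float32, 100, 100))
sum(a; dims=1)                 # fine: runs as a proper GPU reduction
# a[1, 1]                      # would now throw, exposing a scalar code path
```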
Thanks, I knew missing wasn’t supported. I’ll turn that switch on; I suspected scalar fallback was why it was slow but never checked.
Thanks this does work and gives me a good starting point. I did need to add an init to your example.
I’m familiar with OnlineStats, I’ll try that with reduce. I’ll also look into Transducers, I’m not familiar with that one.