CUDA.jl with missing data?

I don’t have a GPU to test this right now, but reduce and mapreduce should be fast on CUDA arrays, so things like

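# Replace each NaN with zero, then sum. A parallel reduction may combine
# partial results in any order, so the operator itself should stay a plain `+`.
mapreduce(x -> isnan(x) ? zero(x) : x, +, a; dims=1)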

should be an efficient way to skip NaNs in a sum. For more complex statistics, it’s worth checking whether they are easy to get with, say, Transducers or OnlineStats. In principle both packages should work with reduce and thus be GPU compatible (I haven’t checked, though).
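As a rough sketch of how the same idea extends beyond a sum, here is a NaN-aware mean built from two mapreduce calls. The helper names nansum, nancount and nanmean are made up for this post and do not come from any package. I only ran the logic on a regular Array; the assumption is that a CuArray would take the same code, since it uses nothing but mapreduce with `+` and broadcasting.

```julia
# Hypothetical helpers (names invented for illustration).
# Sum along `dims`, treating NaN as zero.
nansum(a; dims) = mapreduce(x -> isnan(x) ? zero(x) : x, +, a; dims=dims)

# Count the non-NaN entries along `dims`.
nancount(a; dims) = mapreduce(x -> isnan(x) ? 0 : 1, +, a; dims=dims)

# Mean over the non-NaN entries; a slice containing only NaNs gives 0/0 = NaN.
nanmean(a; dims) = nansum(a; dims=dims) ./ nancount(a; dims=dims)

a = [1.0 NaN; 3.0 4.0; NaN 6.0]

nansum(a; dims=1)   # 1×2 matrix [4.0 10.0]
nanmean(a; dims=1)  # 1×2 matrix [2.0 5.0]
```

Keeping the NaN check in the mapped function rather than in the reduction operator means the operator stays associative and commutative, which is what a parallel reduction needs.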