Sum is very slow (and I can't figure out why)

Thanks, that makes sense…
…but then my measured times are about the same as what I get on the CPU without CUDA (roughly 7 ms), and I thought the CUDA version would be much faster :(