How can I make this piece of code faster: by optimising it or by using multiple threads?

Sending data back and forth between the CPU and the GPU is costly. Why not perform the sum on the GPU?
Also, if you’re going to run things on the GPU, I would guess you need to keep the entire algorithm there, so that intermediate results never have to travel back to the CPU.
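
To put a rough number on that, here is a back-of-the-envelope sketch in plain Python. It only counts the bytes that would cross the CPU/GPU boundary; it does not touch a GPU or measure anything. The array length, the step count, the 4-byte element size and both function names are made up for illustration, since I don't know the sizes in your code.

```python
def bytes_moved_host_sum(n_elements, n_steps, itemsize=4):
    """Bytes copied GPU -> CPU if the whole array is pulled back
    every step so that the sum can be done on the CPU."""
    return n_elements * itemsize * n_steps


def bytes_moved_device_sum(n_steps, itemsize=4):
    """Bytes copied GPU -> CPU if the sum runs on the GPU and only
    the scalar result is pulled back every step."""
    return itemsize * n_steps


# Hypothetical sizes, chosen only to make the arithmetic concrete.
n_elements = 1_000_000  # assumed array length (4-byte floats)
n_steps = 100           # assumed number of iterations

host = bytes_moved_host_sum(n_elements, n_steps)
device = bytes_moved_device_sum(n_steps)

print(f"sum on CPU: {host} bytes transferred")
print(f"sum on GPU: {device} bytes transferred")
print(f"ratio:      {host // device}x")
```

With these assumed sizes the ratio is simply the array length, which is why doing the reduction on the GPU (and ideally the steps around it too) is usually worth trying before reaching for more CPU threads. Actual timings also depend on transfer latency and kernel launch overhead, which this sketch ignores.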