Dear all,

I have written a simple Euler stepping algorithm for a particular problem which runs on the GPU. The vector `y_current` is big (~200_000 elements) and it is updated like `y_current .= y_current .+ dt * RHS`. A working example can be found here. It was done in ArrayFire. I also have a working example in CLArrays, but it is 10x slower.

I can’t save every step in a matrix because I need a lot of steps. Instead, I want to record `sum(y_current)` at every step. However, including a save mechanism makes the whole thing 50 times slower, which defeats the purpose.

Hence, I am doing something like

```
for ii = 1:10000
    updateEuler!(y_current, dt)
    # save data; a is owned by the GPU
    a[ii] = sum(y_current)
end
```

Since the save step is not written as an array operation, this is indeed slow.
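To make the pattern concrete, here is a minimal CPU-only sketch of the same loop, with plain `Array`s standing in for the GPU arrays. The right-hand side `rhs` (a simple decay), the wrapper `run_euler!`, and the step count are hypothetical placeholders for illustration, not my actual problem.

```julia
# Hypothetical right-hand side for illustration only: linear decay, dy/dt = -y.
rhs(y) = -y

# One explicit Euler step, updating y in place with a fused broadcast.
function updateEuler!(y, dt)
    y .= y .+ dt .* rhs.(y)
    return y
end

# Run nsteps Euler steps and record sum(y) after each step.
function run_euler!(y, dt, nsteps)
    a = zeros(eltype(y), nsteps)
    for ii in 1:nsteps
        updateEuler!(y, dt)
        a[ii] = sum(y)   # the save step that slows things down in the GPU version
    end
    return a
end

y_current = ones(Float64, 200_000)
a = run_euler!(y_current, 0.01, 100)
```

On the CPU the save step is cheap; the slowdown only shows up once `y_current` and `a` live on the GPU.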

Does anyone know a trick to save the sum value while staying on the GPU? Maybe one has to write a specific kernel…
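One idea I have only sketched on the CPU is to use Base's in-place reduction `sum!` and write into a one-element view of `a`, so that no scalar is returned to the host and assigned by index. Whether ArrayFire or CLArrays support `sum!` and `view` on device arrays is an assumption I have not verified; the doubling step below is just a stand-in for `updateEuler!`.

```julia
y = fill(0.5, 8)      # small stand-in for y_current
a = zeros(4)          # one slot per step

for ii in 1:4
    y .*= 2                     # stand-in for updateEuler!(y_current, dt)
    sum!(view(a, ii:ii), y)     # reduce y into a[ii] in place, no scalar indexing
end
```

With plain `Array`s this fills `a` with the per-step sums; the open question is whether the same call stays entirely on the device.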

Thank you for your help and suggestions,

Best regards,