Hi,

I want to apply a given computation `f1D!` to every “row” (a 1D slice along the third dimension) of two 3D arrays `x` and `y`:

```
# Two 3D arrays
x = ones(Float32, 10, 10, 5)
y = ones(Float32, 10, 10, 5)

# A sequential (non-parallel) function: each element depends on the previous one
function f1D!(x1d, y1d)
    for k = 2:length(x1d)
        x1d[k] += y1d[k] + x1d[k-1]
    end
end

# Apply f1D! to every row (i, j)
s = size(x)
for i = 1:s[1]
    for j = 1:s[2]
        f1D!(@view(x[i, j, :]), @view(y[i, j, :]))
    end
end
```

Is there a way to express this efficiently with a broadcast or a `map`?
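To make the question concrete, here is a minimal CPU-only sketch of the map-like form I have in mind: a single `foreach` over all `(i, j)` index pairs, where each call works on its own row. The helper name `apply_rows!` is hypothetical (it is not from any package), and this runs on plain `Array`s only; I have not checked whether anything of this shape can be turned into a single GPU kernel.

```julia
# Two 3D arrays, as above
x = ones(Float32, 10, 10, 5)
y = ones(Float32, 10, 10, 5)

# Sequential per-row update, same logic as above
function f1D!(x1d, y1d)
    for k = 2:length(x1d)
        x1d[k] += y1d[k] + x1d[k-1]
    end
    return nothing
end

# Hypothetical helper (the name is mine): call f! once per (i, j) pair.
# Each call only touches row (i, j), so the calls are independent of each other.
function apply_rows!(f!, x, y)
    rows = CartesianIndices((size(x, 1), size(x, 2)))
    foreach(rows) do I
        f!(view(x, I[1], I[2], :), view(y, I[1], I[2], :))
    end
    return x
end

apply_rows!(f1D!, x, y)
```

This gives the same result as the nested loops; only the iteration over rows is written as one higher-order call instead of two explicit `for` loops.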

The goal is to apply `f1D!` to all the rows in parallel on a GPU with CuArrays, using a single kernel.