Flux RNN on a GPU - unnecessary copying

Hello,

A question for someone who has experience writing/debugging GPU code.

Consider something like this:

```julia
using CuArrays, Flux

lstm = LSTM(5, 3) |> gpu         # move the layer's parameters to the GPU
data = [rand(5) for i = 1:10]    # a plain Vector of ten CPU vectors
data = gpu.(data)                # each element becomes a CuArray
out  = lstm.(data)               # sic: broadcasting the stateful call over the sequence
```

I need to broadcast the lstm call over the sequence in order to make use of the stateful properties of the RNN: `data` is a vanilla `Vector` (not a `CuArray`), while each of its elements is a `CuArray`.
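
For clarity, my mental model is that the broadcast is just shorthand for walking the sequence in order while the layer carries its hidden state between calls; a rough sketch of what I assume it amounts to:

```julia
Flux.reset!(lstm)                # start from the layer's initial hidden state
out = [lstm(x) for x in data]    # each x is already a CuArray; the state is carried over between calls
```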

How will this broadcast be handled by Flux and CUDAnative? Will the RNN indeed be executed on the GPU? Will this cause any unnecessary copying of data from the GPU to the CPU and back between invocations of the individual RNN cells?
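
For what it's worth, the only check I have done so far is to look at the output types and to forbid scalar indexing, which (if I understand the CuArrays API correctly) should at least catch silent fallbacks to generic CPU-style loops:

```julia
CuArrays.allowscalar(false)   # error instead of silently doing slow scalar indexing on GPU arrays
out = lstm.(data)
println(typeof(first(out)))   # I expect something wrapping a CuArray
```

That still doesn't tell me whether anything gets copied back and forth between the individual calls, hence the question.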

Thanks in advance.