A question for someone who has experience writing/debugging GPU code.
Consider something like this:
```julia
using CuArrays, Flux

lstm = LSTM(5, 3) |> gpu
data = [rand(5) for i = 1:10]
data = gpu.(data)
out = lstm.(data)  # SIC!
```
I need to broadcast the `lstm` call in order to make use of the stateful properties of the RNN. Note that `data` is a vanilla `Vector` (not a `CuArray`); each element of `data` is a `CuArray`.
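For concreteness, my mental model (an assumption I'd like confirmed) is that broadcasting over a plain `Vector` is an ordinary element-wise map driven from the CPU, with each individual call running on the GPU because its inputs live in `CuArray`s:

```julia
using CuArrays, Flux

lstm = LSTM(5, 3) |> gpu
data = gpu.([rand(5) for i = 1:10])

# Assumption: lstm.(data) is roughly equivalent to this comprehension --
# a CPU-side loop over the Vector, where each lstm(x) call dispatches
# to GPU kernels since x is a CuArray.
out = [lstm(x) for x in data]
```

Is that roughly what happens, or does the broadcast machinery do something different here?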
How will this broadcast be handled by CUDAnative? Will the RNN indeed be executed on the GPU? And will this cause any unnecessary copying of data between the GPU and the CPU between invocations of the individual RNN cells?
Thanks in advance.