Say you have a training set of three observation-label pairs for a standard fully connected NN. To parallelize the computation of the loss (and its gradient) across the three pairs, you can feed the observations to the NN as a single matrix instead of feeding the three observation vectors one after another. Like so:
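using Flux, CUDA  # `gpu` and `cu` need a working CUDA setup
net = Chain(Dense(10, 5, relu), Dense(5, 1, relu)) |> gpu
x = cu(rand(10, 3))  # each column of that matrix is one observation
y = cu(rand(1, 3))   # each element is one label
output = net(x)      # 1×3 output, one prediction per column
loss = Flux.mse(output, y)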
And that’s it: the GPU matrix multiplication takes care of the parallelization. However, as I understand it, an RNN changes its internal state depending on the previous inputs to the network. That means that if I input three different observations (of potentially different sizes) and want to parallelize the computation of the gradients, the states of three different networks have to be kept in memory. Is that not a significant drawback of RNNs? Does Flux implement a parallel forward pass for RNNs, or do I have to write a GPU kernel or something similar?
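To make the concern concrete, here is a minimal sketch in plain Julia (CPU only, no Flux) of a hand-written Elman-style recurrent cell. `ToyCell` and `cell_step` are hypothetical names made up for illustration, not Flux API, and the sketch assumes all three sequences have the same length. Under those assumptions the "three states" amount to one `(hidden, batch)` matrix with one column per sequence, and the batched pass gives the same result as running each sequence on its own:

```julia
# Hypothetical hand-written Elman-style cell; names are illustrative, not Flux API.
struct ToyCell
    Wx::Matrix{Float32}  # input-to-hidden weights, (hidden, features)
    Wh::Matrix{Float32}  # hidden-to-hidden weights, (hidden, hidden)
    b::Vector{Float32}   # bias, (hidden,)
end

# One time step. Works for a single sequence (x and h are vectors)
# and for a batch (x is (features, batch), h is (hidden, batch)).
cell_step(c::ToyCell, h, x) = tanh.(c.Wx * x .+ c.Wh * h .+ c.b)

features, hidden, batch, steps = 10, 5, 3, 4
cell = ToyCell(randn(Float32, hidden, features),
               randn(Float32, hidden, hidden),
               zeros(Float32, hidden))

# One matrix per time step, one sequence per column.
# Assumption: all sequences have the same length.
xs = [rand(Float32, features, batch) for _ in 1:steps]

# Batched pass: a single (hidden, batch) matrix carries all three states.
h_batched = foldl((h, x) -> cell_step(cell, h, x), xs;
                  init = zeros(Float32, hidden, batch))

# Sequential pass: run each sequence separately, then stack the final states.
h_single = reduce(hcat, [
    foldl((h, x) -> cell_step(cell, h, x[:, j]), xs; init = zeros(Float32, hidden))
    for j in 1:batch
])

@assert size(h_batched) == (hidden, batch)
@assert h_batched ≈ h_single
```

Sequences of different lengths are not covered by this sketch; they would need padding or masking before they can share one matrix per time step.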