Hi, everyone.

I have a question about using LSTM and other RNN cells in Flux to predict data sequences. I'm playing around with Kaggle competition data (Google Brain - Ventilator Pressure Prediction | Kaggle). The data are time sequences: given 80 time steps of time, inspiratory volume, and some other variables, the task is to predict the pressure at each time step, i.e. to learn a sequence-to-sequence mapping. I'm currently using an input sequence of 2 features × 80 data points to predict an output sequence of size (1, 80).

Since the data is sequential, I thought about using RNNs, namely LSTMs. However, the results I get are terrible, and I would appreciate any insight into what I might be doing wrong.

I am comparing the results with two models:

- A very simple point network: treat each column of the input matrix as a data point and feed it to a simple neural network, e.g. `model = Chain(Dense(in, h), Dense(h, h, relu), Dense(h, out))`, which predicts a single pressure value.
- A simple vector network: reshape the matrix into a single vector (as with MNIST, where the 28×28 matrix is reshaped into a 784×1 vector) and train it to predict a pressure vector of the length of the sequence (80, 1).
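
For concreteness, the two baselines can be sketched as follows (a rough sketch only; `in_dim`, `h`, `out_dim` and the hard-coded 80 are illustrative placeholders, not the exact values used):

```julia
using Flux  # sketch assuming a Flux v0.12-era API

# Hypothetical sizes for illustration.
in_dim, h, out_dim = 2, 32, 1

# Point network: each column of the 2×80 input is an independent sample,
# mapped from 2 features to a single pressure value.
point_model = Chain(Dense(in_dim, h), Dense(h, h, relu), Dense(h, out_dim))

# Vector network: flatten the whole 2×80 matrix into a 160-element vector
# and predict the full 80-step pressure sequence in one shot.
vec_model = Chain(Dense(2 * 80, h), Dense(h, h, relu), Dense(h, 80))
```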

The first model quickly achieves a score of approximately 4.5; the second gets as low as 1.0.

And now to the LSTM. I have inputs as matrices: X is of shape (2, n), where n varies, and the output is a vector of shape (1, n). I created the model as `model = Chain(LSTM(idim, hdim), LSTM(hdim, hdim), Dense(hdim, odim))` with hdim = 32. So far I have only achieved a score of 4.3, and the loss is no longer converging.
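
For context on the shapes involved: Flux's recurrent layers advance their hidden state by one time step per call, with the second array dimension acting as the batch dimension, so a sequence is commonly passed as a vector of per-step matrices. A minimal sketch (random data, single sample, same layer sizes as above):

```julia
using Flux

idim, hdim, odim = 2, 32, 1
model = Chain(LSTM(idim, hdim), LSTM(hdim, hdim), Dense(hdim, odim))

# A length-80 sequence for one sample: a vector of 2×1 matrices,
# one matrix per time step (second dimension = batch).
seq = [rand(Float32, idim, 1) for _ in 1:80]

# Broadcasting calls the model once per time step, advancing the
# hidden state each call and yielding one 1×1 prediction per step.
preds = model.(seq)

Flux.reset!(model)  # clear the hidden state before the next sequence
```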

I have a loss function of the form

```
# Keep only the time steps where the indicator row b equals zero.
function loss(_X, _P)
    b = _X[3, :]                   # indicator row of the input matrix
    X = _X[[1, 2, 4, 5], b .== 0]  # feature rows at the selected steps
    P = _P[:, b .== 0]             # matching pressure targets
    pred = model(X)                # forward pass over the whole matrix
    Flux.reset!(model)             # clear the hidden state afterwards
    Flux.mae(pred, P)
end
```

where b is an indicator vector marking which part of the sequence I use for training. The training is done simply with the `ADAM()` optimizer as `Flux.train!(loss, Flux.params(model), zip(batch...), opt)`, using minibatches of size 64.
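
For completeness, here is the same training step written as an explicit gradient loop on toy data. This is a sketch with made-up dimensions and random data, and it feeds the sequence one time step per call rather than as a single matrix, with the state reset done outside the loss:

```julia
using Flux
using Statistics: mean

# Toy stand-ins for the real data: 8 random sequences of length 80.
idim, hdim, odim, T = 2, 32, 1, 80
model = Chain(LSTM(idim, hdim), LSTM(hdim, hdim), Dense(hdim, odim))
make_pair() = ([rand(Float32, idim, 1) for _ in 1:T],   # inputs per step
               [rand(Float32, odim, 1) for _ in 1:T])   # targets per step
data = [make_pair() for _ in 1:8]

# MAE averaged over the sequence, one model call per time step.
seq_loss(xs, ys) = mean(Flux.mae(model(x), y) for (x, y) in zip(xs, ys))

opt = ADAM()
θ = Flux.params(model)
for (xs, ys) in data
    Flux.reset!(model)                        # fresh hidden state per sequence
    gs = gradient(() -> seq_loss(xs, ys), θ)
    Flux.Optimise.update!(opt, θ, gs)
end
```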

I am not sure what I might be doing wrong, because it does not make sense to me that an LSTM network would perform so much worse than a simple multilayer perceptron. Any ideas? Do I need more layers or more neurons? Or am I feeding in the data wrong? Thanks for any help.