I’m currently working on the prediction of chaotic data, and I decided to see how well an RNN, specifically an LSTM, would do. I am fairly new to the topic of neural networks, but I have found a number of helpful resources along the way, including this forum.
One aspect I am still having trouble with is how to structure the layers of an RNN. By “structure” I mean the number of layers and their ordering. For instance, in this Julia Discourse post, we can see the following RNN model:
function LSTM_model(N, num_of_classes)
    model = Chain(LSTM(N, 200),
                  Dropout(0.2),
                  LSTM(200, 200),
                  Dropout(0.1),
                  Dense(200, 101),
                  Dropout(0.1),
                  Dense(101, num_of_classes))
    return model
end
The author adopted this structure for a model used in time-series prediction, the same application I’m working on. But what makes two LSTM layers better than one? Why is there a Dropout() after every layer? Why do the LSTM layers come before the Dense layers?
In the “Julia for Optimization and Learning” online course maintained by the Czech Technical University, there is a section (which might be a bit outdated) dedicated to training and storing a network. They create a model with Chain and describe each layer’s role:
m = Chain(
    Conv((2,2), 1=>16, relu),
    MaxPool((2,2)),
    Conv((2,2), 16=>8, relu),
    MaxPool((2,2)),
    flatten,
    Dense(288, size(y_train,1)),
    softmax,
)
- Two convolutional layers extract low-level features from the images.
- Two pooling layers reduce the size of the previous layer.
- One flatten layer converts multi-dimensional arrays into one-dimensional vectors.
- One dense layer is usually applied at the end of the chain.
- One softmax layer is usually the last one and results in probabilities.
So far, this has been the only resource I’ve found that gives some reasoning behind the choice of layers and their ordering.
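To see where the 288 in Dense(288, ...) comes from, I did a quick shape check. The 28×28 grayscale input size is my own assumption (the usual MNIST size), but it is consistent with that 288:

using Flux

x = rand(Float32, 28, 28, 1, 1)            # one 28×28 grayscale image, WHCN layout

c1 = Conv((2,2), 1=>16, relu)(x)           # (27, 27, 16, 1)
p1 = MaxPool((2,2))(c1)                    # (13, 13, 16, 1)
c2 = Conv((2,2), 16=>8, relu)(p1)          # (12, 12, 8, 1)
p2 = MaxPool((2,2))(c2)                    # (6, 6, 8, 1)
size(Flux.flatten(p2))                     # (288, 1), hence Dense(288, ...)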
In my case, I have been trying different architectures with different numbers of hidden layers and a mix of LSTM() and Dense() layers, but the results have been lacking. Granted, the data I am working with behave chaotically (as if generated by a Lévy process). Still, I was hoping that, with nonlinear activation functions, the model would capture features that “simpler” models, like Recursive Least Squares, would miss. So far, though, the average RMSE over a series of sequential predictions has been nearly the same for the LSTM model and for Recursive Least Squares.
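For concreteness, one of the variants I tried looks roughly like the sketch below. The layer sizes, the predict_next helper, and the Flux 0.13/0.14-style stateful recurrence are my own placeholders, not a fixed recipe:

using Flux

# One variant I tried: scalar series, one-step-ahead prediction.
# Layer sizes (32, 16) are placeholders; the "stateful" recurrence
# style of Flux 0.13/0.14 is assumed (same style as the quoted model).
model = Chain(LSTM(1, 32),
              Dropout(0.2),
              LSTM(32, 16),
              Dense(16, 1))

# Hypothetical helper: feed a window of past values one step at a time
# and return the prediction for the next step.
function predict_next(model, history::Vector{Float32})
    Flux.reset!(model)                      # clear the hidden state between sequences
    ŷ = 0f0
    for xt in history
        ŷ = model(reshape([xt], 1, 1))[1]   # input is (features, batch) = (1, 1)
    end
    return ŷ
end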
Are there guidelines or rules of thumb for composing the hidden layers (which ones, their ordering, how many) of an RNN when the goal is time-series prediction?