Are there guidelines or rules of thumb on how to stack hidden layers in an RNN?

I’m currently working on the prediction of chaotic data and I have decided to see how well an RNN, namely an LSTM, would do. I am fairly new to the topic of neural networks, but I have found a spate of helpful resources along the way, including this forum.

One aspect I am still having trouble with is how to structure the layers of an RNN. By “structure” I mean the number of layers and their ordering. For instance, in this Julia Discourse post, we can see the following RNN model:

function LSTM_model(N, num_of_classes)
	model = Chain(LSTM(N, 200),
	              # … (the remaining layers are cut off in the quoted snippet)
	              )
	return model
end

The author used this structure for time-series prediction, the same application I’m working on. But what makes it better to have two LSTM layers instead of one? Why is there a Dropout() after every layer? Why put the LSTM layers before the Dense layers?

On the “Julia for Optimization and Learning” online course maintained by the Czech Technical University, there is a section (that might be a bit outdated) dedicated to training and storing a network. They create a model with Chain and describe each layer’s role:

m = Chain(
    Conv((2,2), 1=>16, relu), MaxPool((2,2)),
    Conv((2,2), 16=>8, relu), MaxPool((2,2)),
    flatten,
    Dense(288, size(y_train,1)),
    softmax,
)

  • Two convolutional layers extract low-level features from the images.
  • Two pooling layers reduce the size of the previous layer.
  • One flatten layer converts multi-dimensional arrays into one-dimensional vectors.
  • One dense layer is usually applied at the end of the chain.
  • One softmax layer is usually the last one and results in probabilities.
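
The 288 on the dense layer can be checked with a little shape arithmetic. The sketch below assumes 28×28 single-channel input images, no padding, stride-1 convolutions, and a 2×2 pooling window after each convolution; none of these are stated in the post, so treat them as assumptions. `conv_out` and `pool_out` are hypothetical helpers written for this illustration, not Flux functions.

```julia
# Spatial size after a convolution with kernel k, no padding, stride 1.
conv_out(n, k) = n - k + 1

# Spatial size after pooling with window p and stride p (integer division).
pool_out(n, p) = n ÷ p

# Assumed input: 28×28 images with one channel (not stated in the post).
n = 28
n = conv_out(n, 2)   # after Conv((2,2), 1=>16): 27
n = pool_out(n, 2)   # after the first 2×2 pooling: 13
n = conv_out(n, 2)   # after Conv((2,2), 16=>8): 12
n = pool_out(n, 2)   # after the second 2×2 pooling: 6

channels = 8                  # output channels of the second convolution
flat_len = n * n * channels   # length of the flattened vector fed to Dense
println(flat_len)             # 288
```

Under those assumptions the flattened vector has 6 × 6 × 8 = 288 entries, which matches the first argument of the dense layer.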

So far, this has been the only resource I’ve found that gives some reasoning for its choice of layers and their ordering.

In my case, I have been trying different architectures with different numbers of hidden layers and different mixes of LSTM() and Dense(), but results have been lacking. Granted, the data I am working with behave chaotically (as if from a Lévy process). But I was hoping that, with nonlinear activation functions, the model would capture features that “simpler” models, like recursive least squares (RLS), wouldn’t. So far, the average RMSE over a series of sequential predictions has been nearly the same for the LSTM model and for RLS.
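
For readers who want to reproduce this kind of comparison, here is a minimal sketch of a one-step-ahead RLS baseline with an RMSE score. It is not the code used in the post: the autoregressive order `p`, the forgetting factor `lambda`, the initial covariance scale `delta`, and the function names `rls_one_step` and `rmse` are all assumptions made for illustration, and the sine wave is only a stand-in for real data.

```julia
using LinearAlgebra

# One-step-ahead prediction with recursive least squares on an autoregressive
# model of order p: y[t] ≈ w' * [y[t-1], ..., y[t-p]].
# lambda is the forgetting factor; delta scales the initial covariance.
function rls_one_step(y::AbstractVector{<:Real}; p::Int=3, lambda::Float64=0.99, delta::Float64=100.0)
    w = zeros(p)                  # AR coefficients, start at zero
    P = Matrix(delta * I, p, p)   # estimate of the inverse correlation matrix
    preds = Float64[]
    for t in (p + 1):length(y)
        x = y[(t - 1):-1:(t - p)]                  # last p values, newest first
        yhat = dot(w, x)                           # predict before seeing y[t]
        push!(preds, yhat)
        k = (P * x) / (lambda + dot(x, P * x))     # gain vector
        w += k * (y[t] - yhat)                     # correct with the prediction error
        P = (P - k * (x' * P)) / lambda            # covariance update
    end
    return preds
end

# Root-mean-square error between predictions and targets.
rmse(yhat, y) = sqrt(sum(abs2, yhat .- y) / length(y))

# Toy usage on a noise-free sine wave (a stand-in, not chaotic data).
y = [sin(0.3t) for t in 1:200]
preds = rls_one_step(y; p=2)
targets = y[3:end]
println("RLS RMSE:         ", rmse(preds, targets))
println("Persistence RMSE: ", rmse(y[2:end-1], targets))
```

Scoring a trivial persistence forecast (predict the previous value) alongside RLS gives a second reference point; if an LSTM cannot beat either, the problem may lie in the data or the setup rather than in the layer stack.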

Are there guidelines or rules of thumb on how to compose the hidden layers (which ones, their ordering, how many) in an RNN when the goal is to make time-series predictions?

My honest answer is that you make educated guesses based on similar work and your problem, and then try lots of different things and see what performs best. This is part of the “dark magic” of deep learning: many of the essential details are not mentioned in papers, because the true story of the journey to the final architecture is usually messy and is at odds with the perfect mathematical narrative that ML papers like to portray.

Some general tips are worth bearing in mind: start from an existing model that solves a problem similar to yours (search arXiv and GitHub); first overfit a small training set (no dropout, shallow network); then regularise to improve validation set performance (this is the point to add layers); and resist the temptation to iterate on the test set.
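
A rough sketch of how the data could be partitioned to follow these tips on a time series. The split fractions, the helper name `chrono_split`, and the subset size of 20 are arbitrary choices for illustration, and the series is a placeholder.

```julia
# Split a series chronologically so that validation and test data come strictly
# after the training data (no shuffling, so the future never leaks into training).
function chrono_split(y::AbstractVector; train_frac=0.7, val_frac=0.15)
    n = length(y)
    n_train = round(Int, train_frac * n)
    n_val = round(Int, val_frac * n)
    train = y[1:n_train]
    val = y[(n_train + 1):(n_train + n_val)]
    test = y[(n_train + n_val + 1):end]
    return train, val, test
end

y = collect(1.0:100.0)   # placeholder series
train, val, test = chrono_split(y)

# Tiny subset to overfit first: if the model cannot drive the error on this
# toward zero, something is wrong before any regularisation is added.
small_train = train[1:min(20, end)]

println((length(train), length(val), length(test)))
```

The idea is to tune architecture and regularisation against `val` only and to score `test` once at the end.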


This is really good advice. The truth is indeed that our theoretical understanding of neural networks is way behind the experimental success.

Hi, @DoktorMike. Thanks for your input. Would you have a ready example of a paper to back that up? I’m working on a response to a few collaborators, and I’m trying to gather as much evidence as I can; a paper would help.

That’s a very broad question, so I think people would be hard-pressed to offer quick and specific examples as answers. The notion that theory severely lags experimentation is more of a general view that most people in the field agree with, but not many dig into deeply. One classic talk you could use as a jumping-off point for further research is Ali Rahimi’s 2017 talk about modern machine learning being like alchemy (his NIPS 2017 Test of Time award talk).

I’d say start simple, with only one dense layer about the size of your input layer. Then see what happens if you reduce the size to 75%, 50%, etc. Then add another dense layer, and so on. After that, start playing with the more complex architectures.

That is historically what people did to come up with the current architectures, and you presumably have the advantage of being able to try out new things much faster with modern hardware and software.
