Using LSTM with Conv layers

I was wondering what the best practice is for using an LSTM (or any RNN) together with convolutional layers in Flux and, more generally, how to combine recurrent networks with other blocks.

Say I have this simple problem where I want to fit an LSTM to some random data:

using Flux

Nt = 100       # time steps
Nin,Nout = 5,3 # input size, output size
Nh = 28        # hidden dim
lstm = Chain(LSTM(Nin,Nh),Dense(Nh,Nout)) # simple lstm

# generate some fake data
X,Y = [randn(Float32,Nin,Nt) for i=1:10],[randn(Float32,Nout,Nt) for i=1:10] 

data = Flux.Data.DataLoader(X, Y, batchsize=2)
# loss uses broadcasting 
loss(x, y) = sum(Flux.Losses.mse.(lstm.(x), y))
ps = Flux.params(lstm)

Flux.train!(loss,ps,data,ADAM())

From what I’ve understood, this is the best practice when working with sequences: you should use broadcasting when plugging the data into the model. Please let me know if this is not the case.

I want to get a better model by applying a 1D convolution before feeding the data to the LSTM.
I’ve managed to do it like this:

cnn_lstm = Chain(
    # Conv needs a tensor whose last dim is the batch size, so transpose
    # the (Nin, Nt) matrix and add a trailing batch dimension of size 1
    y -> Flux.unsqueeze(y',3),
    Conv((3,),Nin=>Nh,pad=1),
    y -> y[:,:]', # remove last dim and transpose back to (Nh, Nt)
    LSTM(Nh,Nh),
    Dense(Nh,Nout))
ps = Flux.params(cnn_lstm) # parameters of the new model, not of lstm
loss(x, y) = sum(Flux.Losses.mse.(cnn_lstm.(x), y))
Flux.train!(loss,ps,data,ADAM())

However, I’d imagine this can be done in a much better and more efficient way.
In particular, I suppose feeding Conv a tensor of size (Nt,Nin,num_batches) is recommended.
But would I then need to reshape its output into an array of arrays to feed it to the LSTM?
Also, LSTM does not seem to accept 3D arrays…
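
To make the reshaping question concrete, here is a minimal sketch in plain Julia (no Flux calls) of what I mean by turning a 3D array into an array of arrays. It assumes the Conv output is laid out as (Nt, Nh, num_batches); conv_out and Nb are made-up names standing in for that output and the batch size, and I am not sure this is the layout or the approach Flux intends.

```julia
# Hypothetical stand-in for a Conv output, laid out as (time, channels, batch).
Nt, Nh, Nb = 100, 28, 2
conv_out = randn(Float32, Nt, Nh, Nb)

# Slice along the time dimension: one (channels, batch) matrix per time step.
seq = [conv_out[t, :, :] for t in 1:Nt]

# Reverse direction: stack the per-step matrices along a new third dimension,
# giving (channels, batch, time), then permute back to (time, channels, batch).
restacked = permutedims(cat(seq...; dims=3), (3, 1, 2))
```

Each element of seq is then an Nh×Nb matrix, which is the kind of 2D input I would expect to pass to the LSTM one time step at a time.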

Thanks!