Hello, I’m trying to figure out what’s wrong with my Flux LSTM model, which is a chain of LSTM, Dense, and softmax. The Dense gradients are fine, but all the LSTM gradients are NaN, so the gradients are not propagating back through the LSTM layer. I don’t get this problem if I replace LSTM with RNN. I’m new to both LSTMs and Flux.jl. I find this strange because the LSTM feeds directly into the Dense layer, whose gradients are fine.

Here’s the code to reproduce the issue:

```
using Flux
using BSON: @load

# Load the input sequences X and targets Y
@load "lstmdata.bson" X Y

m = Chain(LSTM(15, 30), Dense(30, 2), softmax)

function loss(X, Y)
    # Run the model over every step of the sequence; score only the last step
    l = Flux.crossentropy(m.(X)[end], Y[end])
    Flux.truncate!(m)
    return l
end

l = loss(X[1], Y[1])
Flux.back!(l)

W = params(m)
W[6].grad  # fine
W[5].grad  # NaNs
```
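In case it helps with reproducing this, here is a minimal sketch of how the NaN check could be run over several gradient arrays at once instead of inspecting them one by one. `nan_report` is a hypothetical helper name introduced for illustration, not a Flux function, and the `grads` arrays are stand-in data so the snippet runs without the model or the BSON file.

```julia
# Hypothetical helper (not part of Flux): for each array, return a tuple of
# (position in the list, array size, number of NaN entries).
function nan_report(arrays)
    return [(i, size(a), count(isnan, a)) for (i, a) in enumerate(arrays)]
end

# Stand-in arrays (assumed shapes) so the sketch runs on its own.
grads = [zeros(30, 15), fill(NaN, 30), [0.1, NaN, 0.3]]
report = nan_report(grads)
foreach(println, report)

# With the model above, after Flux.back!(l), the same helper could be applied as
#   nan_report([p.grad for p in params(m)])
```

Applying the same helper to the input sequence (for example `nan_report(X[1])`, assuming each element is a numeric array) would show whether any NaNs are already present in the data before they reach the model.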

The BSON data file can be downloaded here: lstmdata.bson. Any help would be greatly appreciated. Thanks!