Hi, I’m following d2l.ai’s book to learn deep learning.

I implemented a character-level RNN from scratch, and also one that uses the layers exported by Flux. Both seem to give the same loss and perplexity, with the loss converging at around 2.

Note that I preprocessed the data according to the Python code from d2l.ai (looking at the underlying source code and translating it to Julia), with an Array of size `(input_dims, seq_len, batch_size)` after one-hot encoding.

I am not using any validation set right now. I did diverge a bit from d2l.ai by creating batches (batch size 32) and using ADAM.
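
To make the `(input_dims, seq_len, batch_size)` layout concrete, here is a minimal sketch in plain Julia. It is not my actual preprocessing code: `toy_text`, `toy_vocab`, `char_to_idx`, and `onehot_batch` are made-up names for illustration, and the toy sizes are much smaller than the real data.

```julia
# Toy stand-ins for the real corpus and vocabulary (hypothetical names).
toy_text = "hello world hello"
toy_vocab = sort(unique(collect(toy_text)))
char_to_idx = Dict(c => i for (i, c) in enumerate(toy_vocab))

# Build a one-hot array of size (input_dims, seq_len, batch_size)
# from a vector of equal-length character sequences.
function onehot_batch(seqs, char_to_idx)
    input_dims = length(char_to_idx)
    seq_len = length(first(seqs))
    X = zeros(Float32, input_dims, seq_len, length(seqs))
    for (b, seq) in enumerate(seqs), (t, c) in enumerate(seq)
        X[char_to_idx[c], t, b] = 1f0
    end
    return X
end

seq_len = 4
chars = collect(toy_text)
# Non-overlapping windows; each target window is the input shifted by one character.
starts = 1:seq_len:(length(chars) - seq_len)
X = onehot_batch([chars[s:s+seq_len-1] for s in starts], char_to_idx)
Y = onehot_batch([chars[s+1:s+seq_len] for s in starts], char_to_idx)

# Slicing along dims=2 yields one (input_dims, batch_size) matrix per time step.
steps = collect(eachslice(X, dims=2))
```

Each element of `steps` is what gets fed to the model at a single time step, which is why the training loop slices both `X` and `Y` along `dims=2`.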

For simplicity, I’ll only post the Flux model.

```
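hidden_size = 512
# Recurrent layer followed by a projection back onto the vocabulary
m = Chain(RNN(length(vocab) => hidden_size),
          Dense(hidden_size => length(vocab)), softmax) |> gpu
η = 0.01  # learning rate
θ = 0.01  # gradient-norm clipping threshold
loss(X, y) = crossentropy(m(X), y) |> gpu
opt = Optimiser(ClipNorm(θ), ADAM(η))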
```
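
For context on the `ClipNorm(θ)` part: as I understand it, it rescales a gradient array whenever its norm exceeds `θ` and leaves smaller gradients untouched. Below is a minimal plain-Julia sketch of that idea. `clip_norm` is a hypothetical helper written for illustration, not Flux's implementation.

```julia
using LinearAlgebra

# Rescale `g` so that its L2 norm is at most `θ`.
# Gradients already within the threshold are returned unchanged.
function clip_norm(g::AbstractArray, θ::Real)
    n = norm(g)
    return n > θ ? g .* (θ / n) : g
end

g = Float32[3.0, 4.0]          # norm 5
clipped = clip_norm(g, 0.01)   # same direction, norm shrunk to θ
```

If that understanding is right, then with `θ = 0.01` any gradient array whose norm is larger than 0.01 gets scaled down before ADAM sees it.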

As for my training function, I wrote it as:

```
function train_step!(m, train_data, train_losses, perplexities, printthis)
    train_ls = []
    for (X, Y) ∈ train_data
        Flux.reset!(m)  # clear the hidden state before each batch
        l, ∇ = withgradient(params(m)) do
            # average the per-time-step losses over the sequence dimension
            mean([loss(x, y) for (x, y) ∈ zip(eachslice(X, dims=2), eachslice(Y, dims=2))])
        end
        push!(train_ls, l)
        update!(opt, params(m), ∇)
    end
    ls = mean(train_ls)
    push!(train_losses, ls)
    push!(perplexities, exp(ls))
    printthis && println("Training Loss: $ls | Perplexity: $(exp(ls))")
end

function train(m, train_data; epochs=50, printevery=5)
    train_losses = []
    perplexities = []
    for epoch ∈ 1:epochs
        epoch % printevery == 0 && println("Epoch $epoch")
        train_step!(m, train_data, train_losses, perplexities, epoch % printevery == 0)
        flush(stdout)
    end
    return train_losses, perplexities
end
```
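
The perplexity I report is just the exponential of the mean cross-entropy loss. As a sanity check on the scale, a model that predicts a uniform distribution over a vocabulary of size `V` should have a perplexity of `V`. The sketch below checks this in plain Julia. `xent` is a hand-written stand-in for `crossentropy`, and `V = 28` is an assumed vocabulary size used only for illustration.

```julia
using Statistics

# Mean cross-entropy between predicted probabilities ŷ and one-hot targets y,
# both of size (classes, batch).
xent(ŷ, y) = mean(-sum(y .* log.(ŷ); dims=1))

V = 28                        # assumed vocabulary size (illustration only)
batch = 5
ŷ = fill(1f0 / V, V, batch)   # uniform prediction for every sample
y = zeros(Float32, V, batch)
for b in 1:batch
    y[rand(1:V), b] = 1f0     # random one-hot targets
end

l = xent(ŷ, y)
ppl = exp(l)                  # close to V for a uniform predictor
```

By the same formula, a loss of about 2.5 corresponds to a perplexity of about 12.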

After training for 500 epochs, it outputs the following:

```
Epoch 20
Training Loss: 2.551458 | Perplexity: 12.82579
Epoch 40
Training Loss: 2.5226371 | Perplexity: 12.461415
Epoch 60
Training Loss: 2.4680102 | Perplexity: 11.798946
Epoch 80
Training Loss: 2.4618313 | Perplexity: 11.726266
Epoch 100
Training Loss: 2.4447536 | Perplexity: 11.527708
Epoch 120
Training Loss: 2.7575781 | Perplexity: 15.761624
Epoch 140
Training Loss: 2.4505725 | Perplexity: 11.594983
Epoch 160
Training Loss: 2.5324373 | Perplexity: 12.584141
Epoch 180
Training Loss: 2.5143976 | Perplexity: 12.359161
Epoch 200
Training Loss: 2.5226822 | Perplexity: 12.461977
Epoch 220
Training Loss: 2.529594 | Perplexity: 12.54841
Epoch 240
Training Loss: 2.7767358 | Perplexity: 16.066491
```

Is there something I missed (perhaps in calculating the gradient), or something I did incorrectly? I would appreciate any help. Thanks!