RNN model converges at a high training loss

Hi, I’m following d2l.ai’s book to learn deep learning.

I implemented a character-level RNN from scratch and also one built from the layers Flux exports. Both seem to give roughly the same loss and perplexity, with the former converging at around 2.

Note that I preprocessed the data following the Python code from d2l.ai (reading the underlying source and translating it to Julia), ending up with an Array of size (input_dims, seq_len, batch_size) after one-hot encoding. I am not using a validation set right now. I did diverge a bit from d2l.ai by batching the data (batch size 32) and using ADAM.
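For reference, the preprocessing looks roughly like this. It is a simplified sketch: `corpus` (a Vector of integer character ids) and `vocab` come from my d2l.ai-style pipeline, and `seq_len = 35` is only there to make the example concrete.

using Flux: onehotbatch, gpu

seq_len, batch_size = 35, 32   # seq_len is illustrative; batch_size = 32 as mentioned above

# split the corpus into input/target chunks, with targets shifted by one character
nchunks = (length(corpus) - 1) ÷ seq_len
Xs = [corpus[(i - 1) * seq_len + 1 : i * seq_len]     for i ∈ 1:nchunks]
Ys = [corpus[(i - 1) * seq_len + 2 : i * seq_len + 1] for i ∈ 1:nchunks]

# one-hot encode a chunk to (input_dims, seq_len), then stack batch_size chunks
# along dims=3 to get (input_dims, seq_len, batch_size)
encode(chunk) = Float32.(onehotbatch(chunk, 1:length(vocab)))
tobatches(chunks) = [cat(encode.(chunks[i:i+batch_size-1])...; dims=3)
                     for i ∈ 1:batch_size:length(chunks) - batch_size + 1]

# pair inputs with targets and move everything to the GPU to match the model
train_data = [(gpu(X), gpu(Y)) for (X, Y) ∈ zip(tobatches(Xs), tobatches(Ys))]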

For simplicity, I’ll only post the Flux model.

using Flux, Statistics                 # Chain, RNN, Dense, softmax, gpu, params; Statistics.mean
using Flux.Losses: crossentropy
using Flux.Optimise: Optimiser, ClipNorm, ADAM, update!
# (`withgradient` is re-exported by recent Flux versions; it comes from Zygote)

hidden_size = 512

# RNN over the vocabulary, Dense decoder back to the vocabulary, softmax output
m = Chain(RNN(length(vocab) => hidden_size),
    Dense(hidden_size => length(vocab)), softmax) |> gpu

η = 0.01   # learning rate
θ = 0.01   # gradient-clipping threshold

loss(X, y) = crossentropy(m(X), y) |> gpu
opt = Optimiser(ClipNorm(θ), ADAM(η))
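As a quick sanity check of the shapes, this is how I understand the model to consume a single time step (the input here is just random data for illustration):

x_t = gpu(rand(Float32, length(vocab), 32))   # one time step for a batch of 32 sequences
ŷ_t = m(x_t)                                  # size (length(vocab), 32); each column sums to ≈ 1
Flux.reset!(m)                                # clear the hidden state again afterwards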

As for my training function, I wrote it as:

function train_step!(m, train_data, train_losses, perplexities, printthis)
    train_ls = []

    for (X, Y) ∈ train_data
        Flux.reset!(m)                        # clear the hidden state before each batch
        l, ∇ = withgradient(params(m)) do
            # step through the sequence dimension (dims=2) one time step at a time
            mean([loss(x, y) for (x, y) ∈ zip(eachslice(X, dims=2), eachslice(Y, dims=2))])
        end

        push!(train_ls, l)

        update!(opt, params(m), ∇)
    end

    ls = mean(train_ls)
    push!(train_losses, ls)
    push!(perplexities, exp(ls))              # perplexity = exp(mean cross-entropy)

    printthis && println("Training Loss: $ls | Perplexity: $(exp(ls))")
end

function train(m, train_data; epochs=50, printevery=5)
    
    train_losses = []
    perplexities = []
    
    for epoch ∈ 1:epochs
        epoch % printevery == 0 && println("Epoch $epoch")
        train_step!(m, train_data, train_losses, perplexities, epoch % printevery == 0)
        flush(stdout)
    end
    return train_losses, perplexities
end
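I call it like this, which is how the log below was produced (printing every 20 epochs):

train_losses, perplexities = train(m, train_data; epochs=500, printevery=20)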

After training for 500 epochs, it outputs the following:

Epoch 20
Training Loss: 2.551458 | Perplexity: 12.82579
Epoch 40
Training Loss: 2.5226371 | Perplexity: 12.461415
Epoch 60
Training Loss: 2.4680102 | Perplexity: 11.798946
Epoch 80
Training Loss: 2.4618313 | Perplexity: 11.726266
Epoch 100
Training Loss: 2.4447536 | Perplexity: 11.527708
Epoch 120
Training Loss: 2.7575781 | Perplexity: 15.761624
Epoch 140
Training Loss: 2.4505725 | Perplexity: 11.594983
Epoch 160
Training Loss: 2.5324373 | Perplexity: 12.584141
Epoch 180
Training Loss: 2.5143976 | Perplexity: 12.359161
Epoch 200
Training Loss: 2.5226822 | Perplexity: 12.461977
Epoch 220
Training Loss: 2.529594 | Perplexity: 12.54841
Epoch 240
Training Loss: 2.7767358 | Perplexity: 16.066491

Is there something I missed (perhaps in how I calculate the gradient), or something I did incorrectly? I would appreciate any help. Thanks!