RNN model converges at a high training loss

Hi, I’m following d2l.ai’s book to learn deep learning.

I implemented a character-level RNN from scratch and also one built from the layers Flux exports. Both seem to give roughly the same loss and perplexity, with the former converging at around 2.

Note that I preprocessed the data following the Python code from d2l.ai (reading the underlying source and translating it to Julia), ending up with an Array of size (input_dims, seq_len, batch_size) after one-hot encoding. I am not using a validation set right now. I did diverge a bit from d2l.ai by batching the data (batch size 32) and using ADAM.
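For reference, the preprocessing looks roughly like this. It is a simplified sketch: `corpus` (a Vector of integer character ids) and `vocab` come from my d2l.ai-style pipeline, and `seq_len = 35` is only there to make the example concrete.

using Flux: onehotbatch, gpu

seq_len, batch_size = 35, 32   # seq_len is illustrative; batch_size = 32 as mentioned above

# split the corpus into input/target chunks, with targets shifted by one character
nchunks = (length(corpus) - 1) ÷ seq_len
Xs = [corpus[(i - 1) * seq_len + 1 : i * seq_len]     for i ∈ 1:nchunks]
Ys = [corpus[(i - 1) * seq_len + 2 : i * seq_len + 1] for i ∈ 1:nchunks]

# one-hot encode a chunk to (input_dims, seq_len), then stack batch_size chunks
# along dims=3 to get (input_dims, seq_len, batch_size)
encode(chunk) = Float32.(onehotbatch(chunk, 1:length(vocab)))
tobatches(chunks) = [cat(encode.(chunks[i:i+batch_size-1])...; dims=3)
                     for i ∈ 1:batch_size:length(chunks) - batch_size + 1]

# pair inputs with targets and move everything to the GPU to match the model
train_data = [(gpu(X), gpu(Y)) for (X, Y) ∈ zip(tobatches(Xs), tobatches(Ys))]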

For simplicity, I’ll only post the Flux model.

using Flux, Statistics                 # Chain, RNN, Dense, softmax, gpu, params; Statistics.mean
using Flux.Losses: crossentropy
using Flux.Optimise: Optimiser, ClipNorm, ADAM, update!
# (`withgradient` is re-exported by recent Flux versions; it comes from Zygote)

hidden_size = 512

# RNN over the vocabulary, Dense decoder back to the vocabulary, softmax output
m = Chain(RNN(length(vocab) => hidden_size),
    Dense(hidden_size => length(vocab)), softmax) |> gpu

η = 0.01   # learning rate
θ = 0.01   # gradient-clipping threshold

loss(X, y) = crossentropy(m(X), y) |> gpu
opt = Optimiser(ClipNorm(θ), ADAM(η))
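As a quick sanity check of the shapes, this is how I understand the model to consume a single time step (the input here is just random data for illustration):

x_t = gpu(rand(Float32, length(vocab), 32))   # one time step for a batch of 32 sequences
ŷ_t = m(x_t)                                  # size (length(vocab), 32); each column sums to ≈ 1
Flux.reset!(m)                                # clear the hidden state again afterwards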

As for my training function, I wrote it as:

function train_step!(m, train_data, train_losses, perplexities, printthis)
    train_ls = []

    for (X, Y) ∈ train_data
        Flux.reset!(m)                        # clear the hidden state before each batch
        l, ∇ = withgradient(params(m)) do
            # step through the sequence dimension (dims=2) one time step at a time
            mean([loss(x, y) for (x, y) ∈ zip(eachslice(X, dims=2), eachslice(Y, dims=2))])
        end

        push!(train_ls, l)

        update!(opt, params(m), ∇)
    end

    ls = mean(train_ls)
    push!(train_losses, ls)
    push!(perplexities, exp(ls))              # perplexity = exp(mean cross-entropy)

    printthis && println("Training Loss: $ls | Perplexity: $(exp(ls))")
end

function train(m, train_data; epochs=50, printevery=5)
    
    train_losses = []
    perplexities = []
    
    for epoch ∈ 1:epochs
        epoch % printevery == 0 && println("Epoch $epoch")
        train_step!(m, train_data, train_losses, perplexities, epoch % printevery == 0)
        flush(stdout)
    end
    return train_losses, perplexities
end
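I call it like this, which is how the log below was produced (printing every 20 epochs):

train_losses, perplexities = train(m, train_data; epochs=500, printevery=20)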

After training for 500 epochs, it outputs the following:

Epoch 20
Training Loss: 2.551458 | Perplexity: 12.82579
Epoch 40
Training Loss: 2.5226371 | Perplexity: 12.461415
Epoch 60
Training Loss: 2.4680102 | Perplexity: 11.798946
Epoch 80
Training Loss: 2.4618313 | Perplexity: 11.726266
Epoch 100
Training Loss: 2.4447536 | Perplexity: 11.527708
Epoch 120
Training Loss: 2.7575781 | Perplexity: 15.761624
Epoch 140
Training Loss: 2.4505725 | Perplexity: 11.594983
Epoch 160
Training Loss: 2.5324373 | Perplexity: 12.584141
Epoch 180
Training Loss: 2.5143976 | Perplexity: 12.359161
Epoch 200
Training Loss: 2.5226822 | Perplexity: 12.461977
Epoch 220
Training Loss: 2.529594 | Perplexity: 12.54841
Epoch 240
Training Loss: 2.7767358 | Perplexity: 16.066491

Is there something I missed (perhaps in how I calculate the gradient), or something I did incorrectly? I would appreciate any help. Thanks!