Hi, I’m following d2l.ai’s book to learn deep learning.
I implemented a character-level RNN from scratch and also one using the layers exported by Flux. Both give roughly the same loss and perplexity, with the loss converging at around 2.
Note that I preprocessed the data following the Python code from d2l.ai (reading the underlying source and translating it to Julia), ending up with an Array of size (input_dims, seq_len, batch_size) after one-hot encoding. I am not using a validation set right now. I did diverge a bit from d2l.ai by creating mini-batches (batch size 32) and using ADAM.
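Roughly, the batching step looks like this (a simplified sketch rather than my exact code; `corpus` and `batchify` are hypothetical names):

```julia
using Flux: onehotbatch

# Simplified sketch of the preprocessing described above (names are hypothetical).
# corpus :: Vector{Int} of character indices; vocab :: the character vocabulary.
function batchify(corpus, vocab, seq_len, batch_size)
    n = seq_len * batch_size
    batches = []
    for offset ∈ 1:n:(length(corpus) - n)
        chunk = reshape(corpus[offset:offset+n-1], seq_len, batch_size)
        X = onehotbatch(chunk, 1:length(vocab))  # size (input_dims, seq_len, batch_size)
        Y = onehotbatch(reshape(corpus[offset+1:offset+n], seq_len, batch_size),
                        1:length(vocab))         # targets: the same text shifted by one
        push!(batches, (X, Y))
    end
    return batches
end
```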
For simplicity, I’ll only post the Flux model.
```julia
using Flux
using Flux.Losses: crossentropy

hidden_size = 512
m = Chain(RNN(length(vocab) => hidden_size),
          Dense(hidden_size => length(vocab)),
          softmax) |> gpu

η = 0.01  # learning rate
θ = 0.01  # gradient-clipping norm threshold
loss(X, y) = crossentropy(m(X), y) |> gpu
opt = Optimiser(ClipNorm(θ), ADAM(η))
```
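As a quick sanity check, feeding the model one timestep slice gives back a probability matrix of the same shape (here I assume a vocab of size 28 purely for illustration):

```julia
# Hypothetical shape check: 28 stands in for length(vocab), 32 is the batch size.
x = Flux.onehotbatch(rand(1:28, 32), 1:28) |> gpu  # one-hot input of size (28, 32)
ŷ = m(x)                                           # output of size (28, 32)
@assert all(sum(ŷ, dims=1) .≈ 1)                   # columns are probabilities (softmax)
```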
As for my training function, I wrote it as:
```julia
using Statistics: mean

function train_step!(m, train_data, train_losses, perplexities, printthis)
    train_ls = []
    for (X, Y) ∈ train_data
        Flux.reset!(m)  # clear the hidden state before each sequence batch
        l, ∇ = withgradient(params(m)) do
            # average the per-timestep losses over the seq_len dimension
            mean([loss(x, y) for (x, y) ∈ zip(eachslice(X, dims=2), eachslice(Y, dims=2))])
        end
        push!(train_ls, l)
        update!(opt, params(m), ∇)
    end
    ls = mean(train_ls)
    push!(train_losses, ls)
    push!(perplexities, exp(ls))
    printthis && println("Training Loss: $ls | Perplexity: $(exp(ls))")
end

function train(m, train_data; epochs=50, printevery=5)
    train_losses = []
    perplexities = []
    for epoch ∈ 1:epochs
        epoch % printevery == 0 && println("Epoch $epoch")
        train_step!(m, train_data, train_losses, perplexities, epoch % printevery == 0)
        flush(stdout)
    end
    return train_losses, perplexities
end
```
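I then call it along these lines (hypothetical invocation; `train_data` is the vector of (X, Y) batches described above):

```julia
# 500 epochs, printing every 20 to match the log below.
train_losses, perplexities = train(m, train_data; epochs=500, printevery=20)
```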
After training for 500 epochs, it outputs the following:
```
Epoch 20
Training Loss: 2.551458 | Perplexity: 12.82579
Epoch 40
Training Loss: 2.5226371 | Perplexity: 12.461415
Epoch 60
Training Loss: 2.4680102 | Perplexity: 11.798946
Epoch 80
Training Loss: 2.4618313 | Perplexity: 11.726266
Epoch 100
Training Loss: 2.4447536 | Perplexity: 11.527708
Epoch 120
Training Loss: 2.7575781 | Perplexity: 15.761624
Epoch 140
Training Loss: 2.4505725 | Perplexity: 11.594983
Epoch 160
Training Loss: 2.5324373 | Perplexity: 12.584141
Epoch 180
Training Loss: 2.5143976 | Perplexity: 12.359161
Epoch 200
Training Loss: 2.5226822 | Perplexity: 12.461977
Epoch 220
Training Loss: 2.529594 | Perplexity: 12.54841
Epoch 240
Training Loss: 2.7767358 | Perplexity: 16.066491
```
Is there something I missed (perhaps in calculating the gradient)? Or something I did incorrectly? I would appreciate any help. Thanks!