Loss function for sequence modeling w/ RNN

I’d like advice on the best way to model the loss function in Flux for an RNN where I’m modeling a sequence of words in labeled sentences. (This is for a regression task, but a sentiment analysis task would look similar.)

I have sentences of varying lengths, which I split into words and encode as one-hot vectors. Each sentence has a single label that applies to the whole sentence.

# Extremely Simplified Example data:
words = [ "This long sentence is about the number six <EOS>",
"This short one is about 2.3 <EOS>"]

actual_labels = [6.0, 2.3]

allwords = unique( reduce(vcat, split.(words)) )
nwords = size(allwords,1)

Because each sequence has a single label, I’m not sure how best to define the loss function in Flux. I only want to measure the loss on the RNN’s output after it has seen the entire sequence. So far I have treated the labels for earlier words as missing and defined the loss over the non-missing labels only:

# Treat every label except the one at <EOS> as missing
revised_labels = [
    [missing, missing, missing, missing, missing, missing, missing, missing, 6.0],
    [missing, missing, missing, missing, missing, missing, 2.3],
]

# An example model, but I need to get the loss function right first
model_m = Chain(RNN(nwords, 16, tanh), Dense(16, 1, identity))

function loss(x, y)
    if !ismissing(y)
        loss = (model_m(x) - y)^2
    end
end
# One vector of one-hot word vectors per sentence
td = [map(v -> Flux.onehot(v, allwords), split(s)) for s in words]
train_data = Flux.Data.DataLoader( td, revised_labels);

opt = ADAM(1e-2)
Flux.train!(loss, params(model_m), train_data, opt)

# DimensionMismatch.

I get a DimensionMismatch error for several different implementations of the loss like this one, but I think the real problem is that I’m thinking about the loss function incorrectly. I have read the recurrence section of the docs, but so far have not found a solution for this case. If anyone can point me in the right direction, that would be much appreciated.

I’m going to answer my own question in case anyone runs into this problem later and finds this question.

Following the example here worked, though for a regression task the model and loss function need to change somewhat:

function build_model(args)
	scanner = Chain(Dense(args.inpt_dim, args.N, σ), LSTM(args.N, args.N))
	encoder = Dense(args.N, 1, identity) # linear read-out to a single value (identity activation)
	return scanner, encoder
end

function model(x, scanner, encoder)
	state = scanner.(x.data)[end]     # apply the scanner to every word and keep only the output at the last step
	encoder(state)[1]                 # the encoder returns a one-element vector, so take that element
end

The loss function itself is

loss(x, y) = (model(x, scanner, encoder) - y)^2
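
For anyone who wants to see the idea without depending on a particular Flux version, here is a minimal plain-Julia sketch of the same approach: run a recurrent cell over every word, keep only the final hidden state, and compute the squared error on that. It is not the Flux code above. `TinyRNN`, `onehot`, `predict`, and `seq_loss` are hypothetical names made up for this illustration, the weights are random, and no training is performed.

```julia
using Random

# Hypothetical hand-rolled Elman-style cell (NOT Flux's RNN):
# h_new = tanh(Wx*x + Wh*h + b), with a linear read-out to one scalar.
struct TinyRNN
    Wx::Matrix{Float64}
    Wh::Matrix{Float64}
    b::Vector{Float64}
    w_out::Vector{Float64}
end

# Small random initial weights; the seed is fixed only for reproducibility.
function TinyRNN(nin::Int, nhidden::Int; rng = MersenneTwister(0))
    TinyRNN(0.1 .* randn(rng, nhidden, nin),
            0.1 .* randn(rng, nhidden, nhidden),
            zeros(nhidden),
            0.1 .* randn(rng, nhidden))
end

# One-hot encode a word against the vocabulary (plain Float64 vector).
onehot(word, vocab) = Float64.(vocab .== word)

# Feed the whole sequence through the cell and read out from the LAST state only.
function predict(m::TinyRNN, seq)
    h = zeros(length(m.b))              # fresh hidden state for every sentence
    for x in seq
        h = tanh.(m.Wx * x .+ m.Wh * h .+ m.b)
    end
    return sum(m.w_out .* h)            # scalar prediction for the sentence
end

# Squared error between the sentence-level prediction and its single label.
seq_loss(m, seq, y) = (predict(m, seq) - y)^2

sentences = ["This long sentence is about the number six <EOS>",
             "This short one is about 2.3 <EOS>"]
labels = [6.0, 2.3]

vocab = unique(reduce(vcat, split.(sentences)))
seqs  = [[onehot(w, vocab) for w in split(s)] for s in sentences]

m = TinyRNN(length(vocab), 16)
total = sum(seq_loss(m, s, y) for (s, y) in zip(seqs, labels))
```

With this framing there is no need for `missing` labels at the intermediate words: each sentence contributes exactly one loss term, computed from its final state. Note that the hidden state is reset at the start of every sentence, which a stateful recurrent layer would also need before each new sequence.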