Hi, I’m implementing a BLSTM for a personal (soon finished) project of mine, and I’m currently very confused about how BLSTMs are implemented. I found various kinds of implementations in papers online; mine should concatenate the outputs of the two passes into a single array for each word. My current code reads:
struct BLSTM{A,B,C}
    forward  :: Recur{LSTMCell{A,B,C},C}
    backward :: Recur{LSTMCell{A,B,C},C}
    outdim   :: Int
end
function BLSTM(in::Int,out::Int)
    forward = LSTM(in,out)
    backward = LSTM(in,out)
    return BLSTM(forward,backward,out*2)
end
function (m::BLSTM)(x::AbstractArray)
    forward_out = m.forward(x)
    backward_out = reverse(m.backward(reverse(x,dims=2)),dims=2)
    return cat(forward_out,backward_out,dims=1)
end
Flux.trainable(m::BLSTM) = (m.forward,m.backward)
@functor BLSTM
More specifically, I’m unsure about the double reverse. My thought process is that I reverse the input and feed it to the LSTM; the result comes out in reversed order, so I reverse it back.
Looks mostly good. The biggest tweak you’ll need is to change reverse(..., dims=2) to reverse(..., dims=3). Per the Flux Model Reference, the time dimension is the third (last) one when using a dense input.
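To make the difference between the two calls concrete, here is a small sketch in plain Julia (no Flux needed). The array `x` and its sizes are made up for illustration; it only assumes the features×batch×time layout described above.

```julia
# Toy dense input laid out as features × batch × time (sizes are arbitrary).
nfeat, nbatch, ntime = 2, 3, 4
x = reshape(collect(1.0:(nfeat * nbatch * ntime)), nfeat, nbatch, ntime)

# Reversing along dims=3 flips the order of the timesteps only.
x_rev = reverse(x, dims=3)

# The first timestep of the reversed array is the last timestep of the original.
first_matches_last = x_rev[:, :, 1] == x[:, :, end]

# Reversing along dims=2 would instead reorder the samples within the batch.
x_wrong = reverse(x, dims=2)
batch_flipped = x_wrong[:, 1, :] == x[:, end, :]

# Applying the time reversal twice restores the original order.
round_trip = reverse(x_rev, dims=3) == x
```

With dims=2 nothing errors, which is why the mistake is easy to miss: the backward LSTM would simply see the sequences in their original time order.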
On to more minor/stylistic feedback, you can change the struct definition to
struct Bidirectional{A<:Recur,B<:Recur}
    forward  :: A
    backward :: B
end
@functor Bidirectional
This allows it to be used with any RNN type. Note that overriding trainable is not necessary most of the time, because it falls back to calling functor which @functor implements for you.
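As a rough sketch of how the forward pass of such a wrapper could look, the version below drops the `<:Recur` bounds and `@functor` so that it runs without Flux. `running_sum` is a hypothetical stand-in for a recurrent layer, not a Flux function; it only assumes the layer maps a features×batch×time array to another array with the same batch and time sizes.

```julia
# Generic wrapper: `forward` and `backward` can be any callables that map a
# features × batch × time array to an array with the same batch and time sizes.
struct Bidirectional{A,B}
    forward::A
    backward::B
end

function (m::Bidirectional)(x::AbstractArray{T,3}) where {T}
    fwd = m.forward(x)
    # Feed the sequence back to front, then restore the original time order
    # so both halves line up timestep by timestep.
    bwd = reverse(m.backward(reverse(x, dims=3)), dims=3)
    return cat(fwd, bwd, dims=1)
end

# Hypothetical stand-in for a recurrent layer: a running sum over time.
running_sum(x) = cumsum(x, dims=3)

layer = Bidirectional(running_sum, running_sum)
x = ones(Float32, 2, 3, 4)   # features × batch × time
y = layer(x)                 # features are doubled: size (4, 3, 4)
```

At the first timestep the forward half has seen one step while the backward half has already seen the whole sequence, which is the behavior a bidirectional layer is meant to have.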
However, aren’t I dealing directly with the stateful cells? LSTM gives a Recur{LSTMCell{...}}.
My input is of the form 300×N, with 300 being “time” (the length of each vector). Running LSTM(30,10) on the Embedding(37,30) of an input array gives me an array of size 10×300×64. “Time” is second, and the third is the batch.
That docs section is also dealing with stateful cells (that’s what Recur is: a wrapper that makes cells and other functions stateful). Note that LSTM(...) constructs a Recur{LSTMCell}, which is a subtype of Recur.
Unfortunately we have no way of detecting intent and warning about this, but those input dimensions are backwards. Calling an RNN with an input of shape features×time×batch won’t error, but it will compute the wrong results. The only accepted format is features×batch×time (see note [1]), so having the last 2 dimensions backwards means that network outputs will be completely wrong.
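If the data already exists as a dense features×time×batch array, one option is to swap the two trailing dimensions before the RNN sees it. A minimal sketch in plain Julia, with a made-up array `x_wrong` whose sizes mirror the numbers in this thread:

```julia
# Hypothetical embedded input in the mistaken layout: features × time × batch.
nfeat, ntime, nbatch = 30, 300, 64
x_wrong = rand(Float32, nfeat, ntime, nbatch)

# Swap the last two dimensions to get features × batch × time.
x_fixed = permutedims(x_wrong, (1, 3, 2))

size(x_fixed)  # (30, 64, 300)

# Timestep t of sample b is the same feature vector in both layouts.
t, b = 7, 5
same_vector = x_fixed[:, b, t] == x_wrong[:, t, b]
```

Note that `permutedims` copies the data, so doing the swap once up front is cheaper than doing it on every call.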
Thankfully, there’s an easy fix for this. If you transpose your input from 300xN to Nx300 before feeding it to the Embedding, everything will work correctly.
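The effect of that transpose on the shapes can be checked without Flux. In the sketch below, `W` and `embed` are hypothetical stand-ins for an embedding table and its lookup (one column per token id), and the sizes follow the numbers quoted earlier in the thread.

```julia
# Hypothetical token ids: 300 timesteps × 64 sequences, vocabulary of 37.
ntime, nbatch, vocab, nfeat = 300, 64, 37, 30
tokens = rand(1:vocab, ntime, nbatch)

# Stand-in for an embedding table: one column of length `nfeat` per token id.
W = randn(Float32, nfeat, vocab)
embed(ids) = W[:, ids]            # result has size (nfeat, size(ids)...)

# Embedding the raw matrix puts time in the middle: features × time × batch.
size(embed(tokens))               # (30, 300, 64)

# Transposing first puts time last: features × batch × time.
tokens_bt = permutedims(tokens)   # 64 × 300
x = embed(tokens_bt)
size(x)                           # (30, 64, 300)
```

Since the token matrix is much smaller than the embedded array, transposing before the lookup is also the cheaper place to fix the layout.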
[1] There is a good (but unfortunately not often talked about) reason for this. Since RNNs operate on one timestep at a time, we want to preserve memory locality within each timestep for the best performance. That means putting the time dimension first (for row-major arrays as in Python land) or last (for col-major arrays in Julia). It is possible to slap the time dimension in the middle and we may support that in the future, but there’s a good chance doing so will perform noticeably worse.
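The locality argument can be seen directly on a small array. This is only an illustration of Julia's column-major layout with made-up sizes, not a benchmark:

```julia
nfeat, nbatch, ntime = 3, 2, 4
x = reshape(collect(1:(nfeat * nbatch * ntime)), nfeat, nbatch, ntime)

# Column-major strides: stepping along time jumps over a whole features × batch block.
strides(x)  # (1, 3, 6)

# With time last, timestep t is one contiguous run of the underlying memory.
t = 2
block = ((t - 1) * nfeat * nbatch + 1):(t * nfeat * nbatch)
contiguous_slice = vec(x[:, :, t]) == vec(x)[block]

# With time in the middle, the same timestep is scattered across memory.
x_mid = permutedims(x, (1, 3, 2))   # features × time × batch
scattered = findall(in(vec(x_mid[:, t, :])), vec(x_mid))
```

Here `scattered` holds the linear positions of timestep `t` in the time-in-the-middle layout: one short run per sample, separated by gaps, instead of a single block.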