How to format sequential data to be used in recurrent models when batches are needed?

Since your input data is a list of Vector{String}, there is a pretty straightforward way to do this while still being able to make use of DataLoader. Instead of one-hot encoding everything up front, just pass the arrays of strings to the DataLoader to generate batches and one-hot encode + concatenate afterwards.

Xs = [ ["a", "b", "c"], ["b", "b", "a"], ... ] # Vector{Vector{String}}
Ys = [ [...], [...], ... ] # Same here

dl = DataLoader((Xs, Ys), ...)

for (xs, ys) in dl # 2-tuple of Vector{Vector{String}}
  # convert length <batch size> vector of vectors of length <sequence length>
  # to matrix of size <sequence length> x <batch size>
  x_batched = reduce(hcat, xs) # (transpose isn't defined for String, but hcat stacks each sequence as a column)
  # use the multidimensional array support in https://fluxml.ai/OneHotArrays.jl/dev/reference/#OneHotArrays.onehotbatch-Tuple{Any,%20Any,%20Vararg{Any,%20N}%20where%20N}
  # to convert this into an array of size <categories> x <sequence length> x <batch size>
  x = onehotbatch(x_batched, x_categories)
  y = ... # do the same thing for y

  loss, grads = Flux.withgradient(rnn_model) do model
    y_hat = model(x) # Flux RNNs support 3D (features x seq_len x batch) array inputs
    Flux.crossentropy(y_hat, y) # as do most loss functions
  end

  ...
end
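To make the shape bookkeeping concrete, here is a tiny sketch of the batching step in plain Base Julia. The toy data, category set, and the manual Bool comprehension (a stand-in for OneHotArrays' onehotbatch) are all made up for illustration:

```julia
# Hypothetical toy batch: 2 sequences of length 3
xs = [["a", "b", "c"], ["b", "b", "a"]]

# Each sequence becomes one column: <sequence length> x <batch size>
x_batched = reduce(hcat, xs)
size(x_batched) # (3, 2)

# Manual stand-in for onehotbatch(x_batched, categories), producing a
# Bool array of size <categories> x <sequence length> x <batch size>
categories = ["a", "b", "c"]
x = [x_batched[t, b] == c for c in categories, t in 1:3, b in 1:2]
size(x) # (3, 3, 2)
```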

Some things to note from this example:

  • I move as much code as possible outside of the withgradient callback. The more you have in there, the harder AD will have to work and the slower your training will likely be.
  • You could do this concatenation step on Xs and Ys up front rather than per batch, but in practice the overhead is probably small enough that it doesn’t matter.
  • If your labels are not a sequence (e.g. your model is many-to-one instead of seq-to-seq), then you can still loop over timesteps. Either split the one-hot 3D x and y arrays along the sequence dimension using e.g. collect(eachslice(x; dims=2)), or split the pre-one-hot x_batched/y_batched string matrices and one-hot encode each timestep as you’ve shown above (e.g. [onehotbatch(x_t, x_categories) for x_t in eachrow(x_batched)]). Again, make sure to do this before calling (with)gradient 🙂
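The timestep-splitting route can be sketched in Base Julia alone (toy shapes assumed; with the <categories> x <sequence length> x <batch size> layout above, the sequence axis is dims=2):

```julia
# Hypothetical one-hot batch: 4 categories, sequence length 3, batch size 2
x = rand(Bool, 4, 3, 2)

# One <categories> x <batch size> matrix per timestep
steps = collect(eachslice(x; dims = 2))
length(steps)  # 3
size(steps[1]) # (4, 2)

# Or slice the pre-one-hot string matrix per timestep instead:
x_batched = ["a" "b"; "b" "b"; "c" "a"] # <sequence length> x <batch size>
timesteps = collect(eachrow(x_batched)) # 3 vectors of length 2
```

Each element of steps (or the one-hot encoding of each element of timesteps) can then be fed to the model one step at a time.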