Since your input data is a list of `Vector{String}`s, there is a pretty straightforward way to do this while still making use of `DataLoader`. Instead of one-hot encoding everything up front, just pass the arrays of strings to the `DataLoader` to generate batches, then concatenate + one-hot encode afterwards.
```julia
using Flux # re-exports `DataLoader` (from MLUtils) and `onehotbatch` (from OneHotArrays)

Xs = [["a", "b", "c"], ["b", "b", "a"], ...] # Vector{Vector{String}}
Ys = [[...], [...], ...] # same here
dl = DataLoader((Xs, Ys), ...)
for (xs, ys) in dl # 2-tuple of Vector{Vector{String}}
    # convert a length <batch size> vector of vectors of length <sequence length>
    # into a matrix of size <batch size> x <sequence length>.
    # Note `permutedims` rather than `transpose`: `transpose` is recursive
    # and not defined on Strings.
    x_batched = reduce(vcat, permutedims.(xs))
    # use the multidimensional array support in https://fluxml.ai/OneHotArrays.jl/dev/reference/#OneHotArrays.onehotbatch-Tuple{Any,%20Any,%20Vararg{Any,%20N}%20where%20N}
    # to convert this into an array of size <categories> x <batch size> x <sequence length>
    x = onehotbatch(x_batched, x_categories)
    y = ... # do the same thing for y
    loss, grads = Flux.withgradient(rnn_model) do model
        # use the `model` argument rather than the captured `rnn_model`,
        # so AD differentiates with respect to the model passed in
        y_hat = model(x) # Flux RNNs support 3D array inputs
        Flux.crossentropy(y_hat, y) # as do most loss functions
    end
    ...
end
```
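To make the shapes concrete, here is a minimal runnable sketch of just the batching + one-hot step. The two-sequence batch and the `["a", "b", "c"]` vocabulary are made up for illustration:

```julia
using Flux # re-exports onehotbatch

xs = [["a", "b", "c"], ["b", "b", "a"]]  # hypothetical batch of 2 sequences of length 3
x_categories = ["a", "b", "c"]           # hypothetical label set

x_batched = reduce(vcat, permutedims.(xs))
@assert size(x_batched) == (2, 3)   # <batch size> x <sequence length>

x = onehotbatch(x_batched, x_categories)
@assert size(x) == (3, 2, 3)        # <categories> x <batch size> x <sequence length>
```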
Some things to note from this example:
- I move as much code as possible outside of the `withgradient` callback. The more you have in there, the harder AD will have to work and the slower your training will likely be.
- You could do the `permutedims` step on `Xs` and `Ys` before batching, but in practice the overhead is probably small enough that it doesn't matter.
- If your labels are not a sequence (e.g. your model is many-to-one instead of seq-to-seq), then you can still use a loop (see the sketch after this list). Either split the one-hot 3D `x` and `y` arrays up using e.g. `collect(eachslice(x; dims=3))`, or split the pre-one-hot `x_batched`/`y_batched` string arrays and one-hot encode each timestep as you've shown above (e.g. `[onehotbatch(x_t, x_categories) for x_t in eachcol(x_batched)]`). Again, make sure to do this before calling `(with)gradient`.
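For the many-to-one case, a quick sketch of those two splitting options, reusing the hypothetical `x`, `x_batched`, and `x_categories` from the shape example above:

```julia
# option 1: slice the one-hot 3D array along the time dimension
x_steps = collect(eachslice(x; dims=3))  # Vector of <categories> x <batch size> slices

# option 2: one-hot encode each timestep (column) of the string matrix
x_steps2 = [onehotbatch(x_t, x_categories) for x_t in eachcol(x_batched)]

# both produce the same per-timestep matrices
@assert all(x_steps .== x_steps2)
```

Either way, the per-timestep matrices can then be fed to the model one step at a time inside the `withgradient` callback, keeping only the final output for the many-to-one loss.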