How to format sequential data to be used in recurrence models when batches are needed?

What is the best (recommended) way to store sequence data for recurrence models in Flux when it is going to be used in ‘batches’? At the moment I store the X data (features) one-hot encoded, where the features span the rows (as with images) and the data for each step (time unit) spans the columns, e.g.:

2×8 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  1  ⋅  ⋅  ⋅  ⋅  1  1
 ⋅  ⋅  1  1  1  1  ⋅  ⋅

where each row corresponds to a feature dimension and each column to the data of a ‘time step’. But I am wondering if this is the advised approach, since with non-recurrence models these columns correspond to separate, independent data points. This form makes it a bit cumbersome to use the DataLoader and batches, since within each epoch, and within each iteration over the DataLoader batches, this restructuring has to occur for the data ‘x’:

x_batch_tt = [Flux.stack([Float32.(x[ii][:,tt]) for ii in 1:length(x)],dims=2) for tt in 1:length(x[1][1,:]) ]

(take the data at each time point ‘tt’ from each sequence ‘ii’ in the batch and stack it into a separate array). This means that each element of ‘x_batch_tt’ becomes a matrix where the features for time tt span the rows and the independent sequences span the columns. I wonder if computing this restructuring for every batch is wasteful. I also wonder whether having the time steps span the columns in the first place is the wrong layout for recurrence data, since columns are supposed to be independent samples.

Essentially the data goes from labels (one cold) to an array of 2x8 matrices, which is batched into an array of 10 holding 2x8 matrices. In each iteration (not epoch) I restructure that into an array of 8 elements, each holding a 2x10 array (the features for one sequence step across the independent sequences), which I then loop over.
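
For reference, the reshaping described above can be reproduced with plain arrays. This is only a sketch using base Julia: the random `x` is a hypothetical stand-in for a batch of 10 one-hot 2x8 matrices, and `x3` is an illustrative name.

```julia
# Hypothetical stand-in data: 10 independent sequences,
# each with 2 features (rows) and 8 time steps (columns).
x = [rand(Float32, 2, 8) for _ in 1:10]

# Stack once into a features × time × sequence array.
x3 = cat(x...; dims=3)                      # size (2, 8, 10)

# One features × sequences matrix per time step, obtained by slicing.
x_batch_tt = [x3[:, tt, :] for tt in axes(x3, 2)]
```

If the stacking is done once before training, only the slicing remains inside the training loop.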

Can someone please recommend a best practice solution for working with recurrence data that will make the usage of the data in batches simple/straightforward? Is it by using a 3D matrix by any chance?

I think you’ll have to post a proper MWE with source data (dummy is fine) as a set of sequences (i.e. before one-hot encoding) all the way up to x_batch_tt, because I can’t tell what x is nor what shape it should have. Since you mention batching, it can’t be 2x8 because you have more than one sequence in a batch. There are almost certainly easier ways to do this than the posted snippet, but we need to know what your data looks like first.

It is all in here: JuliaPlottingExploration/Flux03-GeneralConcepts.ipynb at main · mantzaris/JuliaPlottingExploration · GitHub

I avoid rearranging the data structure inside the epoch loop by batching the one-hot matrices up front. The Flux.batch function is really useful for this: pass it the vector of onehotbatch matrices and it returns a 3D array (a tensor in ML nomenclature). Slicing that array extracts the features for a given sequence step over the selected batch indices.

The vector of one hot batches is made

x_train = [ Flux.onehotbatch( x_train_cold[ii] , x_categories ) for ii in 1:length(x_train_cold) ]
2×8 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 1  1  1  1  1  1  1  1
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅

This is put into a tensor where the 3rd dimension is the independent sequence index

x_train_batch = Float32.(Flux.batch( x_train )) #make a batch
(2, 8, 300)

Make the model and pass it the data for the 1st step in the sequence for batch indices 1 and 4

rnn_model6 = Chain( LSTM(2=>6) , LSTM(6=>4) , Dense(4=>3,sigmoid) , softmax ) 
display( rnn_model6( x_train_batch[:,1,[1,4]] ) )
3×2 Matrix{Float32}:
 0.332693  0.332781
 0.334558  0.334347
 0.332749  0.332872
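
The slicing used here can be checked on its own without a model. A small sketch in base Julia, where the random array is a hypothetical stand-in for `x_train_batch` and `idx` is an illustrative name:

```julia
# Hypothetical stand-in for x_train_batch: features × time steps × sequences.
x_train_batch = rand(Float32, 2, 8, 300)

# Time step 1 for sequences 1 and 4: a features × batch matrix (this indexing copies).
step1 = x_train_batch[:, 1, [1, 4]]

# The same selection as a view, which avoids the copy.
step1_view = @view x_train_batch[:, 1, [1, 4]]

# A random mini-batch of 10 whole sequences (indices drawn with replacement).
idx = rand(1:size(x_train_batch, 3), 10)
x_batch = x_train_batch[:, :, idx]          # size (2, 8, 10)
```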

Put it into the epoch loop, where we sample the independent sequence indices

for epoch in 1:1600
    # sample sequence indices (with replacement) for this mini-batch
    sample_num = rand( 1:sample_size , batch_size )
    x_batch_tmp = x_train_batch[:,:,sample_num]
    y_batch_tmp = y_train_batch[:,:,sample_num]
    loss_tmp, grads = Flux.withgradient( rnn_model6 ) do model
        loss = 0f0
        # accumulate the loss over the time steps of the sequences
        for jj in 1:size( x_batch_tmp , 2 )
            y_hat = model( Float32.( x_batch_tmp[:,jj,:] ) )
            loss += Flux.crossentropy( y_hat , y_batch_tmp[:,jj,:] )
        end
        return loss
    end
    Flux.update!( opt , rnn_model6 , grads[1] )
    push!( losses , loss_tmp )
end

Since your input data is a list of Vector{String}, there is a pretty straightforward way to do this while still being able to make use of DataLoader. Instead of one-hot encoding everything up front, just pass the arrays of strings to the DataLoader to generate batches and one-hot encode + concatenate afterwards.

Xs = [ ["a", "b", "c"], ["b", "b", "a"], ... ] # Vector{Vector{String}}
Ys = [ [...], [...], ... ] # Same here

dl = DataLoader((Xs, Ys), ...) # the data is passed as a single tuple

for (xs, ys) in dl # 2-tuple of Vector{Vector{String}}
  # convert length <batch size> vector of vectors of length <sequence length>
  # to matrix of size <batch size> x <sequence length>
  # (permutedims rather than transpose, since transpose is not defined for Strings)
  x_batched = reduce(vcat, permutedims.(xs))
  # use the multidimensional array support in onehotbatch
  # to convert this into an array of size <categories> x <batch size> x <sequence length>
  # (which dimension is treated as time for 3D input depends on the Flux version)
  x = onehotbatch(x_batched, x_categories)
  y = ... # do the same thing for y

  loss, grads = Flux.withgradient(rnn_model) do model
    y_hat = model(x) # Flux RNNs support 3D array inputs
    Flux.crossentropy(y_hat, y) # as do most loss functions
  end
end


Some things to note from this example:

  • I move as much code as possible outside of the withgradient callback. The more you have in there, the harder AD will have to work and the slower your training will likely be.
  • You could do the transpose step on Xs and Ys before batching, but in practice the overhead is probably small enough that it doesn’t matter.
  • If your labels are not a sequence (e.g. your model is many-to-one instead of seq-to-seq), then you can still use a loop. Either split the one-hot 3D x and y arrays up using e.g. collect(eachslice(x; dims=3)) or split the pre-one-hot x_batched/y_batched string arrays and one-hot encode each timestep as you’ve shown above (e.g. [onehotbatch(x_t, x_categories) for x_t in eachcol(x_batched)]). Again, make sure to do this before calling (with)gradient :)
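
To make the shapes in these steps concrete, here is a rough sketch in base Julia only. The two toy sequences and the category list are made up for illustration, and the broadcasted comparison is a hand-rolled stand-in for onehotbatch, not what Flux does internally.

```julia
# Hypothetical batch: 2 sequences of length 3 over the categories "a", "b", "c".
xs = [["a", "b", "c"], ["b", "b", "a"]]
x_categories = ["a", "b", "c"]

# permutedims turns each sequence into a 1×3 row; vcat stacks the rows.
x_batched = reduce(vcat, permutedims.(xs))  # 2×3 Matrix{String}: batch × time

# Hand-rolled one-hot encoding: categories × batch × time.
x = reshape(x_categories, :, 1, 1) .== reshape(x_batched, 1, size(x_batched)...)

# Option 1: split the encoded 3D array into one categories × batch matrix per time step.
x_steps = collect(eachslice(x; dims=3))

# Option 2: split the string matrix per time step first, then encode each column.
x_cols = collect(eachcol(x_batched))
```

Each column of `x_batched` holds one time step across the batch, which is why `eachcol` and `eachslice(x; dims=3)` walk over the same time steps.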

Thanks a lot for this answer. It is interesting that you produce the onehotbatches within the data loop and keep the gradient section intact and minimal. Does producing the onehotbatches in the DataLoader loop carry a high overhead? How does your approach compare to the one I proposed with the tensor? I assume that the steps above the withgradient call are not computationally intensive compared to the AD component, correct?

Ordinarily it shouldn’t be computationally expensive at all, so where you run it shouldn’t matter. However, an implementation detail of the AD Flux uses by default (Zygote) is that it transforms all code inside the gradient callback unless you tell it otherwise. This can lead to slowdowns in some otherwise innocuous-looking code because the AD doesn’t know how to generate optimal code for all functions, so the rule of thumb is to only put what needs to be differentiated in that callback.
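
A minimal sketch of that rule of thumb, assuming a recent Flux version with the explicit-style `withgradient`. The toy labels and the `Dense` model are hypothetical and only there to keep the example small.

```julia
using Flux

# Hypothetical toy data: 5 samples with 2 input categories and 3 output classes.
x_labels = ["a", "b", "a", "a", "b"]
y_labels = [1, 2, 3, 1, 2]
model = Dense(2 => 3)

# Preprocessing happens outside the callback, so the AD never has to transform it.
x = Float32.(Flux.onehotbatch(x_labels, ["a", "b"]))   # 2×5
y = Flux.onehotbatch(y_labels, 1:3)                    # 3×5

# Only the forward pass and the loss are inside the differentiated closure.
loss, grads = Flux.withgradient(model) do m
    Flux.logitcrossentropy(m(x), y)
end
```

Moving the two `onehotbatch` calls into the `do` block should give the same loss, but Zygote would then have to process the encoding code as well.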
