How to Take Full Advantage of GPU Parallelism on Nested Sequential Data in Flux

Your architecture should not be hard to parallelize. You can use a 4D input tensor with shape (num_features, max_inner_seq_length, max_outer_seq_length, batch_size). To make all inner (and outer) sequences the same length, introduce a special padding symbol, as in the sketch below.
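Here is a minimal padding sketch in Julia. All names (`pad_to_4d`, `docs`, the `pad` value) are illustrative, not part of any library API:

```julia
# `docs` is a vector of outer sequences; each outer sequence is a vector of
# inner sequences; each inner sequence is a (num_features, len) matrix.
function pad_to_4d(docs, num_features; pad = 0.0f0)
    max_inner = maximum(size(s, 2) for d in docs for s in d)
    max_outer = maximum(length(d) for d in docs)
    batch = length(docs)
    x = fill(pad, num_features, max_inner, max_outer, batch)  # pre-filled with padding
    for (b, d) in enumerate(docs), (o, s) in enumerate(d)
        x[:, 1:size(s, 2), o, b] .= s                         # copy the real tokens in
    end
    return x
end
```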

For your first pass, you can reshape this input tensor to (num_features, max_inner_seq_length, max_outer_seq_length * batch_size) and use any sequence processing model out of the box (e.g. an RNN or a Transformer), reducing each sequence to a single vector (for instance by keeping the last hidden state). This gives you an output tensor of shape (num_out_features, max_outer_seq_length * batch_size).
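As a rough sketch of this first pass (assuming a Flux version with the classic stateful `Recur`/`Flux.reset!` recurrent API — newer Flux releases changed this interface — and with all dimensions illustrative):

```julia
using Flux

num_features, hidden = 8, 16
max_inner, max_outer, batch = 12, 5, 32
x  = rand(Float32, num_features, max_inner, max_outer, batch)  # padded 4D input
xr = reshape(x, num_features, max_inner, max_outer * batch)

# Fold a stateful RNN over the time dimension and keep the last hidden state.
function run_rnn(rnn, xs)                  # xs :: (features, time, n)
    Flux.reset!(rnn)
    h = rnn(xs[:, 1, :])
    for t in 2:size(xs, 2)
        h = rnn(xs[:, t, :])               # one step over all n sequences at once
    end
    return h                               # (hidden, n)
end

inner_rnn = Flux.RNN(num_features => hidden)
h = run_rnn(inner_rnn, xr)                 # (hidden, max_outer * batch)
```

Every inner sequence across the whole batch is processed in one batched RNN call per time step, which is what keeps the GPU busy.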

For your second pass, you can reshape the output of the first pass to (num_out_features, max_outer_seq_length, batch_size) and once again run any sequence processing model of your choice to get an output of shape (num_labels, batch_size).
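Continuing the sketch above (`num_labels` is illustrative):

```julia
num_labels = 4
hr = reshape(h, hidden, max_outer, batch)   # regroup the summaries per outer sequence

outer_rnn = Flux.RNN(hidden => hidden)
g = run_rnn(outer_rnn, hr)                  # (hidden, batch)
out = Flux.Dense(hidden => num_labels)(g)   # (num_labels, batch)
```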

Many sequence processing models also let you specify masks to ensure that padding symbols are ignored (although skipping this is usually not a deal-breaker).
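Flux's recurrent layers do not take a mask argument directly, so here is one manual workaround, again only a sketch: assuming padding sits at the end of each inner sequence and you know the true lengths, read each sequence's hidden state at its last valid position instead of at `max_inner`:

```julia
function last_valid_state(rnn, xs, lengths)
    Flux.reset!(rnn)
    states = [rnn(xs[:, t, :]) for t in 1:size(xs, 2)]   # hidden state at every step
    # For column j, pick the state produced at its true last position.
    return reduce(hcat, [states[lengths[j]][:, j] for j in eachindex(lengths)])
end

lengths = rand(1:max_inner, max_outer * batch)   # illustrative true lengths
h = last_valid_state(inner_rnn, xr, lengths)     # trailing padding never leaks in
```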


The suggestion above works best if the inner and outer sequences do not vary too much in length. When they do, alternative approaches are possible. For example, you could concatenate all tokens into a (num_features, num_tokens) tensor and keep a separate tensor of shape (3, num_tokens) associating each token with an (inner_id, outer_id, batch_id) index triple. Manipulating such data requires scatter operations, which are implemented in GeometricFlux for example; see the sketch below.
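Here is a rough sketch of that layout. In the current ecosystem the scatter primitives also live in NNlib (which GeometricFlux-style libraries build on); all dimensions and index values below are illustrative:

```julia
using NNlib

num_features, num_tokens = 8, 100
max_outer, batch = 5, 4
tokens = rand(Float32, num_features, num_tokens)   # all tokens, concatenated
ids = vcat(rand(1:12, 1, num_tokens),              # inner_id
           rand(1:max_outer, 1, num_tokens),       # outer_id
           rand(1:batch, 1, num_tokens))           # batch_id, i.e. (3, num_tokens)

# Example: sum-pool each token into its (outer_id, batch_id) bucket by
# collapsing the two indices into one linear bucket index.
bucket = (ids[3, :] .- 1) .* max_outer .+ ids[2, :]
pooled = NNlib.scatter(+, tokens, bucket)          # (num_features, maximum(bucket))
```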

Finally, if your data features deeper nesting, it might be useful to start looking at graph neural networks (such as those implemented in GeometricFlux).
