How to take full advantage of GPU Parallelism on Nested Sequential Data in Flux

I am making a machine learning model in Flux that needs to analyze nested sequential data (sequences whose elements are each composed of smaller sequences). I have tried to implement the model on my Nvidia GPU but have run into many occurrences of scalar indexing that I do not know how to avoid.

To make my data more understandable I will use a short example. Say I have an x data point that is a sequence composed of elements A, B, and C. Element A is composed of elements 1, 2, 3; element B of elements 4, 5, 6; and element C of elements 7, 8, 9.

My model chain thus consists of two recurrent models: one that recurs through each inner sequence and represents it as a single vector (one vector per outer-sequence element), and another that recurs through those outer-sequence elements and represents the whole outer sequence as a single vector.

Using my example, the first model would recur through A’s elements 1, 2, 3 and output a single vector V. It would then recur separately through B’s elements 4, 5, 6, outputting a vector U, and through C’s elements, outputting a vector W. These three vectors are then recurred through by my second model, which outputs the final output of my chain: a single vector Z representing my sequence of sequences.

My training data is of type Vector{Tuple{Vector{Flux.OneHotMatrix}, Flux.OneHotArray{UInt32}}}. Breaking this down: it is a vector whose elements are tuples, and each tuple is a single (x, y) training pair, with the first element an x_train data point and the second a y_train data point. Each x data point is a vector of one-hot matrices; each element of that vector represents one element of the outer sequence, so [A, B, C]. Each of these elements is a one-hot matrix whose columns make up the corresponding subsequence, so [A, B, C] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. A recurrent model must iterate through the subsequences separately, and then another recurrent model through the larger sequence. My y data points are one-hot arrays representing the correct classification of each data point.
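For reference, here is a toy construction of one such training pair using my A, B, C example (the sizes and class labels here are just illustrative, not my real data):

using Flux

A = Flux.onehotbatch([1, 2, 3], 1:9)    # 9×3 one-hot matrix for inner sequence A
B = Flux.onehotbatch([4, 5, 6], 1:9)
C = Flux.onehotbatch([7, 8, 9], 1:9)
y = Flux.onehot(:classA, [:classA, :classB])   # one-hot classification label
train_data = [([A, B, C], y)]            # Vector of (x, y) tuples as described above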

I am not sure how to take advantage of the GPU’s parallelism and avoid scalar indexing with such a nested array structure. Please let me know how I can optimize my data structure or model for GPU usage. I know that my program is not taking full advantage of the GPU: training takes a very long time, GPU utilization sits at only about 5%, and I get many scalar indexing warnings.

If it is relevant, my loss function is simply

# `c` is my model chain from above
function loss2(x, y)::Float32
    return Flux.Losses.crossentropy(c(x), y) |> gpu
end

Furthermore, each model in my model chain is wrapped with the Flux/CUDA gpu function.

Any help optimizing this model for GPU usage would be much appreciated.

Thank you for your help,

Jack

The example is a little too sketchy to give a concrete solution, but in general GPUs are terrible at handling nested data. A solution that might be feasible for your problem is to first change the data representation from a batch of structures to multiple batches of the nested data. Think of changing an array-of-structures (AoS) into a structure-of-arrays (SoA). Then, to the GPU’s eye, each of the nested structures will look like good old matrices, just multiple of them.
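As a toy illustration of the idea (my own made-up sizes): instead of keeping a vector of per-element matrices, concatenate them along a new dimension so each nesting level becomes one dense array.

# array-of-structures: 4 inner sequences, each an 8×3 matrix
aos = [rand(Float32, 8, 3) for _ in 1:4]
# structure-of-arrays: one dense 8×3×4 array the GPU can process in a single batched call
soa = cat(aos...; dims = 3)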

Thank you. I’ll work on providing a MWE and try to implement your solution.

Your architecture should not be hard to parallelize. To do so, you can use a 4D input tensor with shape (num_features, max_inner_seq_length, max_outer_seq_length, batch_size). To make all inner (and outer) sequences the same length, you can introduce a special padding symbol.
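For concreteness, here is a minimal padding helper (my own sketch, assuming each x data point is a Vector of num_features × inner_length one-hot matrices; padded positions are simply all-zero columns here, but a dedicated padding class works the same way):

# Pack a batch of nested one-hot sequences into a dense 4D array of shape
# (num_features, max_inner_seq_length, max_outer_seq_length, batch_size).
function pad_batch(xs::Vector{<:Vector{<:AbstractMatrix}})
    num_features = size(first(first(xs)), 1)
    max_inner    = maximum(size(m, 2) for x in xs for m in x)
    max_outer    = maximum(length(x) for x in xs)
    out = zeros(Float32, num_features, max_inner, max_outer, length(xs))
    for (b, x) in enumerate(xs), (o, m) in enumerate(x)
        out[:, 1:size(m, 2), o, b] .= m   # copy real tokens; the rest stays zero padding
    end
    return out   # move the whole dense array to the GPU afterwards with gpu(out)
end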

For your first pass, you can reshape this input tensor to (num_features, max_inner_seq_length, max_outer_seq_length * batch_size) and use any sequence processing model out of the box (e.g. an RNN or a Transformer). Doing so, you will get an output tensor of shape (num_out_features, max_outer_seq_length * batch_size).

For your second pass, you can reshape the output of the first pass to (num_out_features, max_outer_seq_length, batch_size) and once again use any sequence processing model of your choice to get an output of shape (num_labels, batch_size).
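Putting the two passes together, here is a rough sketch with made-up sizes and plain Flux RNN layers (assuming a recent Flux where layers take in => out pairs; driving the recurrent cell with an explicit loop over time steps is just one way to do it):

using Flux

num_features, hidden1, hidden2, num_labels = 16, 32, 32, 4
max_inner, max_outer, batch_size = 5, 3, 8

inner_rnn = RNN(num_features => hidden1) |> gpu
outer_rnn = RNN(hidden1 => hidden2)      |> gpu
head      = Dense(hidden2 => num_labels) |> gpu

# Run a recurrent layer over a (features, time, batch) array and return the last state.
function run_rnn(rnn, x3d)
    Flux.reset!(rnn)
    h = rnn(x3d[:, 1, :])
    for t in 2:size(x3d, 2)
        h = rnn(x3d[:, t, :])   # each step sees a (features, batch) matrix, which is GPU friendly
    end
    return h
end

x    = rand(Float32, num_features, max_inner, max_outer, batch_size) |> gpu  # stand-in for the padded input
h1   = run_rnn(inner_rnn, reshape(x, num_features, max_inner, max_outer * batch_size))
h2   = run_rnn(outer_rnn, reshape(h1, hidden1, max_outer, batch_size))
yhat = softmax(head(h2))                                                     # (num_labels, batch_size)

From there the loss can be computed against the whole batch of one-hot labels at once, e.g. Flux.Losses.crossentropy(yhat, y_batch), with no per-sample Julia loops.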

Many sequence processing models will also let you specify masks to ensure padding symbols are ignored (not doing so should not be a deal breaker, though).


The suggestion above works best if the inner and outer sequences do not vary too much in length. Alternative approaches are possible when this is not the case. For example, you could concatenate all tokens into a (num_features, num_tokens) tensor and keep a separate tensor of shape (3, num_tokens) associating each token with an (inner_id, outer_id, batch_id) index triple. Manipulating such data requires scatter operations, which are implemented in GeometricFlux, for example.
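A rough sketch of that flat-token layout (my own toy numbers, using the scatter from NNlib; note that plain sum pooling ignores token order, unlike the recurrent formulation, so this only illustrates the scatter mechanics):

using NNlib

num_features, num_tokens = 8, 5
tokens   = rand(Float32, num_features, num_tokens)   # all tokens of all sequences concatenated
inner_id = [1, 1, 1, 2, 2]                            # which inner sequence each token belongs to
pooled   = NNlib.scatter(+, tokens, inner_id)         # (num_features, number of inner sequences)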

Finally, if you are dealing with data that has deeper nesting, it might be useful to start looking at graph neural networks (such as those implemented in GeometricFlux).


Thank you so much! You have given me a lot to think about and I will give that a try!


If you can disregard time, you can take a look at our libraries Mill.jl and JsonGrinder.jl (GitHub - CTUAvastLab/JsonGrinder.jl: Towards more automatic processing of structured data), though we have never had time to implement performant GPU support.

Would you mind telling me what kind of data it is? I am interested in nested structures in general, and especially in how to model them using unsupervised approaches.

Thank you for sharing! I took a look and watched your JuliaCon presentation, and it is a really interesting concept! Unfortunately, I cannot disclose the details of the project I am working on at the moment, and I do not think it would be applicable in this particular case, but I can think of others where it might be. One that comes to mind is a project I am working on that tries to classify elements of web pages into a few categories. The Mill approach could be useful there, as web pages and HTML are structured in a very organized and hierarchical manner. A GPU implementation would definitely be a game changer too, especially for larger data sets.