I am making a machine learning model in Flux that needs to analyze nested sequential data (sequences whose elements are each composed of smaller sequences). I have tried to implement the model on my Nvidia GPU but have run into many occurrences of scalar indexing that I do not know how to avoid.
To make my data easier to understand, I will use a short example. Say I have an x datapoint which is a sequence composed of elements A, B, and C. Element A is in turn composed of elements 1, 2, 3; element B of elements 4, 5, 6; and element C of elements 7, 8, 9.
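Schematically, with plain integers standing in for my real tokens:

x = [[1, 2, 3],   # element A
     [4, 5, 6],   # element B
     [7, 8, 9]]   # element C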
My model chain thus consists of two recurrent models: one that recurs through each inner sequence and represents it as a single vector (one vector per outer-sequence element), and another that recurs through those vectors to represent the whole outer sequence as a single vector.
Using my example, the first model would recur through A's elements 1, 2, 3 and output a single vector V. It would then recur separately through B's elements 4, 5, 6, outputting a vector U, and likewise through C's elements, outputting a vector W. My second model then recurs through these three vectors and produces the final output of my chain: a single vector Z representing my sequence of sequences.
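A rough sketch of this two-level encoder, assuming the stateful recurrence API of Flux <= 0.14 (the names and the sizes 9 and 16 are made up for illustration):

using Flux

inner = LSTM(9, 16)    # recurs through each subsequence, e.g. 1, 2, 3
outer = LSTM(16, 16)   # recurs through the inner outputs V, U, W

# run a recurrent model over the columns of a matrix, keeping the last output
function last_state(m, seq)
    Flux.reset!(m)
    out = nothing
    for t in eachcol(seq)
        out = m(t)
    end
    return out
end

# x is a vector of matrices [A, B, C]; the result is the final vector Z
encode(x) = last_state(outer, reduce(hcat, [last_state(inner, a) for a in x]))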
My training data is of type Vector{Tuple{Vector{Flux.OneHotMatrix}, Flux.OneHotArray{UInt32}}}. Breaking this down, it is a vector whose elements are tuples, and each tuple is a single (x, y) training pair: the first element is an x_train data point and the second a y_train data point. Each x data point is a vector of one-hot matrices, one per top-level element of the sequence, so [A, B, C]. The columns of each one-hot matrix make up the corresponding subsequence, so [A, B, C] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]. One recurrent model must iterate through each subsequence separately, and another recurrent model through the larger sequence. My y data points are one-hot arrays representing the correct classification of each data point.
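To make that concrete, a single training pair can be built like this (the vocabulary 1:9 and the four class labels are invented for the example):

using Flux

vocab  = 1:9
labels = 1:4

x = [Flux.onehotbatch([1, 2, 3], vocab),   # A
     Flux.onehotbatch([4, 5, 6], vocab),   # B
     Flux.onehotbatch([7, 8, 9], vocab)]   # C
y = Flux.onehot(2, labels)                 # the correct class, here 2

train_data = [(x, y)]                      # a Vector of (x, y) tuples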
I am not sure how to take advantage of the GPU's parallelism, or how to avoid scalar indexing, with such a deeply nested array structure. Please let me know how I can optimize my data structure or model for GPU usage. I know my program is not taking full advantage of the GPU: training takes a very long time, GPU utilization sits at only about 5%, and I get many scalar indexing warnings.
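In case it helps with diagnosis, I can make the warnings fatal so the stack trace points at the offending line (CUDA.allowscalar is CUDA.jl's own switch):

using CUDA
CUDA.allowscalar(false)   # scalar indexing now throws an error instead of warning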
If it is relevant, my loss function is simply:

# c is my full model chain
function loss2(x, y)::Float32
    return Flux.Losses.crossentropy(c(x), y) |> gpu
end
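For completeness, I train with the standard Flux loop, roughly like this (the optimiser choice is arbitrary):

opt = ADAM()
Flux.train!(loss2, Flux.params(c), train_data, opt)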
Furthermore, each model in my model chain is wrapped with the Flux/CUDA gpu function.
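Concretely, each piece is built like this (layer types and sizes are again made up for illustration):

inner = LSTM(9, 16) |> gpu    # recurs through the subsequences
outer = LSTM(16, 16) |> gpu   # recurs through the inner outputs
head  = Dense(16, 4) |> gpu   # final classification layer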
Any help optimizing this model for GPU usage would be much appreciated.
Thank you for your help,
Jack