I am making a machine learning model in Flux that needs to analyze nested sequential data (sequences whose elements are each composed of smaller sequences). I have tried to implement the model on my Nvidia GPU but have run into many occurrences of scalar indexing that I do not know how to avoid.

To make my data more understandable I will use a short example. Say I have an x datapoint which is a sequence composed of elements A, B, C. Then element A is composed of elements 1, 2, 3. Element B is composed of elements 4, 5, 6. And element C is composed of elements 7, 8, 9.

My model chain thus consists of two recurrent models: one that recurs through each inner sequence and represents it as a single vector (one vector per outer sequence element), and another that recurs through those outer sequence elements and represents the whole outer sequence as a single vector.

Using my example, it would first recur through A's elements, 1, 2, 3, and output a single vector, V. Then this model would recur separately through B's elements, 4, 5, 6, and output a vector U. Then through C's elements, outputting a vector W. My second model then recurs through these three vectors and produces the final output of my chain: a single vector Z representing my sequence of sequences.
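For concreteness, here is a minimal sketch of how my chain is structured. The layer sizes are made up, and I'm using Flux's step-by-step recurrent API (calling an `RNN` layer once per timestep with `Flux.reset!` between sequences) just to illustrate; this is not my exact code:

```
using Flux

# Illustrative sizes, not my real dimensions
features, hidden1, hidden2, nclasses = 10, 16, 16, 4

inner = RNN(features => hidden1)   # recurs through each subsequence (1, 2, 3 / 4, 5, 6 / ...)
outer = RNN(hidden1 => hidden2)    # recurs through the subsequence vectors (V, U, W)
head  = Dense(hidden2 => nclasses)

# x is a Vector of one-hot matrices, one matrix per outer element (A, B, C)
function encode(x)
    vs = map(x) do m                              # m: one column per inner element
        Flux.reset!(inner)
        last([inner(col) for col in eachcol(m)])  # final state summarizes the subsequence
    end
    Flux.reset!(outer)
    last([outer(v) for v in vs])                  # final state summarizes the outer sequence
end

c(x) = softmax(head(encode(x)))
```

The nested loops here are exactly where I suspect the GPU parallelism is being lost, since every timestep of every subsequence is processed one at a time.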

My training data is of type `Vector{Tuple{Vector{Flux.OneHotMatrix}, Flux.OneHotArray{UInt32}}}`. Breaking this down, it is a vector whose elements are tuples, and each tuple is a single (x, y) training pair: the first element is an x_train data point and the second is a y_train data point. Each x data point is a vector of one-hot matrices, where each element of the vector represents the first layer of elements in the sequence, so `[A, B, C]`. Each of these elements is a one-hot matrix whose columns make up the subsequence, so `[A, B, C] = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`. A recurrent model must iterate through each subsequence separately, and then another recurrent model through the larger sequence. My y data points are one-hot arrays which represent the proper classification of each datapoint.
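To make the layout concrete, here is how one training pair in that structure could be built (the vocabulary size and class count are made-up values for illustration):

```
using Flux

# Illustrative sizes, not my real dimensions
vocab, nclasses = 10, 4

# Columns of each matrix are the inner sequence elements
A = Flux.onehotbatch([1, 2, 3], 1:vocab)
B = Flux.onehotbatch([4, 5, 6], 1:vocab)
C = Flux.onehotbatch([7, 8, 9], 1:vocab)

x = [A, B, C]                       # one x datapoint: a vector of one-hot matrices
y = Flux.onehot(2, 1:nclasses)      # its classification
pair = (x, y)                       # one element of my training data vector
```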

I am not sure how to take advantage of the parallelism of the GPU and avoid scalar indexing with such a nested array structure. Please let me know how I can optimize my data structure or model for GPU usage. I know my program is not taking full advantage of the GPU: training takes a very long time, my GPU utilization sits at only about 5%, and I get many scalar indexing warnings.

If it is relevant, my loss function is simply

```
function loss2(x, y)::Float32
    return Flux.Losses.crossentropy(c(x), y) |> gpu
end
```

Furthermore, each model in my model chain is moved to the GPU with the Flux/CUDA `gpu` function.

Any help optimizing this model for GPU usage would be much appreciated.

Thank you for your help,

Jack