Tips for handling large datasets with a lot of preprocessing

Hello,

I am currently trying to use machine learning to model a local relation between two 2D arrays.
Consider two 2D CircularArrays A and B of the same size. I want to predict the value of B[x,y] from the region A[x-r:x+r, y-r:y+r].
I have >100 such pairs of arrays, each 300×300 in size. The radius r for the regions is ~80, so there are a lot of windows in total: 300 × 300 × 100 = 9×10^6, with 6400 Float32 values each. I estimate this to be several hundred GB.
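
For reference, cutting out one window currently looks roughly like this (a sketch with toy stand-ins for the real arrays, using CircularArrays.jl for the wrap-around indexing):

```julia
using CircularArrays

# Toy stand-ins for one pair of arrays (the real ones are 300×300, read from disk)
A = CircularArray(rand(Float32, 300, 300))
B = rand(Float32, 300, 300)

r = 80                              # window radius
x, y = rand(1:300), rand(1:300)     # position of the target value

window = A[x-r:x+r, y-r:y+r]        # (2r+1)×(2r+1) patch, wrapping over the edges
input  = vec(window)                # flattened input for the Dense-only network
target = B[x, y]
```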

I don’t necessarily need to train on every single one of these windows, but even 1% of them would be a lot of data.

I am training a rather simple Flux.jl network with only Dense layers. It has ~50M parameters in total. I want to train it on my GPU, which has 12 GB of memory.

How would you approach handling the data? I don’t need a detailed answer, but I would really like to hear how someone with more experience than me would handle this.

My current solution is to create a new DataLoader every epoch. Each of these DataLoaders contains only a very small portion of the total windows, so it fits into GPU memory. But this takes a really long time, since I have to open all 100 files, cut out a lot of windows from each, and transfer them to GPU memory.
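
In rough pseudocode, the current pipeline is something like this (`load_pair(i)` is a placeholder for however the i-th pair is read from its file):

```julia
using Flux, CUDA, CircularArrays

# load_pair(i) = ...   # hypothetical reader: returns the i-th (A, B) pair as 300×300 Float32 arrays

# Sample `windows_per_pair` random (window, target) pairs from every file
function sample_windows(npairs, windows_per_pair, r)
    n = npairs * windows_per_pair
    X = Array{Float32}(undef, (2r + 1)^2, n)
    Y = Array{Float32}(undef, 1, n)
    k = 0
    for i in 1:npairs
        A, B = load_pair(i)              # opens every single file
        A = CircularArray(A)
        for _ in 1:windows_per_pair
            k += 1
            x, y = rand(1:size(B, 1)), rand(1:size(B, 2))
            X[:, k] = vec(A[x-r:x+r, y-r:y+r])
            Y[1, k] = B[x, y]
        end
    end
    return X, Y
end

for epoch in 1:100
    X, Y = sample_windows(100, 100, 80)   # re-opens all 100 files every epoch
    loader = Flux.DataLoader((X, Y) |> gpu; batchsize = 256, shuffle = true)
    # ... run one epoch of training on `loader` ...
end
```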

I am not very experienced with ML in Julia.

Thank you very much in advance for any answers :slight_smile:

You could retrieve a random subset of the data beforehand and use this for training. Then your network’s training loop won’t involve slow IO operations.
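
Roughly like this, for example (just a sketch: `sample_windows` refers to the helper sketched in your post, the model is only a small stand-in for your ~50M-parameter Dense network, and the subset size is a guess you would tune so it fits next to the model in your 12 GB of GPU memory):

```julia
using Flux, CUDA

# Build the training set ONCE, before the training loop, and move it to the GPU a single time
X, Y = sample_windows(100, 500, 80)     # 100 pairs × 500 windows each = 50_000 samples
data   = (X, Y) |> gpu
loader = Flux.DataLoader(data; batchsize = 256, shuffle = true)

# Small stand-in for the Dense-only model described above
r = 80
model = Chain(Dense((2r + 1)^2 => 1024, relu), Dense(1024 => 64, relu), Dense(64 => 1)) |> gpu
opt_state = Flux.setup(Adam(1f-3), model)

for epoch in 1:100
    for (xb, yb) in loader              # no file IO in here any more
        grads = Flux.gradient(m -> Flux.mse(m(xb), yb), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```

If the subset turns out to be too large for the GPU, you can also keep X and Y in host memory and move each batch with `gpu` inside the loop; the important part is that every file is read only once, up front.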