Hello,
I am currently trying to learn a local relation between two 2D arrays using machine learning.
Consider two 2D `CircularArray`s `A` and `B` of the same size. I want to predict the value of `B[x, y]` from the region `A[x-r:x+r, y-r:y+r]`.
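(By "circular" I mean periodic boundaries: indices outside the array wrap around, which is what CircularArrays.jl does automatically. A plain-Julia sketch of the same window extraction, just to make the indexing explicit:)

```julia
# Periodic window extraction: indices outside 1:n wrap around.
# CircularArrays.jl handles this automatically via A[x-r:x+r, y-r:y+r];
# here it is spelled out with mod1 for clarity.
wrap(i, n) = mod1(i, n)

function circular_window(A::AbstractMatrix, x::Int, y::Int, r::Int)
    n, m = size(A)
    return [A[wrap(i, n), wrap(j, m)] for i in x-r:x+r, j in y-r:y+r]
end
```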
I have >100 such pairs of arrays, each 300×300 in size. The radius `r` of the regions is ~80, so there are a lot of windows in total: 300·300·100 = 9·10^6, each with (2r+1)² ≈ 26,000 Float32 values. I estimate the full set of windows at several hundred GB.
I don't necessarily need to train on every single one of these windows, but even 1% of them would be a lot of data.
I am training a rather simple Flux.jl network with only `Dense` layers. It has ~50M parameters in total. I want to train it on my GPU, which has 12 GB of memory.
How would you approach handling the data? I don't need a detailed answer, but I would really like to hear how someone with more experience than me would handle this.
My current solution is to create a new `DataLoader` every epoch. This `DataLoader` contains only a very small portion of the total windows, so that they fit into GPU memory. But this really takes a long time, as I have to open all 100 files, cut out a large number of windows from each, and transfer them to GPU memory.
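Note that the raw arrays themselves are tiny (100 pairs × 2 arrays × 300×300 Float32 ≈ 72 MB); it is only the overlapping windows that blow up the size. So one idea I have been considering is to keep all raw arrays in RAM and cut windows lazily, one batch at a time, so only a single batch ever has to be materialized and moved to the GPU. A rough sketch of what I mean (names like `pairs` and `sample_batch` are placeholders, not real API):

```julia
# Sketch: keep the ~72 MB of raw arrays in RAM, sample windows lazily per batch.
wrap(i, n) = mod1(i, n)  # periodic index

# Cut the flattened (2r+1)×(2r+1) window of A around (x, y), with wrap-around.
function window_vec(A::AbstractMatrix{Float32}, x::Int, y::Int, r::Int)
    n, m = size(A)
    return Float32[A[wrap(i, n), wrap(j, m)] for i in x-r:x+r for j in y-r:y+r]
end

# Sample a random batch of (window, target) columns across all (A, B) pairs.
function sample_batch(pairs::Vector{<:Tuple}, batchsize::Int, r::Int)
    d = (2r + 1)^2
    X = Matrix{Float32}(undef, d, batchsize)
    Y = Matrix{Float32}(undef, 1, batchsize)
    for k in 1:batchsize
        A, B = rand(pairs)
        x, y = rand(1:size(A, 1)), rand(1:size(A, 2))
        X[:, k] = window_vec(A, x, y, r)
        Y[1, k] = B[x, y]
    end
    return X, Y
end

# In the training loop, only the current batch would go to the GPU, e.g.:
#   x, y = gpu.(sample_batch(pairs, 256, 80))
#   ... Flux.gradient / Flux.update! as usual ...
```

I am not sure whether the per-batch window cutting on the CPU would become the bottleneck, which is part of what I am asking about.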
I am not very experienced with ML in Julia.
Thank you very much in advance for any answers!