Shuffled minibatches of a large dataset

Alexander-Barth · March 12, 2019, 8:17am

I would like to train a neural network (Knet or Flux, maybe I will test both) on a large data set (larger than the available memory) representing a series of images.

In Python TensorFlow, I would use the Dataset API (tf.data.Dataset | TensorFlow v2.9.1), which essentially takes as input a Python generator returning a single image and produces shuffled minibatches.

Does anybody have an example of how I would do this in Julia? The only examples I found were for small datasets.

Alexander-Barth · March 13, 2019, 8:45am

If somebody is interested, here is my current approach. Essentially I make a custom vector type.
The variable fnames is assumed to be a vector of file names and labels a vector of corresponding integer labels. All images are padded to a common size (maxsz).

using FileIO, Images
using Base.Iterators: partition
using Random: randperm

# fnames = ...
# labels = ...

perm = randperm(length(fnames))
fnames = fnames[perm]
labels = labels[perm]

nval = 200
fnames_val = fnames[1:nval]
labels_val = labels[1:nval]

fnames_train = fnames[nval+1:end]
labels_train = labels[nval+1:end]


struct ImageDataset <: AbstractArray{Array{Float32,3},1}
    fnames::Vector{String}
    maxsz::NTuple{2,Int}
end

Base.size(d::ImageDataset) = (length(d.fnames),)

function Base.getindex(d::ImageDataset,i::Integer)
    # zero-padded array of the common size (height x width x 3 color channels)
    data3 = zeros(Float32,d.maxsz[1],d.maxsz[2],3)

    # load the i-th image from disk (assumed to be RGB and no larger than maxsz)
    data = FileIO.load(d.fnames[i])
    data3[1:size(data,1),1:size(data,2),1] = red.(data)
    data3[1:size(data,1),1:size(data,2),2] = green.(data)
    data3[1:size(data,1),1:size(data,2),3] = blue.(data)
    return data3
end

maxsz = (385, 394)

batch_size = 50

d_train = ImageDataset(fnames_train,maxsz);
d_val = ImageDataset(fnames_val,maxsz);

For Knet, I define train and val in the following way, concatenating the images of each mini-batch inside a generator:

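using Knet

# lazy generator: each mini-batch is loaded and concatenated only when it is requested
train = ((KnetArray(cat(d_train[i]..., dims = 4)), labels_train[i]) for i in partition(1:length(d_train), batch_size))
# the validation set is small enough to be loaded at once
val = (KnetArray(cat(d_val[:]..., dims = 4)), labels_val[:])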

Is there a better approach?
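
One thing the generator above does not do is reshuffle: the file list is permuted once, so every epoch sees the same mini-batches in the same order. A minimal sketch of a per-epoch reshuffle follows, using only the standard library. The helper name `shuffled_batches` and the toy data are made up for illustration; the only assumption about `data` is that it supports `length` and indexing with a vector of indices, which the ImageDataset type above does.

```julia
using Random
using Base.Iterators: partition

# Hypothetical helper (not from the original post): returns a lazy iterator of
# (inputs, labels) mini-batches that visits the samples in a fresh random order.
# Call it once per epoch to get a new permutation each time.
function shuffled_batches(data, labels, batch_size; rng = Random.default_rng())
    perm = randperm(rng, length(data))
    return ((data[idx], labels[idx]) for idx in partition(perm, batch_size))
end

# Toy stand-in for the image dataset: 10 "samples" and their labels
data = collect(1:10)
labels = collect(101:110)

for epoch in 1:2
    for (x, y) in shuffled_batches(data, labels, 4)
        println("epoch ", epoch, ": x = ", x, "  y = ", y)
    end
end
```

With the image dataset, `x` would be a vector of 3-D arrays, so the `KnetArray(cat(x..., dims = 4))` step from the post would still be applied to each batch.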

jfsantos · March 16, 2019, 11:04pm

I was asking myself the same question the other day and the approach I came up with is very similar to yours (which is basically what I implemented in the past in frameworks such as Keras and PyTorch). Ideally, it would be nice to be able to load minibatches in parallel in another thread/process (which is what Keras and PyTorch dataloaders do), but I still haven’t figured out how to do that in a generic way that other people can use.
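
For what it is worth, one way to sketch the "load in another thread" idea in plain Julia is a buffered `Channel` filled by a background task. This is only an illustration, not how Keras or PyTorch implement their loaders: the helper name `prefetch` is made up, the `spawn = true` keyword needs Julia 1.3 or later, and real parallelism additionally requires starting Julia with several threads and a loader that is safe to call off the main thread.

```julia
# Hypothetical helper: wrap any iterator of batches in a buffered Channel so
# that up to `buffer` batches are prepared ahead of the consumer.
function prefetch(batches; buffer = 4)
    return Channel{Any}(buffer; spawn = true) do ch
        for b in batches
            put!(ch, b)   # blocks while the buffer is full
        end
    end                   # the channel is closed when the producer task ends
end

# Toy stand-in for an expensive batch-loading step
slow_square(i) = (sleep(0.01); i^2)

for b in prefetch((slow_square(i) for i in 1:5); buffer = 2)
    println(b)
end
```

Because there is a single producer task, batches come out in the same order as the wrapped iterator; only the loading is overlapped with the consumer's work.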

Alexander-Barth · March 25, 2019, 2:39pm

This is indeed true. Even if you have a non-generic implementation, I would be very interested to take a look.
It would be nice to have a vector type that caches, e.g., the last accessed element and the next 50 or so, and updates the cache in a separate thread. This “read-ahead vector” type could wrap a vector type similar to ImageDataset in my example.
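
A rough sketch of such a read-ahead wrapper is given below. Everything in it is hypothetical (the `ReadAhead` type, its fields, and the `SlowSquares` toy vector are not from any package). It assumes Julia 1.3 or later for `Threads.@spawn`, mostly sequential access, and that indexing the wrapped vector is safe from several tasks at once, which would have to be checked for the actual image loader.

```julia
# Hypothetical wrapper type: accessing element i also starts loading the
# next `nahead` elements of the parent vector in background tasks.
struct ReadAhead{T,V<:AbstractVector{T}} <: AbstractVector{T}
    parent::V               # underlying, lazily loading vector
    nahead::Int             # number of following elements to load in advance
    tasks::Dict{Int,Task}   # index => task that loads that element
    lock::ReentrantLock     # protects `tasks`
end

ReadAhead(parent::AbstractVector, nahead::Integer) =
    ReadAhead{eltype(parent),typeof(parent)}(parent, nahead, Dict{Int,Task}(), ReentrantLock())

Base.size(r::ReadAhead) = size(r.parent)

function Base.getindex(r::ReadAhead, i::Int)
    @boundscheck checkbounds(r, i)
    task = lock(r.lock) do
        # forget elements behind the current position
        filter!(kv -> first(kv) >= i, r.tasks)
        # make sure element i and the next `nahead` ones are being loaded
        for j in i:min(i + r.nahead, length(r.parent))
            get!(r.tasks, j) do
                Threads.@spawn r.parent[j]
            end
        end
        r.tasks[i]
    end
    return fetch(task)::eltype(r)   # wait (if necessary) for element i
end

# Toy stand-in for ImageDataset: "loading" element i is slow and returns i^2
struct SlowSquares <: AbstractVector{Int}
    n::Int
end
Base.size(s::SlowSquares) = (s.n,)
Base.getindex(s::SlowSquares, i::Int) = (sleep(0.01); i^2)

r = ReadAhead(SlowSquares(20), 5)
println([r[i] for i in 1:20])
```

Since fnames is already permuted once in my example, sequential access to the wrapped dataset still yields randomly ordered images. With a fresh permutation per epoch, the wrapper would need to know the upcoming indices instead of assuming i+1, i+2, and so on.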
