Shuffled minibatches of a large dataset

I would like to train a neural network (Knet or Flux, maybe I will test both) on a large dataset (larger than the available memory) representing a series of images.

In Python TensorFlow, I would use the Dataset API (tf.data.Dataset | TensorFlow v2.9.1), which essentially takes as input a Python generator returning a single image and produces shuffled minibatches.

Does anybody have an example of how to do this in Julia? The only examples I could find were for small datasets.


If somebody is interested, here is my current approach. Essentially I make a custom vector type.
The variable fnames is assumed to be a vector of file names and labels a vector of corresponding integer labels. All images are padded to a common size (maxsz).

using FileIO, Images
using Random: randperm
using Base.Iterators: partition

# fnames = ...
# labels = ...

perm = randperm(length(fnames))
fnames = fnames[perm]
labels = labels[perm]

nval = 200
fnames_val = fnames[1:nval]
labels_val = labels[1:nval]

fnames_train = fnames[nval+1:end]
labels_train = labels[nval+1:end]


# Lazy vector of images: each element is loaded from disk and
# zero-padded to the common size maxsz only when it is indexed.
struct ImageDataset <: AbstractArray{Array{Float32,3},1}
    fnames::Vector{String}
    maxsz::NTuple{2,Int}
end

Base.size(d::ImageDataset) = (length(d.fnames),)

function Base.getindex(d::ImageDataset, i::Integer)
    # allocate the padded output array with 3 color channels
    data3 = zeros(Float32, d.maxsz[1], d.maxsz[2], 3)

    data = FileIO.load(d.fnames[i])
    data3[1:size(data,1), 1:size(data,2), 1] = red.(data)
    data3[1:size(data,1), 1:size(data,2), 2] = green.(data)
    data3[1:size(data,1), 1:size(data,2), 3] = blue.(data)
    return data3
end

maxsz = (385, 394)

batch_size = 50

d_train = ImageDataset(fnames_train,maxsz);
d_val = ImageDataset(fnames_val,maxsz);
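
Since ImageDataset implements the AbstractArray interface, each element is only read from disk when it is indexed; for example (just to illustrate the behaviour, assuming the files exist):

x = d_train[1]     # loads and pads a single training image: Array{Float32,3} of size (385, 394, 3)
length(d_train)    # number of training images, nothing is loaded here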

For Knet, I define train and val as follows, concatenating the images into mini-batches with a generator:

train = ((KnetArray(cat(d_train[i]..., dims = 4)), labels_train[i])
         for i in partition(1:length(d_train), batch_size))
val = (KnetArray(cat(d_val[:]..., dims = 4)), labels_val[:])
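
For Flux I would presumably build the batches the same way, just moving each batch to the GPU with Flux's gpu instead of wrapping it in a KnetArray (untested sketch along the same lines):

using Flux: gpu

train = ((gpu(cat(d_train[i]..., dims = 4)), labels_train[i])
         for i in partition(1:length(d_train), batch_size))
val = (gpu(cat(d_val[:]..., dims = 4)), labels_val[:])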

Is there a better approach?

I was asking myself the same question the other day and the approach I came up with is very similar to yours (which is basically what I implemented in the past in frameworks such as Keras and PyTorch). Ideally, it would be nice to be able to load minibatches in parallel in another thread/process (which is what Keras and PyTorch dataloaders do), but I still haven’t figured out how to do that in a generic way that other people can use.

This is indeed true. Even if you have a non-generic implementation, I would be very interested to take a look. 🙂
It would be nice to have a vector type which, e.g., caches the last accessed element and, say, the next 50, and updates the cache in a separate thread. This "read-ahead vector" type could wrap around a vector type similar to the ImageDataset in my example.
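
Something along these lines is what I have in mind (a very rough, untested sketch; ReadAheadVector is just a hypothetical name, and a real version would need proper cache eviction and more careful handling of races):

# hypothetical read-ahead wrapper; assumes Julia was started with several threads
struct ReadAheadVector{T,V<:AbstractVector{T}} <: AbstractVector{T}
    data::V
    nahead::Int
    cache::Dict{Int,T}
    lock::ReentrantLock
end

ReadAheadVector(data; nahead = 50) =
    ReadAheadVector(data, nahead, Dict{Int,eltype(data)}(), ReentrantLock())

Base.size(v::ReadAheadVector) = size(v.data)

function Base.getindex(v::ReadAheadVector, i::Integer)
    # serve the element from the cache if it was already prefetched
    x = lock(v.lock) do
        pop!(v.cache, i, nothing)
    end
    x === nothing && (x = v.data[i])
    # prefetch the next nahead elements on another thread
    # (stale entries are never evicted in this sketch)
    Threads.@spawn for j in i+1:min(i + v.nahead, length(v.data))
        y = v.data[j]
        lock(v.lock) do
            v.cache[j] = y
        end
    end
    return x
end

One would then wrap the dataset, e.g. d_train = ReadAheadVector(ImageDataset(fnames_train, maxsz)), and keep the rest of the pipeline unchanged.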