Training FLUX models with larger datasets

Thanks!

From the DataLoaders documentation (https://github.com/lorenzoh/DataLoaders.jl/blob/master/docs/datacontainers.md) it is easy to create a data loader object from the links, with methods to lazily read the images and the labels. However, how can I shuffle the data? The shuffleobs function does not work with "customized" data types: DataLoaders.DataLoader(shuffleobs(data), 16) does not work.

shuffleobs(data) = shuffleobs(Random.GLOBAL_RNG, data)
function shuffleobs(rng::AbstractRNG, data)
    obsview(data, randperm(rng, numobs(data)))
end
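The quoted implementation just draws a random permutation of the observation indices and views the data through it. A self-contained sketch of the same idea, with a plain vector of path strings standing in for the custom dataset (no DataLoaders types involved):

```julia
using Random

# What shuffleobs does under the hood: draw a random permutation of the
# observation indices and index the data through it. Passing an rng makes
# the shuffle reproducible.
rng = MersenneTwister(42)
files = ["img_$(i).png" for i in 1:8]   # stand-in for the image links
perm = randperm(rng, length(files))     # random order of observation indices
shuffled_files = files[perm]            # reordering link strings is cheap
```

The same permutation trick can be applied to any container you can index into, which is all `obsview` relies on.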

Any suggestion?

A possible practical solution: for each epoch, create/update the train_loader with a shuffled version of the links to the images. Since we are working only with links, this operation should be fast. A more elegant solution may be possible and/or already implemented.
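A minimal sketch of that per-epoch approach. The file names here are stand-ins, and the loader line is commented out and hypothetical, only mirroring the DataLoader call from the example:

```julia
using Random

# Stand-in links to images; real code would hold the actual file paths.
files = ["img_$(i).png" for i in 1:64]

for epoch in 1:3
    shuffle!(files)  # in-place reorder of the path strings: fast, no image I/O
    # Rebuild the loader from the freshly shuffled links each epoch:
    # train_loader = DataLoaders.DataLoader(ImageDataset(files), 16)
end
```

Since only the strings move, the reshuffle costs O(n) on path links regardless of image size.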

The example code is:

import DataLoaders.LearnBase: getobs, nobs
using DataLoaders
using Images

struct ImageDataset
    files::Vector{String}
end
ImageDataset(folder::String) = ImageDataset(readdir(folder; join=true))  # join=true keeps full paths for Images.load

nobs(data::ImageDataset) = length(data.files)
getobs(data::ImageDataset, i::Int) = Images.load(data.files[i])

data = ImageDataset("path/to/my/images")
for images in DataLoader(data, 16)
    # Do something
end
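One candidate for the "more elegant" route: a small wrapper dataset that forwards observations through a random permutation, so the loader sees shuffled data without copying any images. This is a hypothetical type, not an existing DataLoaders.jl API; in real code nobs/getobs would extend DataLoaders.LearnBase, while plain functions stand in here so the sketch runs on its own:

```julia
using Random

# Stand-in interface; real code would extend DataLoaders.LearnBase instead.
nobs(data::Vector) = length(data)
getobs(data::Vector, i::Int) = data[i]

# Hypothetical wrapper, NOT part of DataLoaders.jl: serves the underlying
# observations in a randomly permuted order, reordering indices only.
struct ShuffledObs{T}
    data::T
    perm::Vector{Int}
end
ShuffledObs(data) = ShuffledObs(data, randperm(nobs(data)))

nobs(s::ShuffledObs) = length(s.perm)
getobs(s::ShuffledObs, i::Int) = getobs(s.data, s.perm[i])
```

With the LearnBase methods in place, DataLoader(ShuffledObs(data), 16) should then iterate shuffled batches; it may also be worth checking whether the ecosystem (e.g. MLUtils) already ships an equivalent shuffleobs for custom containers.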