Parallel data loading to GPU arrays

Does anyone have pointers on loading data into CuArrays from background tasks while the main thread or process is busy training a Flux model?

Flux seems to offer at least as much flexibility as PyTorch, so the two fill the same niche, and I would like to try Flux for my next project.

However, in my experience, for the kinds of loosely structured data that benefit most from such flexibility, loading and packaging the data for the model to consume quickly becomes the bottleneck. PyTorch provides multiprocess DataLoaders for this. How would I do the same in the Julia ecosystem?

Does @async not do it?
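For context on the question: a task started with @async stays on the thread that created it, so it only overlaps with training while it is waiting on I/O; CPU-heavy preprocessing would still compete with the training loop. A rough sketch of one alternative is a bounded Channel filled from a spawned task, which can run on another thread when Julia is started with several (for example `julia -t 4`, Julia 1.3 or later). The names `load_batch`, `prefetch`, and `consume` are made up for illustration, and the batch shape is arbitrary; this is not Flux's or CuArrays' API.

```julia
# Hypothetical stand-in for an expensive load-and-collate step.
# Returns an 8 x batchsize matrix filled with the batch index.
function load_batch(i; batchsize = 4)
    return fill(Float32(i), 8, batchsize)
end

# Producer: a background task fills a bounded Channel with batches.
# `spawn = true` allows the task to be scheduled on any available thread.
function prefetch(nbatches; buffersize = 2)
    return Channel{Matrix{Float32}}(buffersize; spawn = true) do ch
        for i in 1:nbatches
            put!(ch, load_batch(i))  # blocks while the buffer is full
        end
    end  # the Channel is closed automatically when the task finishes
end

# Consumer: iterating the Channel takes batches until it is closed.
function consume(nbatches)
    total = 0.0f0
    for batch in prefetch(nbatches)
        # Assumption: the GPU transfer (e.g. `cu(batch)`) and the
        # training step would go here, on the main task.
        total += sum(batch)
    end
    return total
end
```

Keeping the GPU transfer in the consumer is a deliberate simplification: the background task only touches CPU arrays, and the buffer size caps how many batches are held in memory at once.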

Are there any good beginner tutorials on this?

I’m coming from the TensorFlow/Keras world, where we can easily handle larger-than-memory datasets with Dataset iterators.

I’m super excited about the possibilities of Julia/Flux, but I’m having trouble making the jump from MNIST-sized toy datasets to large ones.

There is this QueuePool referenced in Flux PR #450.

I haven’t tried it, don’t know whether it’s in a working state, and don’t have pointers to example usage. Maybe @staticfloat does?