PyTorch DataLoader equivalent for training large models with Flux

I’m trying to train a large computer vision model I built in Flux. Unfortunately the dataset does not fit into memory.

Does something similar to PyTorch’s DataLoader exist as a Julia package? Is there an ongoing effort on creating one? Has anyone else come across the need for something like it before?

I would be interested to hear your thoughts on this, thanks!

2 Likes

Not sure what DataLoader does, but I am working on JDF.jl, which allows each column to be loaded individually. I am developing methods to allow chunk loading or random minibatch loading in v0.4. It’s nowhere near ready, but feel free to list your requirements. I think just random minibatches?

I have found MLDataUtils.jl pretty convenient for this kind of task.

MLDataUtils.jl has a nice interface, similar to PyTorch’s Dataset and DataLoader, built around nobs and getobs.
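For example, here is a minimal sketch of a disk-backed container using that interface (the struct and its fields are made up for illustration, not from the package docs; I’m assuming the usual LearnBase extension points that MLDataUtils builds on):

using FileIO
using MLDataUtils
import LearnBase: nobs, getobs   # the functions the MLDataUtils iterators dispatch on

# hypothetical container: just the paths of the images on disk
struct LazyImageFolder
    paths::Vector{String}
end

# number of observations in the container
nobs(d::LazyImageFolder) = length(d.paths)

# load observations from disk only when they are requested
getobs(d::LazyImageFolder, i::Int) = load(d.paths[i])
getobs(d::LazyImageFolder, idxs::AbstractVector) = [getobs(d, i) for i in idxs]

With those two functions defined, helpers like eachbatch(shuffleobs(data), size = 32) should work on the custom container.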

To give some more background, PyTorch’s DataLoader basically wraps a data container and makes sure that you can get batches of observations quickly by distributing the load across multiple threads. This is especially important for computer vision tasks, as loading large images and performing expensive transformations is CPU-bound and can’t be precomputed, because there would only be enough memory for a few batches.

I suppose with Julia 1.3’s multi-threaded IO this might be a lot easier to implement? Has anyone done this, generally or especially in the context of machine learning?
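To sketch the idea (this is not an existing package, and load_batch stands in for whatever user code reads and transforms one batch): put a bounded Channel between a few spawned loader tasks and the training loop, so that only a handful of preprocessed batches sit in memory at any time.

using Base.Threads: @spawn

function batch_channel(load_batch, batch_indices; buffer = 4)
    ch = Channel{Any}(buffer)            # holds at most `buffer` ready batches
    producer = @async begin
        @sync for idxs in batch_indices
            # each batch is loaded/preprocessed on a worker thread;
            # put! blocks while the buffer is full, giving back-pressure
            @spawn put!(ch, load_batch(idxs))
        end
        close(ch)                        # tell the consumer we are done
    end
    bind(ch, producer)                   # propagate producer errors to the channel
    return ch
end

# training loop, e.g.:
# for batch in batch_channel(load_batch, Iterators.partition(randperm(N), 32)) ... end

Batches arrive in whatever order the threads finish them, which is usually fine for shuffled minibatches.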

Unfortunately the dataset does not fit into memory.

Dataloader doesn’t do lazy-loading, it’s Dataset that does.

Although it’s not as convenient as Dataset, you can implement a function based on MappedArrays.jl, for example:

using MappedArrays
using FileIO

function load_dataset(root)
    # full paths of all files in the dataset directory
    files = map(x -> joinpath(root, x), readdir(root))
    # lazily map `load` over the paths; nothing is read from disk yet
    return mappedarray(load, files)
end

root = "/Users/jc/Downloads/dataset"

# add ; in an interactive environment so the REPL doesn't display (and thereby load) all files
dataset = load_dataset(root);

An image is not read from disk until it’s used, and that’s the trick Dataset uses to save memory. However, the same image may be read from disk multiple times (e.g. once per epoch), which is slower than reading it directly from memory. It’s a tradeoff between computation time and memory.
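To draw minibatches from such a lazily mapped array, something like the following would work (just a sketch; indices are sampled with replacement for simplicity):

# draw a random minibatch; each indexing operation triggers a `load` from disk
function random_batch(dataset, batchsize)
    idxs = rand(1:length(dataset), batchsize)
    return [dataset[i] for i in idxs]
end

batch = random_batch(dataset, 32)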

1 Like

I’ve been working on a framework on top of Knet called Photon, and as a proof of concept I have implemented Dataset/Dataloader functionality (including threading). So perhaps that can serve as inspiration?

You can find the code in subdirectory src/data in

https://github.com/neurallayer/Photon.jl

6 Likes

This looks very nice!
Does it run on Julia 1.3 (rc4) already? If I’m not mistaken, threaded IO did not work reliably up to and including 1.2.

Looks really nice. Love how the chain takes care of the input size for me, so I don’t have to specify that in the chain.

In general threading does work well, even with IO involved. However some external packages are not yet thread-safe. One bug I found for example is that ImageMagick.jl works well, but not when used through FileIO.

Thanks! After using MXNet for a long time, I really got addicted to this feature. Before that I was always too lazy to calculate the output sizes (like in PyTorch) and just used the debugger to figure them out :wink:

1 Like

Same. A framework is meant to make things easy, right?

In your opinion, what are some reasons why someone would choose Julia ecosystem over others?

My assessment is similar to that of Google in this article (https://github.com/tensorflow/swift/blob/master/docs/WhySwiftForTensorFlow.md), the difference being that Google selected the lesser of the two final languages (IMHO) when they decided to go with Swift.

  1. With > 1 million mobile developers, I cannot see the benefit for them of morphing Swift into an ML/numerical language. Design decisions will often conflict, so Swift for TensorFlow could stay a fork for a long time, or even forever. Julia has both the right features and the right community.

  2. Fully static type checking is a burden at the beginning of data science projects (exploration phase). I think Julia strikes a better balance here, although a bit more compile time checking would be welcome.

  3. Many of the better tools for Swift are, as is to be expected, macOS based (btw a platform with limited NVidia support). For such a new language, Julia already has excellent tooling in place for all major platforms. I use Juno on a daily basis and it replaces the two-IDE syndrome (notebooks and PyCharm).

But I have been proven wrong more often than I like to admit, so who knows :wink:

BTW I also took a quick look at Kotlin Native. They did some very cool bindings for TensorFlow and PyTorch as a proof of concept and I was very impressed with the results. But in the end it is also not as well suited as Julia for data science and numerical computing, IMHO.

5 Likes

I have been using this one extensively: https://github.com/pevnak/DataIterators.jl, but it uses processes rather than threads (as threads were not available at the time of writing). I would like to consolidate this effort, as I would like to write this kind of thing for threads as well.
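For reference, a bare-bones sketch of the process-based approach (this is not DataIterators.jl’s actual API; load_batch is a placeholder and has to be callable on the workers, e.g. defined with @everywhere or passed as an anonymous function):

using Distributed
addprocs(4)

function remote_batches(load_batch, batch_indices; buffer = 4)
    # workers push finished batches here; the main process drains it with take!
    ch = RemoteChannel(() -> Channel{Any}(buffer))
    @async begin
        pmap(batch_indices) do idxs
            put!(ch, load_batch(idxs))   # runs on a worker process
        end
        close(ch)
    end
    return ch
end

# on the main process: batch = take!(ch)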

2 Likes

I also have an embryo of something similar
https://github.com/baggepinnen/DiskDataProviders.jl
It works well for what I’m doing, but could certainly be made more general.

See also [ANN] LengthChannels - Buffered iterators for machine learning

2 Likes

How about holylorenzo’s repository?

1 Like