PyTorch DataLoader equivalent for training large models with Flux

holylorenzo · November 5, 2019, 6:25pm

I’m trying to train a large computer vision model I built in Flux. Unfortunately the dataset does not fit into memory.

Does something similar to PyTorch’s DataLoader exist as a Julia package? Is there an ongoing effort on creating one? Has anyone else come across the need for something like it before?

I would be interested to hear your thoughts on this, thanks!

xiaodai · November 5, 2019, 9:24pm

Not sure what DataLoader does, but I am working on JDF.jl, which allows each column to be loaded separately. I am developing methods to allow chunk loading or random minibatch loading in v0.4. It’s nowhere near ready, but feel free to list your requirements. I think just random minibatches?

dfdx · November 5, 2019, 9:43pm

I have found MLDataUtils.jl pretty convenient for this kind of task.

holylorenzo · November 7, 2019, 12:08pm

MLDataUtils.jl has a nice interface, similar to PyTorch’s DataSet and DataLoader with nobs and getobs.
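For readers who have not seen that interface, here is a minimal sketch of the data container idea. To keep it runnable without any packages, `nobs` and `getobs` are defined locally as hypothetical stand-ins; with MLDataUtils.jl you would extend the package's own functions instead. `LazyImages` and `fake_load` are made-up names for illustration.

```julia
# Hypothetical stand-ins for the nobs/getobs interface, so the sketch runs
# without MLDataUtils.jl installed.
nobs(data) = length(data)
getobs(data, i) = data[i]

# A lazy container: it stores file paths and only loads on access.
struct LazyImages
    paths::Vector{String}
    loadfn::Function   # assumed loader; real code might pass FileIO.load
end

nobs(d::LazyImages) = length(d.paths)
getobs(d::LazyImages, i::Integer) = d.loadfn(d.paths[i])
getobs(d::LazyImages, idxs::AbstractVector{<:Integer}) = [getobs(d, i) for i in idxs]

# Fake loader so the sketch runs without any image files on disk.
fake_load(path) = fill(Float32(length(path)), 2, 2)

data = LazyImages(["a.png", "bb.png", "ccc.png"], fake_load)
batch = getobs(data, 1:2)   # only the first two "images" are loaded
```

Nothing is loaded when `data` is constructed; the cost is paid per `getobs` call, which is what makes the container usable for datasets larger than memory.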

To give some more background, PyTorch’s DataLoader basically wraps a data container and makes sure that you can get batches of observations quickly by distributing the load across multiple workers. This is important especially for computer vision tasks, as loading large images and performing expensive transformations are CPU-bound and can’t be precomputed, because there would only be enough memory for a few batches.

I suppose with Julia 1.3’s multi-threaded IO this might be a lot easier to implement? Has anyone done this, generally or especially in the context of machine learning?
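To make the question concrete, here is a rough sketch of what such a loader could look like using only Base, assuming Julia 1.3 or later (for `Threads.@spawn` and the `Channel{T}(f, size)` constructor). `load_observation` and `batch_channel` are hypothetical names, and the loader is a stand-in for real image loading and augmentation. This is not how any existing package implements it.

```julia
# Stand-in for an expensive, CPU-bound load + transform of one observation.
load_observation(i) = fill(Float32(i), 4)

# Returns a Channel that yields batches. At most `buffer` finished batches are
# queued at once, so memory use stays bounded. The observations inside a batch
# are loaded on separate tasks, which run in parallel if Julia was started
# with more than one thread.
function batch_channel(indices, batchsize; buffer = 2)
    Channel{Vector{Vector{Float32}}}(buffer) do ch
        for idxs in Iterators.partition(indices, batchsize)
            tasks = [Threads.@spawn load_observation(i) for i in idxs]
            put!(ch, fetch.(tasks))
        end
    end
end

sizes = Int[]
for batch in batch_channel(1:10, 4)
    push!(sizes, length(batch))   # a training step would go here
end
```

The producer blocks on `put!` once the buffer is full, so loading runs ahead of training by at most `buffer` batches.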

JohnnyChen94 · November 9, 2019, 2:39pm

holylorenzo wrote: “Unfortunately the dataset does not fit into memory.”

DataLoader doesn’t do lazy loading; it’s Dataset that does.

Although it’s not as convenient as Dataset, you can implement a function based on MappedArrays.jl, for example:

using MappedArrays
using FileIO

function load_dataset(root)
    files = map(x->joinpath(root, x), readdir(root))
    return mappedarray(load, files)
end

root = "/Users/jc/Downloads/dataset"

# add ; in interactive environment to disable loading all files
dataset = load_dataset(root);

An image is not read from disk until it’s used, and that’s the trick Dataset uses to save memory. However, the same image is read from disk again every time it is accessed, which is slower than reading it from memory. It’s a tradeoff between computation time and memory space.
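One way to move along that tradeoff is to cache a bounded number of loaded items. The following is a minimal sketch, not part of MappedArrays.jl or FileIO; `CachedLoader`, `slow_load`, and `nreads` are hypothetical names, and the policy (stop caching once full) is deliberately naive. It is also not thread-safe as written.

```julia
# Wraps a loader function with a cache of at most `maxsize` entries.
struct CachedLoader{F}
    loadfn::F
    cache::Dict{String,Any}
    maxsize::Int
end
CachedLoader(loadfn; maxsize = 100) = CachedLoader(loadfn, Dict{String,Any}(), maxsize)

function (c::CachedLoader)(path::AbstractString)
    haskey(c.cache, path) && return c.cache[path]
    value = c.loadfn(path)
    # Naive policy: once the cache is full, later items are simply not cached.
    length(c.cache) < c.maxsize && (c.cache[path] = value)
    return value
end

# Stand-in for reading a file; counts how often the "disk" is hit.
nreads = Ref(0)
slow_load(path) = (nreads[] += 1; uppercase(path))

cached = CachedLoader(slow_load; maxsize = 2)
cached("a.png"); cached("a.png"); cached("b.png")   # "a.png" is read only once
```

In principle a callable like `cached` could be passed to `mappedarray` in place of `load`, with `maxsize` chosen to fit the available memory.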

PeterD · November 10, 2019, 3:37pm

I’ve been working on a framework on top of Knet called Photon and, as a proof of concept, have implemented Dataset/DataLoader functionality (including threading). So perhaps that can serve as inspiration?

You can find the code in subdirectory src/data in

https://github.com/neurallayer/Photon.jl

holylorenzo · November 12, 2019, 9:46am

This looks very nice!
Does it run on Julia 1.3 (rc4) already? If I’m not mistaken, threaded IO did not work reliably up to and including 1.2.

xiaodai · November 12, 2019, 10:15am

Looks real nice. Love how the chain takes care of the input size for me, so I don’t have to specify it in the chain.

PeterD · November 19, 2019, 10:13pm

In general, threading does work well, even with IO involved. However, some external packages are not yet thread-safe. One bug I found, for example, is that ImageMagick.jl works well on its own, but not when used through FileIO.

PeterD · November 19, 2019, 10:15pm

Thanks! After using MXNet for a long time, I really got addicted to this feature. Before that I was always too lazy to calculate the output sizes (like in PyTorch) and just used the debugger to figure them out.

xiaodai · November 19, 2019, 10:22pm

Same. A framework is meant to make it easy, right?

xiaodai · November 19, 2019, 10:24pm

In your opinion, what are some reasons why someone would choose the Julia ecosystem over others?

PeterD · November 20, 2019, 2:16pm

My assessment is similar to that of Google in this article (https://github.com/tensorflow/swift/blob/master/docs/WhySwiftForTensorFlow.md). The difference is that Google selected the lesser option of the two final languages (IMHO) when they decided to go with Swift.

With more than 1 million mobile developers using Swift, I cannot see how it benefits them to morph Swift into an ML/numerical language. Design decisions will often conflict, so Swift for TensorFlow could stay a fork for a long time or even forever. Julia has both the right features and the right community.
Fully static type checking is a burden at the beginning of data science projects (the exploration phase). I think Julia strikes a better balance here, although a bit more compile-time checking would be welcome.
Many of the better tools for Swift are, as is to be expected, macOS-based (by the way, a platform with limited NVIDIA support). For such a new language, Julia already has excellent tooling in place for all major platforms. I use Juno on a daily basis, and it cures the two-IDE syndrome (notebooks and PyCharm).

But I have been proven wrong more often than I like to admit, so who knows.

BTW, I also looked at Kotlin Native quickly. They did some very cool bindings with TensorFlow and PyTorch as a proof of concept, and I was very impressed with the results. But in the end it is also not as well suited as Julia for data science and numerical computing, IMHO.

Tomas_Pevny · November 20, 2019, 5:55pm

I have been using this one extensively: https://github.com/pevnak/DataIterators.jl. It uses processes rather than threads (at the time of writing, threads were not available). I would like to consolidate this effort, as I would like to write this kind of thing for threads as well.

baggepinnen · November 24, 2019, 2:47pm

I also have an embryo of something similar
https://github.com/baggepinnen/DiskDataProviders.jl
It works well for what I’m doing, but could certainly be made more general.

baggepinnen · November 27, 2019, 9:30am

See also [ANN] LengthChannels - Buffered iterators for machine learning

terasakisatoshi · November 8, 2020, 8:58am

How about holylorenzo’s repository?
