I'm not sure what a data loader does, but I am working on JDF.jl, which allows each column to be loaded individually. In v0.4 I am developing methods that allow chunked loading or random minibatch loading. It's nowhere near ready, but feel free to list your requirements. I assume just random minibatches?
MLDataUtils.jl has a nice interface, similar to PyTorch's Dataset and DataLoader, with `nobs` and `getobs`.
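To make that concrete, a custom data container only needs to implement those two functions. A minimal sketch, assuming the common LearnBase-based setup (the `ImageFolder` type and `myload` helper are hypothetical names for illustration, and the exact module owning `nobs`/`getobs` can differ between package versions):

```julia
using MLDataUtils
import LearnBase  # `nobs` and `getobs` are defined in LearnBase

# Hypothetical container over a folder of image files
struct ImageFolder
    paths::Vector{String}
end

# Total number of observations in the container
LearnBase.nobs(d::ImageFolder) = length(d.paths)

# Load the idx-th observation lazily from disk;
# `myload` stands in for e.g. FileIO.load plus any preprocessing
LearnBase.getobs(d::ImageFolder, idx) = myload(d.paths[idx])
```

With these two methods defined, the container works with the rest of the MLDataUtils machinery, such as batching and shuffling iterators.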
To give some more background, PyTorch's DataLoader basically wraps a data container and makes sure that you can get batches of observations quickly by distributing the load across multiple threads. This is especially important for computer vision tasks, where loading large images and performing expensive transformations is CPU-bound and can't be precomputed, because there would only be enough memory for a few batches.
I suppose with Julia 1.3’s multi-threaded IO this might be a lot easier to implement? Has anyone done this, generally or especially in the context of machine learning?
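As a rough sketch of what such a loader could look like with Julia 1.3 threading (the names `batch_channel` and `loadbatch` are mine, not an existing API): a bounded `Channel` acts as a prefetch buffer, and a task spawned with `Threads.@spawn` fills it with preprocessed batches while the training loop consumes them.

```julia
# Prefetching loader sketch. `loadbatch` is a placeholder for whatever
# expensive IO + transformation produces one batch from a set of indices.
function batch_channel(batch_indices; buffersize = 4)
    ch = Channel{Any}(buffersize)  # bounded buffer: at most `buffersize` batches in memory
    Threads.@spawn begin
        for idxs in batch_indices
            # Runs on a worker thread (Julia >= 1.3), overlapping
            # disk IO and preprocessing with training on the main thread.
            put!(ch, loadbatch(idxs))
        end
        close(ch)  # signals the consumer that iteration is done
    end
    return ch
end

# Usage sketch:
# for batch in batch_channel(Iterators.partition(shuffle(1:n), 32))
#     train_step!(model, batch)
# end
```

The bounded channel is what keeps memory in check: the producer blocks once `buffersize` batches are waiting, so only a few batches are ever materialized at once.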
Unfortunately the dataset does not fit into memory.
DataLoader doesn't do the lazy loading; it's Dataset that does.
Although it's not as convenient as Dataset, you can implement a function based on MappedArrays.jl, for example:

```julia
using FileIO, MappedArrays

function load_dataset(root)
    files = map(x -> joinpath(root, x), readdir(root))
    return mappedarray(load, files)
end

root = "/Users/jc/Downloads/dataset"
# add ; in an interactive environment so that displaying the result
# doesn't force every file to be loaded
dataset = load_dataset(root);
```
An image is not read from disk until it's used, and that's the trick Dataset uses to save memory. However, the same image may be read from disk multiple times, which is slower than reading it from memory. It's a tradeoff between computation time and memory space.
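If you do have memory to spare for part of the data, one way to soften that tradeoff is a small caching wrapper around the lazy array. A sketch (this `CachedArray` is my own illustration, not part of MappedArrays.jl, and it grows without bound; a real version would want an eviction policy):

```julia
# Cache already-loaded elements so each file is read from disk at most once.
struct CachedArray{T,A<:AbstractVector} <: AbstractVector{T}
    data::A            # the underlying lazy array, e.g. a mappedarray
    cache::Dict{Int,T} # index => already-loaded element
end
CachedArray(a::AbstractVector{T}) where {T} = CachedArray{T,typeof(a)}(a, Dict{Int,T}())

Base.size(c::CachedArray) = size(c.data)
# get! loads from the underlying array only on a cache miss
Base.getindex(c::CachedArray, i::Int) = get!(() -> c.data[i], c.cache, i)

# Usage sketch:
# dataset = CachedArray(mappedarray(load, files))
```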
I've been working on a framework on top of Knet called Photon, and as a proof of concept have implemented Dataset/DataLoader functionality (including threading). So perhaps that can serve as inspiration?
In general, threading works well, even with IO involved. However, some external packages are not yet thread-safe. One bug I found, for example, is that ImageMagick.jl works fine on its own, but not when used through FileIO.
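Until those packages become thread-safe, one workaround is to serialize just the unsafe call behind a lock while keeping the rest of the pipeline parallel. A sketch, where `unsafe_load_image` stands in for the problematic loader and `preprocess` for a hypothetical CPU-bound transform:

```julia
const LOAD_LOCK = ReentrantLock()

function safe_load(path)
    # Only the non-thread-safe library call is serialized; the
    # decoding/transform work after it can still run on all threads.
    img = lock(LOAD_LOCK) do
        unsafe_load_image(path)  # placeholder for e.g. a FileIO.load call
    end
    return preprocess(img)
end
```

This obviously bottlenecks the raw file reads on one call at a time, so it only helps when the post-load transformations dominate the cost.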
Thanks! After using MXNet for a long time, I really got addicted to this feature. Before that I was always too lazy to calculate the output sizes by hand (as in PyTorch) and just used the debugger to figure them out.
With more than a million mobile developers, I cannot see the benefit to them of morphing Swift into an ML/numerical language; design decisions will often conflict. So Swift for TensorFlow could stay a fork for a long time, or even forever. Julia has both the right features and the right community.
Fully static type checking is a burden at the beginning of data science projects (the exploration phase). I think Julia strikes a better balance here, although a bit more compile-time checking would be welcome.
Many of the better tools for Swift are, as is to be expected, macOS-based (a platform with limited NVIDIA support, by the way). For such a young language, Julia already has excellent tooling in place on all major platforms. I use Juno on a daily basis, and it replaces the two-IDE syndrome (notebooks plus PyCharm).
But I have been proven wrong more often than I like to admit, so who knows.
BTW, I also looked briefly at Kotlin Native. They did some very cool bindings to TensorFlow and PyTorch as a proof of concept, and I was impressed with the results. But in the end it is also not as well suited as Julia for data science and numerical computing, IMHO.
I have been using this one extensively: https://github.com/pevnak/DataIterators.jl, but it uses processes rather than threads (threads were not available at the time of writing). I would like to consolidate this effort, as I would like to write the same kind of thing for threads as well.