Multi-threaded producer-consumer with threads for loading data

Dear All,

In the training loop of my ML model, I have to iterate over a large set of samples that does not fit into memory. Has anyone tried a producer-consumer model with threads, so that data can be prepared while the gradient on a minibatch is being calculated? Can anyone point me to a place where I can start?

Thanks,
Tomas

That post has not been updated for Julia 1.3, which will make threads much better.

The training will be single-threaded, and I don’t think 1.3 is relevant here for file reading.

According to the data in that blog post, parallelizing the file reading (with multiprocessing, not threading) was a huge improvement over just making it async.
Now, that could have been because JSON.jl is kinda bad, and thus the file reading was not IO bound but CPU bound. It might even be that most IO is CPU bound when you can’t just mmap.
Or it could be that the common wisdom that async is all you need for IO-bound problems doesn’t actually apply that well on modern hardware (e.g. multiple disks, high-speed random access, etc.).
But anyway, threading will be better than multiprocessing.
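
To make the threaded variant concrete, here is a minimal sketch of several reader tasks feeding one shared channel. It assumes Julia 1.3 or later, started with more than one thread (e.g. via `JULIA_NUM_THREADS`). `parse_file` and `parallel_reader` are hypothetical names invented for this illustration, and the "parsing" is only a placeholder, so treat it as a starting point rather than a benchmarked solution.

```julia
# Hypothetical stand-in for reading and parsing one file (the CPU-bound part).
parse_file(path::String) = (path, length(path))

# Start `nreaders` tasks that pull paths from a job queue and push parsed
# results into a bounded channel. All names here are illustrative assumptions.
function parallel_reader(paths; nreaders=Threads.nthreads(), buffersize=8)
    # Job queue: filled up front, then closed so readers stop when it is drained.
    jobs = Channel{String}(max(length(paths), 1))
    for p in paths
        put!(jobs, p)
    end
    close(jobs)

    results = Channel{Tuple{String,Int}}(buffersize)
    readers = map(1:nreaders) do _
        Threads.@spawn for p in jobs
            put!(results, parse_file(p))  # blocks while the buffer is full
        end
    end

    # Close the results channel once every reader has finished.
    Threads.@spawn begin
        foreach(wait, readers)
        close(results)
    end
    return results
end

paths = ["sample_$i.json" for i in 1:20]
seen = sort([p for (p, _) in parallel_reader(paths)])
```

Results arrive in whatever order the readers finish, so the consumer should not rely on the original file order. If a reader throws, this sketch never closes `results`; real code would need error handling.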

I’m doing that here
https://github.com/baggepinnen/DiskDataProviders.jl
Unfortunately I have not written any docs yet, but the code base is small enough to glance through.

It would be nice to provide a minimal working example, such that we can see if we can use it, or get inspired by the library.

See if these two links are helpful for you!

https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/#user

For the main package, check out JuliaDB.jl

https://juliacomputing.github.io/JuliaDB.jl/latest/

Do you need to use Threads.@threads or will working with worker processes through @distributed also suffice for your application? If so, JuliaDB.jl would be an ideal starting point.

Very nice. Any room for collaboration with JDF.jl? https://github.com/xiaodaigh/JDF.jl

Are you sure that repo is public?

Fixed

There is now a minimal example in the README. The example does not run by itself, but it shows how I use the package. This package was private until I saw this thread. So far it has not really been written with the intent of making it a registered package for others to use; I think that would be quite a time-consuming task. It can at least serve as inspiration, even if you don’t manage to use it directly.

Thanks, that looks interesting. I hope to have time on Friday to play with this a little bit.

I packaged the functionality a bit better and wrote up some documentation, if anyone is interested:
https://baggepinnen.github.io/DiskDataProviders.jl/latest

Perhaps some collaboration with JDF.jl?

Possibly, but I have very limited time to work on this. I created the package mostly for my own use, and wrote up the documentation so that some of my colleagues will understand my code and will be able to modify it.

Unfortunately, writing this kind of library and making it general is a huge effort, and I have only implemented support for the kind of data my colleagues and I are working with.

My strategy for reading data on a separate thread might be useful to others, though.
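
For readers who want the general idea without reading the package, here is a generic sketch of reading on a separate thread with a bounded buffer. It is not the actual DiskDataProviders.jl implementation. `load_batch`, `batch_channel`, `train_step`, and `train` are hypothetical names for this example, the "data" is random, and it assumes Julia 1.3 or later (for the `spawn=true` keyword of `Channel`).

```julia
# Hypothetical stand-in for reading and preprocessing one minibatch from disk.
load_batch(i) = rand(Float32, 4, 8) .+ i

# Producer: a task that fills a bounded channel with minibatches.
# `spawn=true` (Julia 1.3+) allows the task to run on another thread.
function batch_channel(nbatches; buffersize=4)
    Channel{Matrix{Float32}}(buffersize; spawn=true) do ch
        for i in 1:nbatches
            put!(ch, load_batch(i))  # blocks while the buffer is full
        end
    end  # the channel is closed automatically when the task finishes
end

# Placeholder for a gradient computation and parameter update.
train_step(batch) = sum(batch)

# Consumer: the (single-threaded) training loop takes batches as they arrive.
function train(nbatches)
    total = 0.0
    nseen = 0
    for batch in batch_channel(nbatches)
        total += train_step(batch)
        nseen += 1
    end
    return nseen, total
end

nseen, total = train(10)
```

The buffer size bounds memory use: the producer stays at most `buffersize` batches ahead of the training loop and then waits.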

That’s a shame. I want to support JDF.jl as a side project, but I also only work on OSS on Sunday mornings. I might just browse around your project for ideas.

What was useful was your description of things like reading data in another thread, so even if you just write down nuggets like this in the docs for others to read, it will be useful.