Multi-threaded producer - consumer with threads for loading data

Tomas_Pevny · November 5, 2019, 9:02am

Dear All,

In the training loop of my ML model, I have to iterate over a large set of samples that does not fit in memory. Has anyone tried a producer-consumer model with threads, such that data can be prepared while the gradient on a minibatch is being calculated? Can anyone point me to a place where I can start?

Thanks,
Tomas

jling · November 5, 2019, 10:12am

oxinabox · November 5, 2019, 10:18am

That post has not been updated for Julia 1.3,
which will make Threads much better.

jling · November 5, 2019, 10:19am

The training will be single-threaded, and I don’t think 1.3 is relevant here for file reading.

oxinabox · November 5, 2019, 10:38am

According to the data in that blog post, parallelizing the file reading (with multiprocessing, not threading)
was a huge improvement over just making it async.
Now that could have been because JSON.jl is kinda bad, and thus the file reading was not IO-bound but CPU-bound. It might even be that most IO is CPU-bound when you can’t just mmap.
Or it could be that the common wisdom that async is all you need for IO-bound problems doesn’t actually apply that well on modern hardware (e.g. multiple disks, high-speed random access, etc.).
But anyway, threading will be better than multiprocessing.
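
To make the threaded variant concrete, here is a minimal sketch, assuming Julia 1.3 or later started with more than one thread (e.g. via the `JULIA_NUM_THREADS` environment variable). It is not the code from the blog post: `parse_file` and `load_all` are hypothetical names, and `parse_file` only stands in for whatever CPU-bound parsing the real files need.

```julia
# Hypothetical stand-in for CPU-bound parsing of one file.
parse_file(path) = length(path)

function load_all(paths; buffer = 4)
    results = Channel{Tuple{String,Int}}(buffer)
    # One task per file; Threads.@spawn may schedule them on different threads.
    tasks = [Threads.@spawn put!(results, (p, parse_file(p))) for p in paths]
    # Close the channel once every loader is done, so the consumer loop below ends.
    @async begin
        foreach(wait, tasks)
        close(results)
    end
    out = Dict{String,Int}()
    for (p, n) in results   # results arrive in completion order, not input order
        out[p] = n
    end
    return out
end

parsed = load_all(["a.json", "bb.json", "ccc.json"])
```

The sketch does no error handling: if a loader task throws, the channel is never closed and the consumer loop would wait forever, so real code would need to deal with failures.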

baggepinnen · November 5, 2019, 1:44pm

I’m doing that here
https://github.com/baggepinnen/DiskDataProviders.jl
Unfortunately I have not written any docs yet, but the code base is small enough to glance through.

Tomas_Pevny · November 5, 2019, 2:11pm

It would be nice to provide a minimal working example, such that we can see if we can use it, or get inspired by the library.

bashonubuntu · November 5, 2019, 3:27pm

See if these two links are helpful for you!

https://juliacomputing.github.io/JuliaDB.jl/latest/out_of_core/#user

For the main package, check out JuliaDB.jl

https://juliacomputing.github.io/JuliaDB.jl/latest/

Do you need to use Threads.@threads or will working with worker processes through @distributed also suffice for your application? If so, JuliaDB.jl would be an ideal starting point.

xiaodai · November 5, 2019, 11:53pm

Very nice. Any room for collaboration with JDF.jl? HTTPS://github.com/xiaodaigh/JDF.jl

baggepinnen · November 6, 2019, 3:55am

Are you sure that repo is public?

xiaodai · November 6, 2019, 7:45am

Fixed

baggepinnen · November 6, 2019, 9:28am

There is now a minimal example in the readme. The example does not run by itself, but it shows how I use the package. This package was private until I saw this thread. So far it has not really been written with the intent of making it a registered package for others to use; I think that’s going to be quite a time-consuming task. It can at least serve as inspiration, unless you manage to use it somehow.

Tomas_Pevny · November 6, 2019, 9:43am

Thanks, that looks interesting. I hope to have time on Friday to play with this a little bit.

baggepinnen · November 12, 2019, 5:06am

I packaged the functionality a bit better and wrote up some documentation if anyone is interested
https://baggepinnen.github.io/DiskDataProviders.jl/latest

xiaodai · November 12, 2019, 5:10am

Perhaps some collaboration with JDF.jl?

baggepinnen · November 12, 2019, 5:16am

Possibly, but I have very limited time to work on this. I created the package mostly for my own use, and wrote up the documentation so that some of my colleagues will understand my code and will be able to modify it.

Unfortunately, writing this kind of library and making it general is a huge effort, and I have only implemented support for the kind of data my colleagues and I are working with.

My strategy for reading data on a separate thread might be useful to others, though.
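
For readers who want a starting point, the general idea of reading on a separate thread can be sketched with a buffered `Channel`. This is a sketch under stated assumptions, not the implementation in DiskDataProviders.jl: `load_batch`, `train_step`, and `run_epoch` are hypothetical names, and it assumes Julia 1.3 or later started with more than one thread (e.g. via `JULIA_NUM_THREADS`).

```julia
# Hypothetical stand-in for reading and preprocessing one minibatch from disk.
load_batch(i) = fill(Float32(i), 4)

# Hypothetical stand-in for one gradient step; here it only reduces the batch.
train_step(batch) = sum(batch)

function run_epoch(nbatches; buffer = 2)
    # `spawn = true` lets the producer task run on any available thread.
    # The channel holds at most `buffer` prepared batches, so memory stays bounded.
    batches = Channel{Vector{Float32}}(buffer; spawn = true) do ch
        for i in 1:nbatches
            put!(ch, load_batch(i))   # blocks while the buffer is full
        end
    end
    losses = Float32[]
    for batch in batches              # ends once the producer finishes and the channel closes
        push!(losses, train_step(batch))
    end
    return losses
end

losses = run_epoch(5)
```

The buffer size is the main knob here: a larger buffer hides slow reads better but keeps more batches in memory at once.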

xiaodai · November 13, 2019, 1:37am

That’s a shame. I want to support JDF.jl as a side project, but I also only work on OSS on Sunday mornings. I might just browse around your project for ideas.

What was useful was your description of things like reading data in another thread, so even if you just write down nuggets like this in the docs for others to read, it will be useful.
