In my training loop for an ML model, I have to iterate over a large set of samples that does not fit in memory. Has anyone tried a producer-consumer model with threads, so that data can be prepared while the gradient on a minibatch is being calculated? Can anyone point me to a place where I can start?
According to the data in that blog post, parallelizing the file reading (with multiprocessing, not threading) was a huge improvement over just making it async.
Now that could have been because JSON.jl is kinda bad, and thus the file reading was not IO bound but CPU bound. It might even be that most IO is CPU bound when you can’t just mmap.
Or it could be that the common wisdom that async is all you need for IO-bound problems doesn’t actually apply that well on modern hardware (e.g. multiple disks, high-speed random access, etc.).
But anyway, threading will be better than multiprocessing.
Do you need to use Threads.@threads, or will working with worker processes through @distributed also suffice for your application? If worker processes are enough, JuliaDB.jl would be an ideal starting point.
There is now a minimal example in the readme. The example does not run by itself, but it shows how I use the package. This package was private until I saw this thread. So far it has not really been written with the intent of making it a registered package for others to use; I think that would be quite a time-consuming task. It can at least serve as inspiration if you don’t manage to use it directly.
Possibly, but I have very limited time to work on this. I created the package mostly for my own use, and wrote up the documentation so that some of my colleagues will understand my code and will be able to modify it.
Unfortunately, writing this kind of library and making it general is a huge effort, and I have only implemented support for the kind of data my colleagues and I are working with.
My strategy for reading data on a separate thread might be useful to others, though.
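For anyone looking for a place to start, here is a minimal sketch of the general producer-consumer idea using a buffered `Channel`. It is not the package's actual implementation; `load_batch`, `train_step`, and `run_epoch` are hypothetical placeholders for your own data loading and gradient code. It assumes Julia 1.3 or later, started with more than one thread (e.g. `julia -t 4`), otherwise the producer just runs as an ordinary task on the same thread.

```julia
# Hypothetical stand-in for reading and preprocessing one minibatch from disk.
load_batch(i) = fill(Float32(i), 4)

# Hypothetical stand-in for the gradient computation on one minibatch.
train_step(batch) = sum(batch)

function run_epoch(nbatches; buffer = 2)
    # Buffered channel: the producer blocks in `put!` once `buffer` batches
    # are waiting, so memory use stays bounded.
    # `spawn = true` lets the producer task be scheduled on any available thread.
    ch = Channel{Vector{Float32}}(buffer; spawn = true) do c
        for i in 1:nbatches
            put!(c, load_batch(i))
        end
    end

    total = 0.0f0
    # The channel is closed when the producer task finishes,
    # which ends this loop.
    for batch in ch
        total += train_step(batch)
    end
    return total
end

total = run_epoch(3)
```

The buffer size is the main knob: a larger buffer hides more variation in load time, at the cost of holding more batches in memory.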
That’s a shame. I want to support JDF.jl as a side project, but I also only work on OSS on Sunday mornings. I might just browse around your project for ideas.
What was useful was your description of things like reading data in another thread, so even if you just write down nuggets like this in the docs for others to read, it will be useful.