I have a simple question about pmap and I can’t find a clear answer anywhere.
When I have a series of pmap’d processes and, let’s say, a few of them are very heavy and require most of the resources, are the heavy tasks able to use all the cores/memory they need, or will pmap impose some limitation (e.g. one core per process)?
I do not think that there are any “limitations” imposed by pmap. If you have 2 workers and want to apply an expensive function to 10 elements of an array, each of the 2 workers may execute multi-threaded BLAS calls, custom code, etc., while both use a lot of memory.
But if both workers simultaneously work on one of the “heavy” tasks and each needs, let’s say, 6 GB of RAM while your machine only has 8 GB, you will run into trouble.
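As a minimal sketch of that situation (the function, matrix size, and element count here are purely illustrative, not from your application):

```julia
using Distributed
addprocs(2)  # two worker processes

# A stand-in for an expensive, memory-hungry task; inside its worker
# process each call is free to use multi-threaded BLAS, lots of RAM, etc.
@everywhere function heavy(x)
    A = randn(2000, 2000)
    return x + sum(A * A')
end

# 10 elements shared between the 2 workers
results = pmap(heavy, 1:10)
```

pmap itself does not cap what each `heavy` call consumes; the operating system arbitrates memory and CPU between the two worker processes.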
If you can explain your specific application in more detail (with a working example), maybe we can advise you on what kind of parallel computing is the best fit.
What I was seeing was the pmap operation not being able to make use of all the computing power of my machine. Since I have a mix of very light operations and some very heavy ones, I was wondering if there was a limit on core usage by pmap processes.
If that is not the case, then I probably have to look for bottlenecks inside the pmap’d operations, I would guess…
I don’t think pmap has the “smarts” you’re looking for; it’s (probably) just a simple map where each element is assigned to a Julia process at scheduling time (i.e. not dynamically allocated when other iterations complete and balanced based on load). EDIT: This is wrong, but the following message still stands
If you were to implement such functionality, it would likely be considered out of scope of pmap; however, there are many people who I’m sure would appreciate such functionality (myself included). If you have some ideas for how to implement this, Dagger.jl would be a great place to discuss the design/implementation, as Dagger already has the necessary foundation to support intelligent reallocation of work at runtime.
pmap has a keyword batchsize whose default value is 1, see Distributed Computing · The Julia Language. This means that work is allocated dynamically: as soon as a worker is done with its task, it will try to get a new one until all are completed.
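A small experiment that illustrates this dynamic scheduling (the task durations are made up): with the default batchsize of 1, a free worker immediately picks up the next pending element, so one slow task does not hold up the rest.

```julia
using Distributed
addprocs(2)

@everywhere function work(t)
    sleep(t)                        # simulate a task taking t seconds
    return (duration = t, pid = myid())
end

# One 4-second task and four 1-second tasks: one worker handles the
# slow element while the other churns through the quick ones, so the
# whole pmap finishes in roughly 4 seconds, not 4 + 4 = 8.
results = pmap(work, [4, 1, 1, 1, 1])
```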
But iirc each worker is limited to one core, so if you want to use more cores, initialize more workers or increase the number of threads in BLAS or similar. That might be wrong, though.
@jpsamaroo
I am actually not looking for a smarter pmap, just for a worker to be able to get all the resources needed to perform its task; to my understanding, this should be the expected behavior…
Unfortunately, scheduling arbitrary work so that it makes full use of all available CPUs while also being efficient is not an easy problem to solve, even for synthetic tasks.
If you can provide some details on what you’re trying to do (an MWE plus any data that accompanies it), then there’s the chance we can help you either tune pmap for your problem, restructure your problem to be more amenable to efficient pmap execution, or find a better work scheduling algorithm to better suit your needs.
That is the default configuration, but addprocs takes the argument enable_threaded_blas, which makes the worker execute linear algebra in a multi-threaded way. If, say, you want to run an @threads for loop, make sure the JULIA_NUM_THREADS environment variable is correctly set before you create the workers, because they inherit that configuration.
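A sketch of that setup (the thread count of 4 is just an example, and `BLAS.get_num_threads` assumes Julia ≥ 1.6):

```julia
using Distributed

# Workers inherit this environment variable at spawn time,
# so set it before calling addprocs.
ENV["JULIA_NUM_THREADS"] = "4"

# enable_threaded_blas = true lets each worker run BLAS calls
# (e.g. large matrix multiplies) with multiple threads.
addprocs(2; enable_threaded_blas = true)

@everywhere using LinearAlgebra
@everywhere println(myid(), ": Julia threads = ", Threads.nthreads(),
                    ", BLAS threads = ", BLAS.get_num_threads())
```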
Because using multi-threaded BLAS in multiple Julia processes on the same host simultaneously would oversubscribe the cores and cause a drop in computational efficiency for each running BLAS operation. Many people launch one worker per core on their host.