Pmap using fewer workers than expected after some time

Luapulu · September 16, 2020, 3:27pm

I’m running pmap with 16 workers and batch_size=1 to process hundreds of files. Each file takes over an hour to work through and the files are relatively homogeneous but don’t take exactly the same time. For the first 14 hours or so the computer was using 16 cores at 100% usage. However, at some point in the last few hours the number of fully used CPUs dropped to 8. The other 8 processes are still alive, they’re just shown as sleeping.

When I check the logs I find that the other (not-working) workers finished processing their last assigned file without error. However, they didn’t start on the next file.

Any ideas why this would happen? If no error occurred, why wouldn’t a worker load the next task?

Actually, now they’re down to 6 workers.

Sukera · September 16, 2020, 3:33pm

I’m not 100% sure, but I think pmap does prebatching, i.e. the work gets split up into 16 chunks (if you have 16 workers) and those workers then work on their batch of the whole work. If, by chance, some tasks that take longer are aggregated in a few of the 16 chunks, those workers will have to run for longer than others.

Luapulu · September 16, 2020, 3:36pm

Oooh, so if it’s trying to combine the results from each worker every so often, then it will wait until the batch it’s expecting is completely finished, meaning it has to wait for the last worker to finish the last piece of the batch?

This would explain a few things actually. I also just realised that I did a stupid thing. My pmap isn’t really a map, it’s more like a “for each”. I don’t want to return anything, I just want to save the results from each file to a different file. However, what I think I did was return the entire result of each file, meaning it has to combine all the results rather than returning an array of nothing.
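
A minimal sketch of that “for each” pattern, assuming a hypothetical process_and_save function standing in for the real per-file work. Each call writes its own output file and returns nothing, so pmap only has to collect an array of nothing. As written it runs on the local process; with real workers the function would need to be defined with @everywhere.

```julia
using Distributed

# Hypothetical stand-in for the real per-file work:
# computes a result, writes it to its own file, and returns nothing.
function process_and_save(name::AbstractString, outdir::AbstractString)
    result = uppercase(name)                  # placeholder computation
    outpath = joinpath(outdir, name * ".out")
    write(outpath, result)                    # save the result to disk
    return nothing                            # nothing large travels back to the caller
end

outdir = mktempdir()
files = ["a", "b", "c"]                       # placeholder file names

# One item per call (batch_size=1); the collected results are all `nothing`.
results = pmap(f -> process_and_save(f, outdir), files; batch_size=1)
```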

Luapulu · September 16, 2020, 3:44pm

Ok, simple explanation: there was an error. It waited until each worker was done to stop the program. I guess I was confused because you don’t see the error until each worker is done.
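
One way to make such a failure visible per item is pmap’s on_error keyword. This is only a sketch: risky is a hypothetical function that fails on one input, and the handler just returns the exception so the remaining items still get processed. With real workers, risky would need to be defined with @everywhere.

```julia
using Distributed

# Hypothetical task: fails on one input to mimic a file that cannot be processed.
risky(x) = x == 2 ? error("failed on input $x") : x * 10

# on_error receives the exception; returning it puts the exception
# into the result array instead of aborting the whole pmap call.
results = pmap(risky, 1:4; on_error = e -> e)

# Indices of the inputs whose result is an exception.
failed = findall(r -> r isa Exception, results)
println("failed inputs: ", failed)
```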

nvenkov1 · February 10, 2021, 11:35am

My understanding is different. I think @distributed does static scheduling as you describe, but pmap is dynamic, although not necessarily fault tolerant.

Sukera · February 10, 2021, 12:29pm

It depends. If there are no remote workers, it’s not batched, but if there are and batch_size is set to something other than 1, it is. See the source for more info - it’s written in Julia, so should be quite readable!
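
A small sketch for checking this yourself, assuming two local workers are enough for the experiment. The helper tag is hypothetical; it returns each item together with the id of the process that handled it, so you can see how items were handed out with batch_size=1.

```julia
using Distributed

addprocs(2)                          # assumption: two local worker processes
@everywhere using Distributed       # make myid() available on the workers

# Hypothetical helper: pair each item with the id of the process that ran it.
@everywhere tag(x) = (x, myid())

# With batch_size=1 each item is sent as its own remote call.
assignments = pmap(tag, 1:8; batch_size=1)

for (item, pid) in assignments
    println("item ", item, " ran on worker ", pid)
end
```

Changing batch_size to a larger value and rerunning should show how the grouping changes.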
