Pmap using fewer workers than expected after some time

I’m running pmap with 16 workers and batch_size=1 to process hundreds of files. Each file takes over an hour to work through, and the files are relatively homogeneous but don’t take exactly the same time. For the first 14 hours or so, all 16 cores were at 100% usage. However, at some point in the last few hours the number of fully used CPUs dropped to 8. The other 8 processes are still alive; they’re just shown as sleeping.

When I check the logs I find that the other (not-working) workers finished processing their last assigned file without error. However, they didn’t start on the next file.

Any ideas why this would happen? If no error occurred, why wouldn’t a worker load the next task?

Actually, now they’re down to 6 workers.

I’m not 100% sure, but I think pmap does prebatching, i.e. the work gets split up into 16 chunks (if you have 16 workers) and those workers then work on their batch of the whole work. If, by chance, some tasks that take longer are aggregated in a few of the 16 chunks, those workers will have to run for longer than others.


Oooh, so if it’s trying to combine the results from each worker every so often, then it will wait until the batch it’s expecting is completely finished, meaning it has to wait for the last worker to finish the last piece of the batch?

This would explain a few things actually. I also just realised that I did a stupid thing. My pmap isn’t really a map, it’s more like a “for each”. I don’t want to return anything, I just want to save the results from each file to a different file. However, what I think I did was return the entire result of each file, meaning it has to combine all the results rather than returning an array of nothing.
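Something like the following is what I should have written, as far as I can tell. It is only a minimal sketch: process_and_save, the temporary directories and the placeholder "result" are all made up for illustration, not my real code. The point is just that the mapped function writes its output to disk and returns nothing, so pmap only has to collect an array of nothing.

```julia
using Distributed

# Hypothetical stand-in for the real per-file work: compute something,
# save it next to the other outputs, and send nothing back to the caller.
@everywhere function process_and_save(inpath, outdir)
    nbytes = filesize(inpath)                          # placeholder "result"
    outpath = joinpath(outdir, basename(inpath) * ".out")
    write(outpath, string(nbytes))
    return nothing                                     # nothing large is returned
end

# Made-up input files so the sketch runs on its own.
indir, outdir = mktempdir(), mktempdir()
inputs = [joinpath(indir, "file$i.txt") for i in 1:4]
foreach(p -> write(p, "x"^10), inputs)

results = pmap(p -> process_and_save(p, outdir), inputs; batch_size=1)
```

Here results is just a vector of nothing values, one per input file, and the actual output lives in outdir.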

Ok, simple explanation: there was an error. It waited until each worker was done to stop the program. I guess I was confused because you don’t see the error until each worker is done.
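For next time, one option might be to catch and log the error inside the mapped function, so it shows up in the logs right away instead of only once everything has stopped. This is a rough sketch under assumptions: safe_process and the failing input are invented for the example, and returning the exception as a value is just one way to mark a failed task.

```julia
using Distributed

# Hypothetical task that fails on one input. The error is logged where it
# happens and returned as a value, so the remaining tasks keep running.
@everywhere function safe_process(x)
    try
        x == 3 && error("bad input $x")   # made-up failure for illustration
        return nothing
    catch err
        @error "task failed" x err
        return err
    end
end

results = pmap(safe_process, 1:5; batch_size=1)

# Indices of the tasks that failed, to inspect or rerun afterwards.
failed = findall(r -> r isa Exception, results)
```

pmap also accepts an on_error keyword that handles exceptions on the calling side, but logging inside the function keeps the message next to the file that caused it.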

My understanding is different. I think @distributed does static scheduling as you describe, but pmap is dynamic, although not necessarily fault tolerant.
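To show what I mean, here is a small sketch of the two side by side. The durations vector is a made-up workload, and the sketch assumes workers have already been added (with addprocs or the -p flag); without any workers both versions simply run on the master process.

```julia
using Distributed

# Hypothetical workload: task durations vary a little, in seconds.
durations = [0.2, 0.1, 0.1, 0.1]

# pmap: a worker picks up the next item as soon as it becomes free
# (dynamic scheduling). Each call returns the id of the process that ran it.
ids = pmap(durations) do d
    sleep(d)
    myid()
end

# @distributed: the index range is divided among the workers up front
# (static scheduling), so a worker with a slow chunk finishes last.
@sync @distributed for i in eachindex(durations)
    sleep(durations[i])
end
```

Looking at ids after a real run shows which process handled which item.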

It depends. If there are no remote workers, it’s not batched, but if there are and batch_size is set to something other than 1, it is. See the source for more info. It’s written in Julia, so it should be quite readable!