Parallelism - Understanding pmap and the batch_size parameter

I am running something that looks like the following:

pmap(x -> f(x, A, B, C), args)

I have 48 CPU cores and thus 48 workers after running addprocs. A, B, and C are DataFrames, the largest with approximate dimensions 2e6 × 8, and f performs a series of split-apply-combine operations over the DataFrames, parameterized by x, and outputs a 1×4 array.

pmap provides a significant performance increase over the regular map, however, I do not fully understand how the batch_size parameter affects execution and performance. On a sample of 500 inputs, my run-times scale with batch_size as follows:

map(x -> f(x, A, B, C), args[1:500])   # ~176.8 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=1) # ~19.3 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=2) # ~10.1 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=10) # ~7.2 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=50) # ~6.5 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=100) # ~6.3 seconds
pmap(x -> f(x, A, B, C), args[1:500], batch_size=500) # ~6.4 seconds

What could explain the increase in performance with increasing batch size? Is it possibly due to a decrease in overhead from not having to broadcast the DataFrames to the workers as many times? Does the fact that I'm seeing close to optimal performance with a batch_size equal to the input size suggest that I may be better off using multithreading?
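If the overhead really is dominated by re-sending the captured DataFrames on every call, one thing worth trying (a sketch with toy stand-ins, not code from my actual problem) is wrapping the workers in a `CachingPool`, which caches the closure, and therefore the captured `A`, `B`, and `C`, on each worker instead of shipping it with every element:

```julia
using Distributed
addprocs(4)  # stand-in for the 48 workers in the question

# Toy stand-ins: the real A, B, C are large DataFrames and the real f
# does split-apply-combine work parameterized by x.
A, B, C = collect(1:5), collect(1:5), collect(1:5)
@everywhere g(x, A, B, C) = x * (length(A) + length(B) + length(C))

args = 1:20

# A CachingPool sends the closure (including the captured A, B, C)
# to each worker once, rather than once per element.
pool = CachingPool(workers())
results = pmap(x -> g(x, A, B, C), pool, args)
```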

The documentation for pmap briefly discusses batch_size:

pmap can also use a mix of processes and tasks via the batch_size argument.
For batch sizes greater than 1, the collection is processed in multiple
batches, each of length batch_size or less. A batch is sent as a single
request to a free worker, where a local asyncmap processes elements from the
batch using multiple concurrent tasks.

Does this mean that when I use batch_size=500 for a 500-element input, all of the computation is being done by a single worker, which manages asynchronous concurrent tasks? Can finding the right batch_size value be thought of as striking the right balance between parallelism and concurrency, subject to the size of the problem and the overhead from spawning processes and broadcasting the data?
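One way to check this empirically (a toy sketch using `myid()` to tag which process handled each element, not my real workload) is to compare how elements land on workers at the two extremes of batch_size:

```julia
using Distributed
addprocs(4)

# With batch_size equal to the input length, the whole collection is
# sent as one batch to a single free worker, which processes it with
# concurrent tasks via a local asyncmap.
ids_one_batch = pmap(_ -> myid(), 1:20, batch_size=20)

# With batch_size=1, each element is a separate request and can land
# on any free worker.
ids_per_element = pmap(_ -> myid(), 1:20, batch_size=1)

length(unique(ids_one_batch))  # 1: every element ran on the same worker
```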

I could also use some help understanding some of the process behavior I observed while testing. This first image displays my processes while I ran pmap with batch_size=1.

This second image displays my processes while I ran pmap with batch_size=500.

In the first image, where batch_size=1, we see that 10 of Julia's 48 processes are running at the time of the screenshot, with the rest sleeping, and most of the running processes utilizing less than 20% of their CPU capacity. In the second image, where batch_size=500, almost all of the processes are running simultaneously, most at close to full CPU capacity. How do we explain the observed differences between these two cases?


From here:

In the case where the time required to transfer data between processes is much longer than the computation time, it may be a good idea to set batch_size to a value greater than 1. In this case, each process will compute a batch of values before transferring the data.

As a rule of thumb, it is a good idea to set the batch size so that the computation time of a batch is at least 10 ms, this way most of the time will be spent on the computation of values and not on data transfer.
In general, if the total computation will take less than 100 ms, there’s no need to use multi-processing.
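That rule of thumb can be turned into a quick calculation. The helper below is hypothetical (not from the linked answer): it times a small sample serially, then suggests a batch size whose per-batch compute time is at least 10 ms:

```julia
# Hypothetical helper: estimate the per-element cost of f on a small
# sample, then return a batch size that keeps each batch's compute
# time at or above `target` seconds (10 ms by default).
function suggest_batch_size(f, sample; target = 0.010)
    t = @elapsed foreach(f, sample)   # serial timing of the sample
    per_element = t / length(sample)
    return max(1, ceil(Int, target / per_element))
end
```

Note that in the original question's case, the serial timing (176.8 s / 500 ≈ 0.35 s per element) means a single element already far exceeds the 10 ms threshold, so the observed gains from larger batches likely come from amortizing data transfer rather than from the batch length itself.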