Lack of improvement from distributed pmap, understanding a simple example

Thank you so much for your kind reply, this is very helpful.

If we consider the setup with BLAS.set_num_threads(1) (set with @everywhere), why can't we scale efficiently to 4 workers, and why is there almost no gain from going from 4 to 8?

Is it because the problem is memory-bound, or is there something else I am missing? Communication is deliberately kept to a minimum in the example, but I do need to create arrays in the problem that matters to me.
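
To make the setup I am asking about concrete, here is a minimal sketch of what I mean. The `work` function, the matrix size, the task count, and the worker count are placeholders I made up for illustration, not my actual code. Each task allocates its own array on the worker, so only a scalar travels back to the master process.

```julia
using Distributed

addprocs(4)  # placeholder worker count; change to 8 to compare

# Load LinearAlgebra on every process and pin BLAS to one thread per process,
# so the workers do not oversubscribe the cores.
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)

# Hypothetical stand-in for the real workload: allocate an array locally,
# do a matrix product, and return a single number.
@everywhere function work(n)
    A = rand(n, n)
    return sum(A * A)
end

# Time the distributed map; only the scalar results are sent back.
@time results = pmap(_ -> work(200), 1:32)
```

If the time barely changes between 4 and 8 workers with something like this, I would guess memory bandwidth or allocation is the limit rather than communication, but that is exactly the part I am unsure about.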