Thank you so much for your kind reply; this is very helpful.
If we consider the setup with BLAS.set_num_threads(1) (@everywhere), why can’t we scale efficiently to 4 workers, and why is there almost no gain from going from 4 to 8?
Is it because it’s a memory-bound problem, or is there something else I am missing? (Communication is deliberately kept to a minimum in the example, but I do need to create arrays in the actual problem I care about.)