Help with parallelism

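I suspect the bottleneck here is the memory bus (between RAM and the CPU cache). To compute expm(obj), each worker has to read the entire 1000x1000 matrix, so the workers end up competing for the same memory bandwidth rather than running independently. In the end the parallel version is slower, because the pmap overhead comes on top of that.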

Is there some kind of benchmark to measure the memory bus’s bandwidth? It should be possible to do a back-of-the-envelope estimate based on that.
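
For what it's worth, here is a rough sketch of the kind of measurement I have in mind, written in Python with only the standard library so it is easy to try anywhere. It times a large buffer copy and turns that into an approximate bytes-per-second figure, then estimates how long a single pass over the matrix would take at that rate. The function names (`measure_copy_bandwidth`, `matrix_pass_seconds`) and the buffer size are my own choices, and I am assuming 8-byte (Float64) elements. A single-threaded copy will not necessarily saturate the bus, and expm makes many passes over the data, so this is only a ballpark lower bound on memory time, not a proper benchmark.

```python
import time


def measure_copy_bandwidth(nbytes=64 * 1024 * 1024, repeats=5):
    """Estimate memory copy bandwidth in bytes per second.

    The buffer should be much larger than the last-level cache,
    otherwise this measures the cache rather than RAM.
    """
    # Fill the source with non-zero bytes so its pages are really allocated.
    src = bytearray(b"\x01") * nbytes
    dst = bytearray(nbytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy of the whole buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # A copy reads nbytes and writes nbytes, so count the traffic twice.
    return 2 * nbytes / best


def matrix_pass_seconds(n, bandwidth, itemsize=8):
    """Time to stream an n x n matrix through memory once.

    itemsize=8 assumes Float64 elements (an assumption, not from the post).
    """
    return n * n * itemsize / bandwidth


if __name__ == "__main__":
    bw = measure_copy_bandwidth()
    print(f"approx. copy bandwidth: {bw / 1e9:.1f} GB/s")
    print(f"one pass over 1000x1000 Float64: {matrix_pass_seconds(1000, bw) * 1e3:.3f} ms")
```

Comparing the per-pass time against the measured time of expm(obj) on one worker should at least show whether memory traffic is a plausible explanation or whether something else dominates.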