Bad performance when using Multithreading and Distributed with heavy LinearAlgebra calculations

Memory bound means that performance is limited by the memory bandwidth; compute bound means that performance is limited by the speed (and number) of the CPU cores.

The picture to have in mind here is the roofline model.

In this simplified model there are essentially two resources: memory throughput and compute throughput. Loading values from RAM and storing them back consumes memory throughput; everything else consumes compute throughput. Parallelization increases the available compute throughput but does not increase memory throughput, so using more threads only helps as long as the memory bandwidth is not already saturated.
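To make the roofline picture concrete, here is a minimal sketch. The function name `attainable_gflops` and all hardware numbers (bandwidth, per-core peak, the matmul intensity) are made-up assumptions for illustration, not measurements of any real machine.

```julia
# Roofline estimate: attainable throughput is capped by whichever
# resource saturates first, compute or memory.
# All hardware numbers below are illustrative assumptions.

"""
    attainable_gflops(peak_gflops, bandwidth_gbs, intensity)

Roofline bound in GFLOP/s for a kernel with arithmetic intensity
`intensity` (FLOP per byte moved to or from RAM).
"""
attainable_gflops(peak_gflops, bandwidth_gbs, intensity) =
    min(peak_gflops, bandwidth_gbs * intensity)

bandwidth     = 40.0   # GB/s, shared by all cores (assumed)
peak_per_core = 50.0   # GFLOP/s per core (assumed)

# c .= a .+ b on Float64: 1 FLOP per 24 bytes (two loads, one store).
low_intensity = 1 / 24
# A large matrix multiplication reuses each loaded value many times (assumed value).
high_intensity = 20.0

for ncores in (1, 4, 16)
    peak = ncores * peak_per_core
    add_bound = attainable_gflops(peak, bandwidth, low_intensity)
    mul_bound = attainable_gflops(peak, bandwidth, high_intensity)
    println(ncores, " cores: vector add <= ", round(add_bound; digits=2),
            " GFLOP/s, matmul <= ", round(mul_bound; digits=2), " GFLOP/s")
end
```

With these assumed numbers the bound for the vector addition stays the same no matter how many cores are added, because it sits on the memory roof, while the bound for the high-intensity kernel grows with the core count until it reaches the memory roof as well.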

Well, we also have cache memory that has a much higher memory bandwidth, but only a limited size.

Is the cache used automatically in Julia when the array fits in the cache?

By the way, thanks to everyone who made the situation much clearer.

This isn’t a Julia thing but a chip thing. CPUs generally don’t let programs manage the cache directly; the hardware automatically keeps the most recently used memory in cache.
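One rough way to see this yourself is to touch the same total number of values twice: once by re-reading a small array many times, and once by streaming through a large array. This is only a sketch: the helper `repeated_sum` is a hypothetical name, and the array sizes are assumptions about typical cache sizes, so the size of the gap (if any) depends on your hardware.

```julia
# Sum the array `x` repeatedly, `passes` times in total.
function repeated_sum(x, passes)
    total = 0.0
    for _ in 1:passes
        total += sum(x)
    end
    return total
end

small = ones(2^12)   # 32 KiB, assumed to fit in the L1/L2 cache
large = ones(2^24)   # 128 MiB, assumed to exceed the last-level cache
passes = length(large) ÷ length(small)   # same number of elements touched overall

# Warm up so that compilation time is not included in the timings.
repeated_sum(small, 1)
repeated_sum(large, 1)

t_small = @elapsed repeated_sum(small, passes)
t_large = @elapsed repeated_sum(large, 1)

println("small array, ", passes, " passes: ", round(t_small * 1e3; digits=2), " ms")
println("large array, 1 pass: ", round(t_large * 1e3; digits=2), " ms")
```

No cache-specific code is involved in either case; any difference in the timings comes from the hardware serving the small array out of cache after its first pass.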


Ok, thanks a lot.