I have 2 machines:
- An M1 Mac Mini with horrible thermals, a peak clock of 3.2 GHz, and 8 GB of RAM (absolutely love it though)
- A beefy Linux (Ubuntu 21.10) workstation with 64 GB of DDR5 RAM and a 12th-gen 4.9 GHz processor with 16 threads (8 physical cores, efficiency cores disabled)
I have some code, seen here, that is used as a function to solve a PDE. This is the parallel version of the code; to get the serial version you just remove `Threads.@spawn` and `@sync`.
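For illustration, the threading pattern looks roughly like this (a simplified stand-in for the kernel, not the actual code from the repo):

```julia
using Base.Threads

# Simplified stand-in for the update kernel: the domain is split into
# chunks and each chunk is updated by its own task. Dropping @sync and
# Threads.@spawn (and the chunking) gives the serial version.
function rhs_parallel!(du, u, chunks)
    @sync for idx in chunks
        Threads.@spawn @inbounds for i in idx
            du[i] = u[i]    # placeholder for the real stencil update
        end
    end
    return du
end
```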
Running `example.jl` from that repo and timing the DifferentialEquations.jl `solve` function, we get:
| | Mac | Workstation Serial | Workstation FLoops |
|---|---|---|---|
| t (s) | ~60 | ~60 | ~60 |
As you can see, serial performance is essentially the same on the two machines, and the workstation's parallel version (after moving to FLoops) is no faster either.
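The timing itself was done roughly like this (`prob` is a placeholder for the problem set up in example.jl; the first call is discarded to exclude compilation):

```julia
using DifferentialEquations

# `prob` is a placeholder for the problem built in example.jl.
solve(prob)              # warm-up call, discarded (includes compilation)
@time sol = solve(prob)  # this is the ~60 s number in the table above
```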
I also tried `@btime rand(10000,10000)*rand(10000,10000)`. The M1 takes 11.591 s and the workstation takes 9.730 s, which is almost the same. Something is going on.
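For what it's worth, that matrix product is handled by the BLAS library, so the relevant thread settings on each machine can be checked with something like this (sketch):

```julia
using LinearAlgebra, BenchmarkTools, InteractiveUtils

versioninfo()                 # Julia version, OS, and CPU info
@show Threads.nthreads()      # Julia threads (set with --threads)
@show BLAS.get_num_threads()  # threads used by the BLAS backend

A = rand(10_000, 10_000)
B = rand(10_000, 10_000)
@btime $A * $B;               # the same matmul benchmark as above
```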
Here are my questions:
- (General) Does higher clock speed always mean quicker code if everything else is OK, or are there bottlenecks that clock speed won't overcome?
- How do I improve the serial performance? Why is a water-cooled 4.9 GHz processor barely keeping up with a 3.2 GHz processor with poor thermals?
- What's wrong with my parallel implementation? I tried LoopVectorization.jl with `@tturbo` but couldn't get it working at all (see the simplified sketch after this list).
- Is there some good benchmarking code for this?
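For reference, this is the kind of pattern I was aiming for with `@tturbo` (a simplified stand-in loop, not my actual kernel):

```julia
using LoopVectorization

# Simplified stand-in for the inner stencil loop, not the real kernel.
# @tturbo is the threaded variant of @turbo and only accepts fairly
# simple loop bodies (no arbitrary control flow or opaque function calls).
function laplacian!(du, u, dx)
    inv_dx2 = 1 / dx^2           # hoist the constant out of the loop
    @tturbo for i in 2:length(u)-1
        du[i] = (u[i-1] - 2u[i] + u[i+1]) * inv_dx2
    end
    return du
end
```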
Thanks in advance, I hope this brings up some interesting discussion and learning opportunities.