Performance with n Threads
I have a script that parses a 1 GB file and writes out two CSV files. I want to make it faster by parallelizing it on a large cloud instance. I’m currently testing on an AWS c5.9xlarge instance with the following stats:
$ lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s): 36
On-line CPU(s) list: 0-35
Thread(s) per core: 2
Socket(s): 1
NUMA node(s): 1
Model name: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
NUMA node0 CPU(s): 0-35
(base) ubuntu@ip-***********:~/ML_Ops$ neofetch
ubuntu@ip-172-31-4-229
----------------------
OS: Ubuntu 20.04.2 LTS x86_64
Host: c5.9xlarge
Kernel: 5.4.0-1045-aws
Uptime: 1 hour, 7 mins
Packages: 658 (dpkg), 7 (snap)
Shell: bash 5.0.17
Terminal: /dev/pts/0
CPU: Intel Xeon Platinum 8124M (36) @ 1.286GHz
GPU: 00:03.0 Amazon.com, Inc. Device 1111
Memory: 368MiB / 70240MiB
After compilation, a single run with 8 threads takes:
93.440511 seconds (813.48 M allocations: 67.612 GiB, 64.37% gc time)
However, performance stops improving after ~8 threads (starting Julia with julia -t 8), as the graph below shows.
Performance with n Threads
┌ ┐
1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 272
Threads 4 ┤■■■■■■■■■■■■■■ 111
8 ┤■■■■■■■■■■■■ 93
16 ┤■■■■■■■■■■■■■ 104
└ ┘
Time (s)
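To put numbers on the scaling, here is a minimal sketch that computes speedup and per-thread efficiency from the timings in the chart above. The variable names are just for illustration; the only inputs are the four measured times.

```julia
# Wall-clock times (seconds) read off the chart above.
threads = [1, 4, 8, 16]
times = [272.0, 111.0, 93.0, 104.0]

# Speedup relative to the single-threaded run, and efficiency per thread.
speedup = times[1] ./ times
efficiency = speedup ./ threads

for (n, t, s, e) in zip(threads, times, speedup, efficiency)
    println("threads=", n,
            "  time=", t, "s",
            "  speedup=", round(s, digits=2), "x",
            "  efficiency=", round(100 * e, digits=1), "%")
end
```

With these numbers, 8 threads give roughly a 2.9x speedup over 1 thread, and 16 threads are slower than 8.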
As the number of threads increases, the GC time also seems to increase, erasing the gains from the additional threads.
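For reference, the 64.37% GC share reported for the 8-thread run corresponds to about 60 of the 93.4 seconds. Below is a rough sketch of how the GC share could be measured per thread count using `@timed`, which returns the elapsed time and GC time. The function `allocating_work` is a hypothetical stand-in for my real parsing step (it only builds lots of small strings), and `gc_share` is an illustrative helper, not part of my script.

```julia
using Base.Threads

# Hypothetical stand-in for the real parsing step: it allocates many small
# strings per item, which is the kind of work that puts pressure on the GC.
function allocating_work(n)
    total = 0
    for i in 1:n
        total += length(string(i, ",", i + 1))
    end
    return total
end

# Run one chunk of work per thread and report how much wall time went to GC.
function gc_share(n_per_task)
    stats = @timed begin
        results = zeros(Int, nthreads())
        @threads for t in 1:nthreads()
            results[t] = allocating_work(n_per_task)
        end
        sum(results)
    end
    return (time = stats.time, gctime = stats.gctime,
            share = stats.gctime / stats.time, bytes = stats.bytes)
end

gc_share(10_000)           # warm-up call so compilation is not timed
r = gc_share(1_000_000)
println("threads=", nthreads(),
        "  time=", round(r.time, digits=3), "s",
        "  gc=", round(100 * r.share, digits=1), "%")
```

Running this with julia -t 1, -t 4, -t 8 and -t 16 would show whether the GC share grows with the thread count for an allocation-heavy loop like this one; it says nothing definitive about the real script.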
Is there anything I can do to fix this? Throwing more compute and memory at it is an option, but moving from a 16-core instance to the current 36-CPU one didn’t reduce the time by more than 10 seconds. I would have thought that having more memory than the total amount being garbage collected would solve this, but that doesn’t seem to work either.