Diminishing returns to `nthreads`

I have a script that parses a 1 GB file and writes out two CSV files. I want to make it faster by parallelizing it on a large cloud instance. I’m currently testing on an AWS c5.9xlarge instance with the following stats:

$ lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s):                          36
On-line CPU(s) list:             0-35
Thread(s) per core:              2
Socket(s):                       1
NUMA node(s):                    1
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
NUMA node0 CPU(s):               0-35
(base) ubuntu@ip-***********:~/ML_Ops$ neofetch
OS: Ubuntu 20.04.2 LTS x86_64
Host: c5.9xlarge
Kernel: 5.4.0-1045-aws
Uptime: 1 hour, 7 mins
Packages: 658 (dpkg), 7 (snap)
Shell: bash 5.0.17
Terminal: /dev/pts/0
CPU: Intel Xeon Platinum 8124M (36) @ 1.286GHz
GPU: 00:03.0 Amazon.com, Inc. Device 1111
Memory: 368MiB / 70240MiB

After compilation, a single run using 8 threads takes:

 93.440511 seconds (813.48 M allocations: 67.612 GiB, 64.37% gc time)

However, performance stops improving after ~8 threads (starting Julia with `julia -t 8`), as you can see in the graph below.

                     Performance with n Threads
              ┌                                        ┐ 
            1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 272   
   Threads  4 ┤■■■■■■■■■■■■■■ 111                        
            8 ┤■■■■■■■■■■■■ 93                           
           16 ┤■■■■■■■■■■■■■ 104                         
              └                                        ┘ 
                              Time (s)
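
For reference, a bar chart like this can be produced with UnicodePlots.jl; the following is only a sketch, with the times copied from the runs above:

    using UnicodePlots

    # Wall-clock times (s) measured for each `julia -t N` run, copied from the chart above.
    threads = ["1", "4", "8", "16"]
    times   = [272, 111, 93, 104]

    barplot(threads, times;
            title = "Performance with n Threads",
            xlabel = "Time (s)",
            ylabel = "Threads")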

As the number of threads increases, the GC time also seems to increase, erasing the improvements from the additional threads.

Is there anything I can do to fix this? Throwing more compute and memory at it is an option, but moving from a 16-core instance to the current 36-core one didn’t reduce the time by more than 10 seconds. I would have thought that increasing the memory capacity beyond the amount being GC’d might solve this, but that doesn’t seem to work either.
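
A minimal sketch of how the GC share per run can be tracked (with `process_file` as a hypothetical stand-in for the script’s parse-and-write step):

    # `process_file` is a hypothetical stand-in for the real work.
    stats = @timed process_file("export.json")
    println("elapsed:   ", round(stats.time; digits = 1), " s")
    println("allocated: ", round(stats.bytes / 2^30; digits = 1), " GiB")
    println("GC share:  ", round(100 * stats.gctime / stats.time; digits = 1), " %")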

The biggest improvement to make here is to use a format better suited to machine reading than CSV. Something like Arrow.jl will likely be 10x faster and work better with multiple cores. CSV is hard for computers to read because you need to do lots of string processing, and you can’t use seek commands to jump fixed distances in the file.
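
As a rough illustration of the difference (a sketch with a made-up table, not your actual data):

    using Arrow, CSV, DataFrames

    # Hypothetical table standing in for the parsed output.
    df = DataFrame(id = 1:1_000_000, value = rand(1_000_000))

    # CSV: text-based, so every read re-parses strings into numbers.
    CSV.write("out.csv", df)
    @time CSV.read("out.csv", DataFrame)

    # Arrow: binary and columnar; reading is mostly a memory-map.
    Arrow.write("out.arrow", df)
    @time DataFrame(Arrow.Table("out.arrow"))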


I’m reading nested JSON data from a DynamoDB export, so I can’t convert the input to Arrow. I could write the output to Arrow or another format, but that’s only about 10 seconds of the total. Most of the time is spent reading, parsing, and flattening the JSON documents.
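
For concreteness, a sketch of that step, assuming the export is newline-delimited JSON and using JSON3.jl (the flattening logic is left out, and the names here are not from the real script):

    using JSON3

    # Assumption: one JSON document per line, as in a typical DynamoDB export.
    function parse_export(path)
        lines = readlines(path)
        rows = Vector{Any}(undef, length(lines))
        Threads.@threads for i in eachindex(lines)
            rows[i] = JSON3.read(lines[i])   # the real flattening step would go here
        end
        return rows
    end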


In that case, your best option is to try processing multiple CSVs at a time. There’s a pretty hard limit on how many processes will be beneficial for this.
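
A sketch of that approach, assuming the export can be split into several files beforehand and using Distributed.jl so that each worker process gets its own heap and GC (all names here are hypothetical):

    using Distributed
    addprocs(4)   # separate worker processes, each with its own GC

    @everywhere function process_file(path)
        # Placeholder for the real work: parse, flatten, and write a CSV per chunk.
        return countlines(path)
    end

    # Assumption: the export has already been split into per-chunk files.
    files = filter(endswith(".json"), readdir("export_chunks"; join = true))
    results = pmap(process_file, files)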
