Diminishing returns to `nthreads`

I have a script that parses a 1 GB file and writes out two CSV files. I want to make it faster by parallelizing it on a large cloud instance. I’m currently testing on an AWS c5.9xlarge instance with the following stats:

$ lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
CPU(s):                          36
On-line CPU(s) list:             0-35
Thread(s) per core:              2
Socket(s):                       1
NUMA node(s):                    1
Model name:                      Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
NUMA node0 CPU(s):               0-35
(base) ubuntu@ip-***********:~/ML_Ops$ neofetch
OS: Ubuntu 20.04.2 LTS x86_64
Host: c5.9xlarge
Kernel: 5.4.0-1045-aws
Uptime: 1 hour, 7 mins
Packages: 658 (dpkg), 7 (snap)
Shell: bash 5.0.17
Terminal: /dev/pts/0
CPU: Intel Xeon Platinum 8124M (36) @ 1.286GHz
GPU: 00:03.0 Amazon.com, Inc. Device 1111
Memory: 368MiB / 70240MiB

After compilation, a single run using 8 threads takes:

 93.440511 seconds (813.48 M allocations: 67.612 GiB, 64.37% gc time)

However, performance stops improving after ~8 threads (starting Julia with `julia -t 8`), as you can see in the graph below.

                     Performance with n Threads
              ┌                                        ┐ 
            1 ┤■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ 272   
   Threads  4 ┤■■■■■■■■■■■■■■ 111                        
            8 ┤■■■■■■■■■■■■ 93                           
           16 ┤■■■■■■■■■■■■■ 104                         
              └                                        ┘ 
                              Time (s)
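
For reference, a bar chart like this can be produced with UnicodePlots.jl; the following is only a sketch, with the times copied from the runs above:

    using UnicodePlots

    # Wall-clock times (s) measured for each `julia -t N` run, copied from the chart above.
    threads = ["1", "4", "8", "16"]
    times   = [272, 111, 93, 104]

    barplot(threads, times;
            title = "Performance with n Threads",
            xlabel = "Time (s)",
            ylabel = "Threads")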

As the number of threads increases, the GC time also seems to increase, erasing the improvements from the additional threads.

Is there anything I can do to fix this? Throwing more compute and memory at it is an option, but moving from a 16-core instance to the current 36-core one didn’t reduce the time by more than 10 seconds. I would have thought that increasing the memory capacity beyond the amount being GC’d might solve this, but that doesn’t seem to work either.
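
A minimal sketch of how the GC share per run can be tracked (with `process_file` as a hypothetical stand-in for the script’s parse-and-write step):

    # `process_file` is a hypothetical stand-in for the real work.
    stats = @timed process_file("export.json")
    println("elapsed:   ", round(stats.time; digits = 1), " s")
    println("allocated: ", round(stats.bytes / 2^30; digits = 1), " GiB")
    println("GC share:  ", round(100 * stats.gctime / stats.time; digits = 1), " %")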

The biggest improvement to make here is to use a format better suited to machine reading than CSV. Something like Arrow.jl will likely be 10x faster and work better with multiple cores. CSV is hard for computers to read because you need to do lots of string processing, and you can’t use seek commands to jump fixed distances in the file.
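
As a rough illustration of the difference (a sketch with a made-up table, not your actual data):

    using Arrow, CSV, DataFrames

    # Hypothetical table standing in for the parsed output.
    df = DataFrame(id = 1:1_000_000, value = rand(1_000_000))

    # CSV: text-based, so every read re-parses strings into numbers.
    CSV.write("out.csv", df)
    @time CSV.read("out.csv", DataFrame)

    # Arrow: binary and columnar; reading is mostly a memory-map.
    Arrow.write("out.arrow", df)
    @time DataFrame(Arrow.Table("out.arrow"))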


I’m reading nested JSON data from a DynamoDB export, so I can’t convert the input to Arrow. I could write the output to Arrow or another format, but that’s only about 10 seconds of the total. Most of the time is spent reading, parsing, and flattening the JSON documents.
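
For concreteness, a sketch of that step, assuming the export is newline-delimited JSON and using JSON3.jl (the flattening logic is left out, and the names here are not from the real script):

    using JSON3

    # Assumption: one JSON document per line, as in a typical DynamoDB export.
    function parse_export(path)
        lines = readlines(path)
        rows = Vector{Any}(undef, length(lines))
        Threads.@threads for i in eachindex(lines)
            rows[i] = JSON3.read(lines[i])   # the real flattening step would go here
        end
        return rows
    end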


In that case, your best option is to try processing multiple CSVs at a time. There’s a pretty hard limit on how many processes will be beneficial for this.
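
A sketch of that approach, assuming the export can be split into several files beforehand and using Distributed.jl so that each worker process gets its own heap and GC (all names here are hypothetical):

    using Distributed
    addprocs(4)   # separate worker processes, each with its own GC

    @everywhere function process_file(path)
        # Placeholder for the real work: parse, flatten, and write a CSV per chunk.
        return countlines(path)
    end

    # Assumption: the export has already been split into per-chunk files.
    files = filter(endswith(".json"), readdir("export_chunks"; join = true))
    results = pmap(process_file, files)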
