Computer-specific slowdown with multi-threading on a computer cluster (Linux)?

I worked out how to use this a while ago, but there seems to have been a complete overhaul since then. I tried to get it to work, but it didn’t produce the desired results. I’m even more confused now.

I still think that since the code works on another computer, it should work here without precompilation or other tricks. I liked your suggestion about latency on the cluster, however. If you have any other ideas to try, please let me know.

Very keen! Indeed, the cluster prominently lists the use of this on their forums. Do you have some more resources to read, or some sample code showing how to do this? Thank you!

On Slurm, I think you want to request only one socket per node, via the --sockets-per-node srun flag. Or you might need to go in at a lower level and specify the NUMA locality domain? Note: I have never done this on Slurm myself; when I solved this in my own work, it was with a PBS scheduler.
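Something along these lines is what I have in mind. This is an untested sketch on my part, and the partition, time, CPU counts, and script name are placeholders for whatever your cluster actually uses:

```bash
#!/bin/bash
# Hypothetical Slurm submission script: request a single socket per node
# so all threads share one NUMA domain. Adjust the counts for your cluster.
#SBATCH --nodes=1
#SBATCH --sockets-per-node=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

# Launch Julia with one thread per allocated CPU
srun julia --threads=$SLURM_CPUS_PER_TASK my_script.jl
```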

Also, people who understand things like CPU architecture much better than I do should chime in if this is actually a really stupid thing to do. But please let me know if this (or something like it) fixes the problem for you. I’ve never heard anyone else talk about this, so I’m not sure whether it is a general problem that everyone understands, or whether it’s actually not a problem at all and I’m entirely misdiagnosing the issue.

Thank you so much for these details. It hasn’t solved the problem (in fact, it made it worse).

I’m just going to assume that this is something to do with the cluster. I want to thank everyone on here, and I will post the solution if we find it. Thank you again! More suggestions are welcome in the meantime, but I think we’ve exhausted the simple, general ideas.

Speaking of NUMA zones, I would advise everyone to get familiar with the hwloc package. Install hwloc on your system,
then run ‘lstopo’, which displays the logical layout of your system: cores, shared caches, NUMA zones, and where the I/O devices are attached.

https://www.open-mpi.org/projects/hwloc/
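For example (the install command below is just the Debian/Ubuntu flavour; on many clusters hwloc is already available as a module):

```bash
# Install hwloc (Debian/Ubuntu shown; use your distro's package manager,
# or `module load hwloc` on many clusters)
sudo apt-get install hwloc

# Text/graphical summary of the machine topology: packages, NUMA nodes,
# caches, cores, and attached I/O devices
lstopo --no-io          # omit I/O devices for a shorter view
lstopo-no-graphics      # plain console output, handy over SSH
```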

I managed to get a version of the code to run at the same level of performance as on the laptop. I went back to @Henrique_Becker 's suggestion and just turned off a few optimization flags. This was the clearest issue I could attribute it to.

However, I did play around a lot with the submission script and asked for a new install of Julia at v1.6. I can’t rule out that one of those changes affected the performance, though the performance on the laptop certainly wasn’t affected by them. It could be that there is some deep issue with running code on Mac vs. Linux (and there are some other threads on this), but I think I’ll give the “solved” option to @Henrique_Becker, even though the solutions from @stillyslalom and @jacobadenbaum probably also helped and will be useful for anyone else having this issue.
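For anyone else trying this: the general mechanism is Julia’s command-line flags. The sketch below is only illustrative; I’m not claiming these are the exact flags that mattered in my case.

```bash
# Illustrative only -- the exact flags to relax depend on the case.
# Lower the optimizer level (default is -O2):
julia -O1 my_script.jl

# Or compile for a generic CPU target instead of the host's native ISA,
# which can matter when login and compute nodes have different CPUs:
julia --cpu-target=generic my_script.jl
```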

Thank you to everyone who helped out! I really appreciate this community!


I work as an HPC engineer for Dell, so I’m very happy to discuss process pinning and BIOS settings if I can help.

numactl --hardware will display your system’s NUMA zones.

numastat will show NUMA misses, i.e. memory accesses that had to cross over to another NUMA zone to fetch data, which takes more time (hence non-uniform).
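For example (a sketch; the node number, thread count, and script name in the last line are placeholders):

```bash
# Show the NUMA layout: node count, which CPUs belong to each node,
# per-node memory sizes, and the inter-node distance matrix
numactl --hardware

# Per-node counters; growing numa_miss / numa_foreign values mean memory
# is being fetched from a remote NUMA zone
numastat

# Pin a process and its memory to a single NUMA node (node 0 here)
numactl --cpunodebind=0 --membind=0 julia --threads=8 my_script.jl
```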

If anyone is using modern processors such as Intel Cascade Lake / Ice Lake or AMD Rome/Milan, there are BIOS settings which affect how the CPU is arranged into NUMA domains, and this can affect performance.

Intel: Sub-NUMA Clustering (SNC) in the BIOS

AMD: NUMA nodes Per Socket (NPS) in the BIOS

https://www.dell.com/support/kbdoc/en-uk/000176921/bios-characterization-for-hpc-with-intel-cascade-lake-processors

https://www.dell.com/support/kbdoc/en-uk/000137696/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance?lang=en

https://infohub.delltechnologies.com/p/amd-milan-bios-characterization-for-hpc/


I had the same problem. In my case, this discussion clarified the issue.
I use workers (via Distributed) for the more expensive functions now. The less expensive functions are handled by threading via ThreadsX.map() from the package ThreadsX.jl.
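In case it helps, here is a minimal sketch of that split; the function bodies, data, and worker count are placeholders rather than my actual code:

```julia
using Distributed
addprocs(4)                               # worker processes for the expensive calls

@everywhere expensive(x) = sum(abs2, x)   # placeholder for the costly function

using ThreadsX
cheap(x) = x .+ 1                         # placeholder for the light function

data = [rand(1_000) for _ in 1:100]

# Heavy work is farmed out to the Distributed workers...
heavy = pmap(expensive, data)

# ...while light work stays in the main process, threaded via ThreadsX.map
light = ThreadsX.map(cheap, data)
```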