Oh my! That’s a great detail to know! I’m sad that I missed it. Setting only the JULIA_NUM_THREADS to 4 and everything else to 1 doesn’t make too much of a difference in this application. I just ran a test and this wasn’t it…but it is good to know.
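For anyone wanting to reproduce that test, a minimal sketch of the environment might look like the following. It assumes that "everything else" means the thread pools of the numerical libraries; the OPENBLAS, MKL and OMP variable names are the usual ones for those libraries, but which of them actually matter depends on what the code links against.

```shell
# Julia's own threads (read once, when Julia starts).
export JULIA_NUM_THREADS=4

# Assumption: "everything else" refers to the library thread pools.
# Pinning them to 1 keeps them from oversubscribing the 4 Julia threads.
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=1

echo "julia=$JULIA_NUM_THREADS openblas=$OPENBLAS_NUM_THREADS mkl=$MKL_NUM_THREADS omp=$OMP_NUM_THREADS"
```

These have to be exported before Julia is launched, for example at the top of the submission script.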
@Henrique_Becker: How do the two clusters (i.e. the resources that you actually request) and your laptop compare to each other in terms of number of CPUs? What kind of CPUs are we talking about?
Sure, here are a few more details. The laptop has a regular Intel CPU (no ARM): an Intel i7 at 2.8 GHz with 16 GB of RAM. There are 4 physical cores (so 8 with hyper-threading, but I only use 4).
The cluster is Intel Xeon 6138 20-core 2.0 GHz with 192 GB RAM. There are two processors on each node, so there are 40 physical cores in total for a single node.
BTW, a cluster node with 2-8 cores sounds strange to me. I’m used to something like 24 cores per CPU and typically 2 CPUs per node, i.e. 48 cores per node (96 if you count hyperthreading).
This is true, the cluster has 40 cores per node. I’m only reserving some of them.
Here’s one additional detail: I tried two consecutive runs of the algorithm. The code is slow (10x slower) on the first run. Then it runs quickly on the second (although perhaps not quite as fast as it should). I could put in a simple dummy calculation on the first try, but this seems a little unnecessary.
Do these two runs happen in the same process? Julia compiles any function the first time it is called. This is true of all Julia code (not just multithreaded code). So a script that uses a lot of code (including code from inside libraries) but does little work will take many times longer on the first run than on any subsequent run. However, 20s to 30m is absurd, and the compilation time should not differ that much depending on whether you run the code in parallel or not.
I do this for all scientific experiments in Julia. It is necessary to force Julia to compile all the code with a dummy instance if you wanna report timings that make some sense in a journal.
The problem is likely tied to your cluster’s filesystem - if many processes are trying to access the same set of precompilation cache files at the same time, you may see a significant slowdown if the filesystem’s handling of parallel access is particularly poor (more so if the precompilation cache is hosted in your home directory, which might be accessible to the cluster over a relatively slow network link). @johnh is the resident expert in alleviating this sort of pain - he’ll probably recommend something like copying your .julia/ directory to the node before launching Julia to avoid network contention, or at least making sure .julia/ is located on a fast filesystem (e.g. Lustre).
Yes, I have some familiarity with this when using Julia. The code is very large and has many iterations in it. This would apply to the first iteration in the code and that is slow, but subsequent iterations are faster on another computer. So, I still think the problem is with the computer itself.
Aha…that would be a good candidate for a solution here. I had read the post you link to, but I wasn’t using @everywhere, so I thought it must have been something else. In fact, this would make a lot of sense. The cluster is particularly new, so I’m very curious if this is the issue. Very keen!
Is your suggestion to simply copy the .julia directory into the working directory? If so, could you provide a link on how to make Julia use that copy instead of the one in the home directory? I like this idea!
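One way to do this is through the JULIA_DEPOT_PATH environment variable, which tells Julia where to look for packages and precompilation caches. The sketch below is only an illustration: it assumes the depot sits at the default $HOME/.julia and that node-local scratch is reachable through TMPDIR (many clusters use a different variable or path).

```shell
# Source depot: the default location in the home directory.
SRC_DEPOT="$HOME/.julia"
# Node-local destination. TMPDIR is an assumption; use whatever local
# scratch space the cluster provides.
LOCAL_DEPOT="${TMPDIR:-/tmp}/julia_depot_$$"

mkdir -p "$LOCAL_DEPOT"
# Copy the depot contents if there is one; otherwise start with an empty depot.
if [ -d "$SRC_DEPOT" ]; then
    cp -R "$SRC_DEPOT/." "$LOCAL_DEPOT/"
fi

# Julia consults the first entry first. The trailing colon asks Julia to
# append its default depots as a fallback.
export JULIA_DEPOT_PATH="$LOCAL_DEPOT:"
echo "Julia depot for this job: $JULIA_DEPOT_PATH"
```

This would go in the submission script before the line that launches Julia, so each job reads its caches from local disk instead of the shared home directory.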
It may be even faster to build a custom sysimage for your application and copy just that to your node(s) instead of copying the entirety of .julia/ (which will contain a bunch of unrelated files, and will still require some additional runtime compilation).
Another possible source for this problem: when working on servers that use a NUMA architecture (very common for HPC stuff these days), I have had trouble using multithreading. I have usually solved this by passing special instructions to the scheduler to schedule the entire task on a single NUMA node. You might try that to see if you see improvements.
Please forgive the late reply. Today was a busy day requiring my attention on other things.
I tried this suggestion (loading the necessary commands via the submission script), but it did not help with the slowdown.
I should also amend one statement I made. There are two functions in the code. The first one is always slow (this is the “big” function). Another function that has a few experimental optimizations is faster. This version of the code actually relies more heavily on Threads.@threads. So the first, slow function must be more related to BLAS and MKL.
I’m not sure what is going on here now…but the code is still slow.
I worked out how to use this a while ago, but there seems to be a complete overhaul. I tried to get this to work, but it didn’t produce the desired results. I’m even more confused now.
I still think that since the code works on another computer that it should work without precompilation or other tricks here. I liked your suggestion about latency on the cluster, however. If you have any other ideas to try, please let me know.
On Slurm, I think that you want to request only one socket per node, via the --sockets-per-node srun flag. Or you might need to go in at a lower level and specify the NUMA locality domain? Note: I have never done this on Slurm myself. When I solved this in my own work, it was using a PBS scheduler.
Also people who understand things like CPU architecture much better than I do should chime in if this is actually a really stupid thing to do. But please let me know if this (or something like it) fixes the problem for you. I’ve never heard anyone else talk about this, so I’m not sure if it is a general problem that everyone understands, or if it’s actually not a problem at all and I’m entirely misdiagnosing the issue.
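For the lower-level route, here is a rough sketch that binds the whole Julia process to one NUMA node with numactl rather than through the scheduler. Note that this swaps in a different tool from the srun flag above, so treat it as an untested alternative; the node number 0 and the script name my_script.jl are placeholders.

```shell
# NUMA node to confine the job to; 0 is an arbitrary choice.
NUMA_NODE=0
# Hypothetical script name; replace with the real entry point.
SCRIPT="my_script.jl"

# --cpunodebind restricts the process to the cores of that node and
# --membind forces its memory allocations to come from that node.
LAUNCH="numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE julia $SCRIPT"

if command -v numactl >/dev/null 2>&1 && command -v julia >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
    $LAUNCH
else
    # Tools or script missing: just show what would have been run.
    echo "Would run: $LAUNCH"
fi
```

Keep in mind that --membind is strict, so a job that needs more memory than one node holds will fail to allocate rather than spill over to the other node.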
Thank you so much for these details. It hasn’t solved the problem (in fact, it made it worse).
I’m just going to assume that this is something to do with the cluster. I want to thank everyone on here, and I will post the solution if we find it. Thank you again! More suggestions are welcome in the meantime, but I think we’ve exhausted the simple, general ideas.
Talking about NUMA zones, I would advise everyone to become familiar with the hwloc package. Install hwloc on your system.
Then run ‘lstopo’, which will display the logical layout of your system, with cores, shared caches, NUMA zones and where the I/O devices are attached.
I managed to get a version of the code to run at the same level of performance as on the laptop. I went back to @Henrique_Becker’s suggestion and just turned off a few optimization flags. This was the clearest cause I could attribute it to.
However, I did play around a lot with the submission script and asked for a new install of Julia (v1.6). I can’t rule out that one of those changes affected the performance, and certainly the performance on the laptop wasn’t affected by these. It could be that there is some deep issue with running code on Mac vs. Linux (and there are some other threads on this), but I think I’ll give the “solved” option to @Henrique_Becker even though the solutions from @stillyslalom and @jacobadenbaum probably also helped and will be useful for anyone else having this issue.
Thank you to everyone who helped out! I really appreciate this community!
I work as an HPC Engineer for Dell, so very happy to discuss process pinning and BIOS settings if I can help.
numactl --hardware will display your system’s NUMA zones
numastat will show NUMA misses, i.e. memory accesses that had to go across to another NUMA zone to fetch data, which takes more time (hence non-uniform)
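A small sketch for running both checks on a compute node, skipping whichever tool is not installed. The helper name run_if_present is made up for this example.

```shell
# Run a diagnostic if the tool is installed, otherwise say so.
# run_if_present is a hypothetical helper, not part of any package.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@" || echo "$1 exited with an error"
    else
        echo "$1 not installed, skipping"
    fi
}

run_if_present numactl --hardware   # NUMA nodes with their CPUs and memory
run_if_present numastat             # per-node counters such as numa_hit and numa_miss
```

Comparing the numastat counters before and after a run gives a rough idea of how much of the job's memory traffic crossed NUMA zones.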
If anyone is using modern processors like Cascade Lake, Ice Lake and AMD Rome/Milan, there are BIOS settings which affect how the CPU is arranged into NUMA domains
This can affect performance.
Intel - Sub-NUMA Cluster in the BIOS
I had the same problem. In my case, this discussion clarified the issue.
I use workers (via Distributed) for the more expensive functions now. The less expensive functions are handled by threading via ThreadsX.map() from the package ThreadsX.jl.