Julia code becomes slower when running on supercomputers and does not scale well when parallelizing with Base.Threads

This is a good article on the Skylake architecture

https://www.nas.nasa.gov/hecc/support/kb/skylake-processors_550.html

I have not followed the whole thread in detail, but if there are performance issues or unexpected performance characteristics with multithreaded Julia code where matrix multiplication takes up a significant fraction of the time, the following may be relevant:

1 Like

See also the manual’s section on argument-type declarations.

It’s also not necessary for const globals. const i::Int = 1 is equivalent to const i = 1, since Julia already knows the type of i from the value 1.
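To illustrate the point, a minimal sketch (the typed-const syntax requires Julia 1.8 or later):

```julia
# Both declarations give the compiler the same information:
const a = 1        # the type Int is inferred from the literal value
const b::Int = 1   # explicit annotation, redundant here (Julia ≥ 1.8 syntax)

f() = a + b        # accesses to const globals are type-stable either way
```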

1 Like

This reply is aimed at updating everyone following this thread on the current situation:

  1. As rightly pointed out by @lmiq, repeated allocation of entirely new arrays, as implemented in my older code, was a problem. I have modified it as per their suggestions. I shall update this thread with the latest benchmarks when running the newer version of the code. We hope to see better scaling.

  2. As pointed out by @johnh, getting an idea of the architecture of the cluster I am using might turn out to be important and should be done before trying to run the code on AMD processors. I have contacted the system admins of paramshakti to allow me to use the lstopo utility. I shall share the output (in the form of a graphical image) on this thread. I have run the command on my personal machine just as an example; please have a look at the attached picture to verify that this is what you want me to do on the supercomputer.

  3. If I understand @johnh’s point regarding ThreadPinning.jl correctly, the supercomputer I am using has hyperthreading enabled, and that might have something to do with the way the performance evolves. The advice, if I understood it correctly, was to use that package to tell Julia to run only one thread per physical CPU core (i.e., to avoid hyperthreads). If my interpretation is right, I shall try to “pin” each thread to a physically different CPU core and report the changes in performance (see the sketch after this list).

  4. Fortran’s BLAS operations might be doing some work in parallel even when we did not explicitly ask for it, similar to what happens in Julia when we don’t call BLAS.set_num_threads(1). So the advice here was to run htop or top while the job is running on the supercomputer and report the CPU usage in both cases.
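For reference, a minimal sketch of what that setup could look like, assuming ThreadPinning.jl’s pinthreads API (the pinning strategy and the serial-BLAS setting are the two knobs discussed above):

```julia
using LinearAlgebra
using ThreadPinning

# Keep BLAS serial so its threads don't oversubscribe the Julia threads.
BLAS.set_num_threads(1)

# Pin each Julia thread to its own physical core, skipping hyperthreads.
pinthreads(:cores)

# Print an overview of which thread sits on which core.
threadinfo()
```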

3 Likes

UPDATE 1 - As per the suggestion of @lmiq, I tried the memory-efficient modification of the original code. It turns out the entire code now takes about 33 mins to run, compared to ~45 mins for the Fortran code, when using dd = 20001! So yes, using more memory-efficient code had a huge positive impact on performance!

1 Like

Not to put too fine a point on it, but this was exactly what you were being told in the first thread you opened on this:

The performance of parallel code can change drastically depending on all sorts of optimizations you can make to its simple serial execution, so you really need to make sure your code is as efficient (and allocation-free) as possible before you run elaborate experiments on supercomputers.
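For instance, you can check the allocation behaviour of the hot path locally before any cluster run. A minimal sketch using BenchmarkTools (the kernel functions here are hypothetical stand-ins, not the code from this thread):

```julia
using LinearAlgebra, BenchmarkTools

step_alloc(A, B) = A * B                 # allocates a fresh result each call
step_inplace!(C, A, B) = mul!(C, A, B)   # writes into a preallocated C

A, B = rand(100, 100), rand(100, 100)
C = similar(A)

@btime step_alloc($A, $B)          # note the reported allocations
@btime step_inplace!($C, $A, $B)   # should report 0 allocations
```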

4 Likes

I had a small question regarding this. Rewriting memory (in this case, reusing the matrices) also takes time, right? Suppose, hypothetically, that in one code I reuse the memory of an array at every iteration (manually setting all values to zero at the start) instead of allocating a fresh array with, say, M = zeros(N,N)…

Would the code that reuses memory always be faster than one that is very allocation-heavy? Is it a general rule to always reduce the number of allocations and reuse memory to get the best performance? If not, how do we decide whether to lean toward fresh allocations or memory reuse?

Re-using memory on the same thread (CPU-Core) is always better than allocating fresh from heap. If re-use crosses cores, then it is still almost always better than allocating fresh.
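To make that concrete, a small benchmark sketch of the two strategies from the question above (function names are illustrative):

```julia
using BenchmarkTools

const N = 256

# Allocation-heavy: a fresh matrix on every call.
function fresh()
    M = zeros(N, N)     # heap allocation each iteration
    M[1, 1] = 1.0
    return M[1, 1]
end

# Reuse: zero an existing buffer instead of allocating a new one.
function reuse!(M)
    fill!(M, 0.0)       # rewrites the same memory; still costs time,
    M[1, 1] = 1.0       # but no allocation and no later GC work
    return M[1, 1]
end

buf = zeros(N, N)
@btime fresh()          # allocates N*N*8 bytes per call
@btime reuse!($buf)     # zero allocations
```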

There are edge-cases for latency-critical code where you really don’t care about throughput, but you wouldn’t use julia for that. Stack / arena allocation is potentially better than re-use (also with custom non-OS stacks like Bumper.jl).

Register allocation is definitely better than re-use. Use this with care – naively this looks exactly the same as fresh heap allocation, i.e. you need a good feel for the compiler to be able to use that technique.

An example would be: Short-lived Ref or view can often be turned into registers, and this is better than re-using a heap-allocated Ref (keywords: escape analysis, scalar replacement of aggregates, memSSA). But you are at the mercy of the optimizer to understand that the Ref is short-lived; if the compiler doesn’t understand this, then you will end up with fresh allocated objects, which is far worse than reusing a long-lived allocated Ref.
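As an illustration, a pattern where a short-lived Ref can be optimized away (whether the allocation is actually elided depends on the Julia version and the optimizer, so check with @allocated rather than assuming):

```julia
function sum_to(n)
    acc = Ref(0)        # looks like a heap allocation...
    for i in 1:n
        acc[] += i
    end
    return acc[]        # ...but since acc never escapes, the compiler
end                     # can often keep it in a register

sum_to(10)              # warm up (compile first)
@allocated sum_to(10)   # 0 if the Ref was elided
```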

Most of the time, the main trade-off between re-use and fresh allocation is performance against code readability, bugs and programmer productivity (that’s why julia has garbage collection!), not performance against performance.

Allocation-free does not need to be your goal. The problems to tackle are

  1. Many large allocations
  2. Many small allocations
  3. Few large allocations
  4. Few small allocations – if you reached this point of the list, there is often no perceptible performance gain from the often significant work in sussing out the remaining allocs. Very often you would need to mutilate your code to do that, sometimes to the detriment of performance.
3 Likes

I’m sure you can come up with examples where allocating new memory is a good thing (certainly I’ve seen cases where e.g. using @view decreased performance), so it’s always worth benchmarking. But I’d say that for the average user like you and me, “reducing allocations = good” is a sound rule of thumb when writing code.

I’m not an expert on this at all, but at least my mental model is that (1) allocating memory on the heap takes time, and (2) Julia’s GC needs to track heap-allocated objects, which also takes time and scales badly.

Oh I see there’s already an expert answer that just came in so I’ll stop here :smiley:

2 Likes

This is a very good point that I forgot to mention!

If you are doing a lot of work with some data, then you should of course put it into an ideal memory layout for your work, even at the cost of allocating space for this re-shuffled copy of the data. “A lot of work” is relative to the cost of allocating and filling the re-shuffled copy versus the time saved by the better layout – you will need to develop a feel for that by trying things out and measuring.

As a special case of views, a perennial example is materializing transposes: Do you work with A = data' or do you work with A = copy(data')? Depends on the workload, but the extra alloc can very often amortize (and if there is an outer loop, you can also re-use the memory in A).
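For illustration, one way to compare the two choices on a column-wise workload (the sizes and the workload are arbitrary; measure your own):

```julia
using BenchmarkTools

data = rand(2000, 2000)

A_lazy = data'        # lazy Adjoint wrapper: free now, strided access later
A_mat  = copy(data')  # one allocation + copy now, contiguous columns later

# Column-wise work: columns of the materialized copy are contiguous and
# cache-friendly; columns of the lazy adjoint are rows of `data`.
colwork(A) = sum(sum, eachcol(A))

@btime colwork($A_lazy)
@btime colwork($A_mat)
```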

3 Likes

Re-using memory on the same thread (CPU-Core) is always better than allocating fresh from heap. If re-use crosses cores, then it is still almost always better than allocating fresh.

@Uranium238 that is why you should look at the lstopo output, or the numactl output.
Look at the caches - that is how CPUs access memory. If re-use “crosses” cores, as termed here, you get cache misses, and these take more time. The NUMA tools can tell you the rate of cache misses.
That is why I keep going on about thread pinning. However, as we have seen here, concentrating on getting the single-threaded code sorted out is the thing to do first.

1 Like

One note about the optimizations in this specific case: when measuring running time alone, the gains from better memory management were minor in the serial version, or even when threading over a small number of cores. This is because the computation is completely dominated by a matrix multiplication. However, the excessive allocations started to become a problem when parallelizing over more threads, as each thread was allocating a lot, and thus the garbage collector had to run often to keep up.

Thus, what happened here was that something that was not performance-critical in the serial version became performance-critical in the threaded version over many cores. This is why the performance appeared good enough compared to the Fortran version locally, but became bad in the cluster comparison. (By the way, as far as I checked, the Fortran code did reuse memory in the same way the Julia version is doing now.)
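Schematically, the fix amounts to giving each parallel task its own preallocated buffer instead of allocating inside the hot loop. A hedged sketch (the kernel is a placeholder, not the actual code from this thread; chunking avoids relying on threadid() for buffer indexing):

```julia
using LinearAlgebra

# One buffer per task: split the indices into nthreads() chunks and let
# each task reuse a single buffer across its whole chunk.
function run_reuse!(results, inputs)
    nchunks = Threads.nthreads()
    chunks = Iterators.partition(eachindex(inputs),
                                 cld(length(inputs), nchunks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            buf = similar(first(inputs))   # one allocation per task,
            for k in chunk                 # not one per iteration
                mul!(buf, inputs[k], inputs[k])
                results[k] = sum(buf)
            end
        end
    end
    foreach(wait, tasks)
    return results
end

inputs = [rand(200, 200) for _ in 1:64]
results = zeros(length(inputs))
run_reuse!(results, inputs)
```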

2 Likes

@Uranium238
For supercomputer-specific questions, listen to @johnh instead of me.

Supercomputers are not just “larger / more computers”; their defining feature is a fancy interconnect and topology, and they require different tools (e.g. lstopo), algorithms, care, and intuitions than laptops / desktops / workstations / servers / cloud.

I have no real experience with or access to that; “normal” computers only for me. Stuff that can be done on non-super computers is cheaper to run on non-super computers.

FWIW, your problem looks like it could run on normal computers (or a cluster of off-the-shelf normal computers), not a real supercomputing job.

This post may be useful to you:
https://viralinstruction.com/posts/hardware/

Essentially, the problem here is not memory access but memory management. There is also the matter of caching: your processors keep a copy of a small amount of memory nearby for really fast access. If you are constantly changing what memory you are using, you cannot effectively take advantage of caching.
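As a standard illustration of cache effects (not specific to this thread’s code): Julia arrays are column-major, so the traversal order alone changes how well the cache is used.

```julia
using BenchmarkTools

A = rand(4096, 4096)

# Inner loop walks down columns: consecutive memory, cache-friendly.
function sum_colmajor(A)
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end

# Inner loop walks along rows: strided access, poor cache use.
function sum_rowmajor(A)
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end

@btime sum_colmajor($A)
@btime sum_rowmajor($A)
```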

3 Likes