Profiling multi-threaded code

Hello all,
I would like to ask for suggestions or recommendations on profiling multi-threaded code in Julia. The single-threaded version works fine, but after turning on multi-threading I see hardly any improvement. According to htop, the threads spend most of their time not doing any useful computation.
I have found this recent paper about profiling

but I am not sure whether that tool could be used here.
A related question: would the Intel VTune profiler help?
Thank you in advance for any answers.
Tomas Pevny

A minimal working example would help. A lack of improvement could indicate that the threads are waiting for some common resource, or for the result of a computation that is done in a single thread, but it is hard to say without seeing the code.

Without seeing your code, it is impossible to know what’s going on. However, there is a known issue that can cause this. Have you checked whether it’s due to an inference bug, for example by using function barriers?

Unfortunately, I cannot provide a simple working example: the code parallelises the calculation of the gradient of a neural network, and I can publish neither the library nor the data.
That is why I am asking whether there is a general way to find a bottleneck in such code, something similar to the profiler available for single-threaded applications.

I would be curious whether Intel VTune Amplifier would be of any help. I am sure there is a bottleneck somewhere that completely kills the parallelisation, but I do not know where.
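As a general starting point, here is a minimal sketch using Julia's built-in sampling profiler. It assumes Julia 1.8 or newer, where `Profile.print` accepts `groupby = :thread`; the functions `busy` and `run_threaded` are hypothetical stand-ins for the real workload, not anything from the code discussed here.

```julia
using Profile

# Hypothetical CPU-bound kernel standing in for the real per-item work.
function busy(n)
    s = 0.0
    for k in 1:n
        s += sin(k)
    end
    return s
end

# Hypothetical threaded driver: one call to `busy` per loop iteration.
function run_threaded(m, n)
    out = zeros(m)
    Threads.@threads for i in 1:m
        out[i] = busy(n)
    end
    return out
end

run_threaded(4, 1_000)                    # run once so compilation is not profiled
Profile.clear()
@profile run_threaded(32, 1_000_000)

# Group the collected samples by thread (assumes Julia >= 1.8).
Profile.print(groupby = :thread, maxdepth = 12)
```

Start Julia with several threads (for example `julia -t 4`) for this to be informative. If one thread collects most of the samples while the others show very few, that may hint at a serial section or at threads waiting on each other.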

Use @code_warntype first and see if there’s an inference problem. If so, you might be running into what I linked, or one of the issues linked from that issue. The workarounds are also given in that issue.
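To make this concrete, here is a minimal sketch of checking for an inference problem and of applying a function barrier. The functions `unstable_sum`, `kernel`, and `barrier_sum` are made up for illustration and are not taken from the code in question.

```julia
using InteractiveUtils  # provides @code_warntype outside the REPL

# Hypothetical example: `data` comes out of a Vector{Any}, so the compiler
# cannot infer its type and the loop below runs with dynamic dispatch.
function unstable_sum(container::Vector{Any})
    data = container[1]          # inferred as Any
    s = 0.0
    for x in data
        s += x
    end
    return s
end

# Function barrier: the hot loop lives in its own function, which is
# compiled for the concrete type of `data` it is called with.
function kernel(data)
    s = zero(eltype(data))
    for x in data
        s += x
    end
    return s
end

barrier_sum(container::Vector{Any}) = kernel(container[1])

container = Any[rand(1000)]
@code_warntype unstable_sum(container)   # look for `Any` in the printed output
@code_warntype kernel(container[1])      # should show only concrete types
```

Both versions return the same value; the difference is only in what the compiler can infer inside the loop.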

Hi,
thanks for the suggestions. I have tried @code_warntype and it came out clean.
I have been experimenting further with the threading, and I found that if I set the number of threads used by OpenBLAS to one, I do see a speed improvement. My guess is that the OpenBLAS threads and the Julia threads were competing for the same cores, which caused the overhead.
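For reference, a minimal sketch of that setting. `BLAS.set_num_threads` is part of the `LinearAlgebra` standard library; `BLAS.get_num_threads` assumes Julia 1.6 or newer. The function `threaded_products` and the matrix sizes are made up for illustration.

```julia
using LinearAlgebra

# Keep BLAS single-threaded so its threads do not compete with Julia's threads.
BLAS.set_num_threads(1)

# Hypothetical workload: one matrix product per loop iteration.
function threaded_products(As, Bs)
    out = Vector{Matrix{Float64}}(undef, length(As))
    Threads.@threads for i in eachindex(As)
        out[i] = As[i] * Bs[i]
    end
    return out
end

As = [rand(64, 64) for _ in 1:8]
Bs = [rand(64, 64) for _ in 1:8]
results = threaded_products(As, Bs)

println("Julia threads: ", Threads.nthreads())
println("BLAS threads:  ", BLAS.get_num_threads())   # assumes Julia >= 1.6
```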
So in the end, I have seen about a two-fold improvement. What is weird is that as the algorithm progresses, the speed-up suddenly drops, i.e. the speed falls back to that of the single-threaded case or even below it.
In the single-threaded application I do not see any issue like this, which makes it really hard to debug.
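One thing that could be checked here, purely as an assumption, is whether garbage collection accounts for the slowdown: Julia's collector pauses all threads while it runs, so allocation-heavy threaded code can lose its speed-up. A minimal sketch that logs the time and the GC share of each iteration follows; `step!` is a hypothetical stand-in for one step of the real algorithm, and the named-tuple fields of `@timed` assume Julia 1.5 or newer.

```julia
# Hypothetical stand-in for one step of the algorithm; it allocates on purpose.
function step!(buffers)
    Threads.@threads for i in eachindex(buffers)
        buffers[i] = rand(10_000) .* 2.0
    end
    return nothing
end

buffers = Vector{Vector{Float64}}(undef, 16)
gc_fraction = Float64[]

for iter in 1:5
    stats = @timed step!(buffers)          # fields: value, time, bytes, gctime, ...
    push!(gc_fraction, stats.gctime / max(stats.time, eps()))
    println("iter ", iter, ": ", round(stats.time; digits = 4),
            " s, GC share ", round(gc_fraction[end]; digits = 2))
end
```

If the GC share grows at the same point where the speed-up disappears, reducing allocations inside the threaded loop would be a reasonable next experiment.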