Can't understand what LoopVectorization is doing

I’ve stumbled upon something which looks like black magic to me.

I am solving an ODE whose right-hand-side function needs to evaluate a large sum, which can be parallelized across a number of threads. I replaced the Threads.@threads macro in front of the for loop with the @tturbo macro and got a 20x speedup. However, when I look at the output of the top command, it shows that @tturbo uses only a single core, despite julia being started with the -t nthreads argument. At the same time, if I look at the top output for my old code with Threads.@threads, I see that all the cores are used.
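For concreteness, here is a minimal sketch of the kind of change I made (the `rhs_threads!`/`rhs_tturbo!` names and the `sin`-based kernel are made up for illustration, not my actual code):

```julia
using LoopVectorization  # provides @turbo / @tturbo

# Threaded version (what I had before):
function rhs_threads!(du, u)
    Threads.@threads for i in eachindex(du)
        s = 0.0
        for j in eachindex(u)
            s += sin(u[j] - u[i])  # stand-in for the large sum
        end
        du[i] = s
    end
    return du
end

# @tturbo version (the replacement):
function rhs_tturbo!(du, u)
    @tturbo for i in eachindex(du)
        s = 0.0
        for j in eachindex(u)
            s += sin(u[j] - u[i])
        end
        du[i] = s
    end
    return du
end
```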

I have compared the ODE solutions that I obtain in both cases, and they agree to within reasonable accuracy.

To summarize: on my 64-core cluster, @tturbo outperforms Threads.@threads by a factor of around 20, although it uses a single core instead of 64 (according to top). How is this possible? Why does it use only a single core?


Because LV has determined that, for your workload, multi-threading is counterproductive, and it seems its judgement is correct.


And now you know why some people wanted to call it things like @magic instead!


Adding onto LoopVectorization deciding that threading is not worth the effort: @turbo (and @tturbo) probably SIMDs the loop much better than the plain loop with Threads.@threads did. As far as I know, the code is only actually run on multiple threads if adding threads on top of the SIMDed @turbo version would improve the runtime.


How does LoopVectorization estimate that SIMD is enough and that adding threads is going to ruin the performance? I would like to learn to do this kind of estimate myself.


That’s actually a very good point. I would also love to have some additional documentation / reading material on this subject.

It’s the other way around: it does SIMD all the time and only adds threading if the iteration count is large and unrolling doesn’t help enough. It decides that based on a heuristic model of how long vector operations take on a given architecture, plus an estimate of how much overhead threads have. The details are… (edit: very) complicated.

There’s also some info here.

Sadly, due to the sheer size of modern instruction sets (RISC, hah! RISC is the new CISC… I’d like more RISC-V :pensive:), doing this sort of thing manually is not feasible. Also, high-level surface syntax doesn’t map cleanly to CPU instructions (and anyone still thinking it does, even, or especially, for C, is ignoring reality).


Yes, that’s more or less correct.

LoopVectorization also uses the size of your loops to estimate whether it is worth threading.
Threading adds overhead, so based on how expensive it thinks an iteration is, it guesses each thread must have at least X iterations to be worth it. As you make a problem bigger, it’ll start using more threads.
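You can watch this threshold behavior on your own machine with a rough benchmark sketch (`axpy_t!` is a made-up kernel; the crossover point is machine-dependent, so don’t read too much into exact sizes):

```julia
using LoopVectorization, BenchmarkTools

# Trivial kernel just to watch LV's threading threshold.
function axpy_t!(y, a, x)
    @tturbo for i in eachindex(y)
        y[i] += a * x[i]
    end
    return y
end

for n in (10^2, 10^4, 10^7)
    x = rand(n); y = rand(n)
    print("n = $n: ")
    # small n: stays single-threaded; large n: LV may start using threads
    @btime axpy_t!($y, 2.0, $x)
end
```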

Also, something to keep in mind: Threads.@threads often introduces type instabilities.
And when you write @inbounds Threads.@threads, the @inbounds will not apply to the code inside the @threads, because that code gets wrapped in a closure. These are two possible “gotchas” that can sometimes lead to bad performance with @threads.
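A sketch of that second gotcha (the `sum_rows_bad!`/`sum_rows_good!` names are made up for illustration):

```julia
# `@inbounds` placed *outside* Threads.@threads does not reach the loop body,
# because @threads wraps the body in a closure:
function sum_rows_bad!(out, A)
    @inbounds Threads.@threads for i in axes(A, 1)
        s = 0.0
        for j in axes(A, 2)
            s += A[i, j]  # still bounds-checked
        end
        out[i] = s
    end
    return out
end

# Put @inbounds *inside* the @threads body instead:
function sum_rows_good!(out, A)
    Threads.@threads for i in axes(A, 1)
        @inbounds begin
            s = 0.0
            for j in axes(A, 2)
                s += A[i, j]
            end
            out[i] = s
        end
    end
    return out
end
```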

Also, are you autodiffing (e.g. ForwardDiff-ing) the @tturbo code? If so, it could also be that the fallback @inbounds @fastmath loop is running instead of the @tturbo loop, and that’s why it’s single-threaded. Because of the aforementioned gotchas with @threads, it is still possible that not using threads is faster, even for a loop with a lot of iterations.
The most recent versions of LoopVectorization (>= 0.12.67) will warn you by default if this is the case to make sure you’re aware of it. You can disable the warning via warn_check_args=false, e.g. @tturbo warn_check_args=false for ....
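If you want to check ahead of time whether LV will take the fast path for given arguments, `LoopVectorization.check_args` is handy (a sketch; the `Dual` construction is just an assumption about what ForwardDiff feeds your RHS):

```julia
using LoopVectorization, ForwardDiff

x  = rand(8)
xd = ForwardDiff.Dual.(x, 1.0)  # the element type ForwardDiff pushes through your function

LoopVectorization.check_args(x)   # true:  the @tturbo loop will run
LoopVectorization.check_args(xd)  # false: the fallback @inbounds @fastmath loop runs instead
```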

Otherwise, if the loop was fast enough that LV did not consider threading profitable, then trying to use 64 threads is probably overkill and will hurt performance (contributing to the 20x speed difference).