Fully parallelized for-loop becomes slower with more threads

Basically the title says it all: I have a for-loop that is fully parallelizable, i.e. all iterations of the loop can be computed independently. They only need access to the same multidimensional array, but I make sure that no two iterations ever touch the same index.

Now for the first few threads, this loop becomes faster the more threads I add. However, beyond a certain number of threads, it becomes slower… I understand that I cannot expect a linear speedup with an increasing number of threads, but a slowdown is surprising to me… Any idea why that is?

Since using threads comes with its own overhead, it is possible to observe a slowdown when the workload for each thread is not high enough. If this is not the case, I imagine there could still be a slowdown if many allocations happen within the loop, as these will then happen on every thread. It's hard to say more without more concrete information, but it would be worth putting together a minimal working example and looking at the times and allocations from benchmarks.
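As a starting point, such a minimal working example could look roughly like the sketch below. Everything in it is made up for illustration: `work`, `serial!`, `threaded!` and the array size are placeholders, not your actual code. It only uses `@elapsed` and `@allocated` from Base to compare a serial and a threaded version of a loop that writes to distinct indices.

```julia
using Base.Threads

# Hypothetical per-element workload; replace with the real loop body.
work(x) = sin(x)^2 + cos(x)^2

function serial!(out, xs)
    for i in eachindex(out, xs)
        out[i] = work(xs[i])
    end
    return out
end

function threaded!(out, xs)
    # Each iteration writes only to its own index, so no two threads
    # ever touch the same element.
    @threads for i in eachindex(out, xs)
        out[i] = work(xs[i])
    end
    return out
end

xs = rand(10^6)
out_serial = similar(xs)
out_threaded = similar(xs)

# Run both once first so that compilation time is not measured.
serial!(out_serial, xs)
threaded!(out_threaded, xs)

t_serial = @elapsed serial!(out_serial, xs)
t_threaded = @elapsed threaded!(out_threaded, xs)
bytes_threaded = @allocated threaded!(out_threaded, xs)

println("threads:  ", nthreads())
println("serial:   ", t_serial, " s")
println("threaded: ", t_threaded, " s, ", bytes_threaded, " bytes allocated")
```

Running this with different thread counts (e.g. `julia -t 1`, `julia -t 2`, `julia -t 4`, ...) and then gradually making `work` look more like the real loop body might show at which point the slowdown appears. For more reliable timings, a benchmarking package such as BenchmarkTools.jl would probably be a better choice than a single `@elapsed` call.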

You’re right, I should work out a minimal working example. If I test it with a very simple comparable case, I don’t get these results, so there must be something code-specific that I cannot see, as I followed all the points in the performance guide… In particular, the only allocation that I have in the whole loop comes from the @threads macro, so that cannot be the problem… But I’ll try to work out a minimal working example.