Parallel assembly of a finite element sparse matrix

Also, I think I solved the problem with the tasks: I think I need to start (N+1) threads in order to have N tasks. Apparently, one thread needs to be the main to carry the spine of the computation. If I do this, the tasks finish in more or less the same time (there is still a bit of variation, on the order of 10 %, but nothing like the huge differences between fast and slow tasks before).

This is the speed up of Firenze/Windows 10/with tasks.

Edit: The tasks take about 0.03 sec to start and another 0.03 seconds to spawn. So that can add up.