Parallel assembly of a finite element sparse matrix

Further data for the same problem. This machine:
4 sockets, 64 cores, AMD Opteron™ Processor 6380,
RAM = 252 GB, Cache: L3 = 6 MB, L2 = 2048 KiB, L1 = 16 KB

Again, the thread-based loop.
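
The loop itself is not reproduced in this post. For readers who have not seen the pattern, here is a minimal sketch of one common way to write a thread-based assembly loop. It is an illustration under stated assumptions, not the code that was timed. Every name in it (`element_matrix`, `assemble_chunk`, `assemble`) is hypothetical, and the 1D two-node element is chosen only to keep the example short. Each element owns a fixed block of slots in preallocated triplet (COO) buffers, so threads can write without locking, and duplicate entries are summed serially afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def element_matrix(h):
    # Hypothetical element: 2x2 stiffness matrix of a 1D linear element of length h.
    k = 1.0 / h
    return ((k, -k), (-k, k))


def assemble_chunk(first, last, h, rows, cols, vals):
    # Element e writes only to slots 4*e .. 4*e+3, so chunks never overlap
    # and no synchronization is needed between threads.
    for e in range(first, last):
        ke = element_matrix(h)
        dofs = (e, e + 1)  # global degrees of freedom of element e
        slot = 4 * e
        for a in range(2):
            for b in range(2):
                rows[slot] = dofs[a]
                cols[slot] = dofs[b]
                vals[slot] = ke[a][b]
                slot += 1


def assemble(n_elements, n_threads, h=1.0):
    # Preallocate the triplet buffers once, before any thread starts.
    n_triplets = 4 * n_elements
    rows = [0] * n_triplets
    cols = [0] * n_triplets
    vals = [0.0] * n_triplets
    # Split the element range into contiguous chunks, one per task.
    bounds = [n_elements * t // n_threads for t in range(n_threads + 1)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(assemble_chunk, bounds[t], bounds[t + 1], h, rows, cols, vals)
            for t in range(n_threads)
        ]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    # Serial reduction: sum duplicate (row, col) entries into one value each.
    matrix = {}
    for r, c, v in zip(rows, cols, vals):
        matrix[(r, c)] = matrix.get((r, c), 0.0) + v
    return matrix


K = assemble(n_elements=8, n_threads=4)
```

In standard CPython the global interpreter lock serializes pure-Python bytecode, so this sketch shows the data layout and chunking rather than any speedup. The same structure carries over to languages with true shared-memory threads.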

Edit: Extended to more threads.