Shared-memory parallelization with large matrix

@tkluck I am running the program on two Intel Xeon E5-2650 v3 processors (which allow up to 20 tasks/threads to run in parallel).