Shared-memory parallelization with large matrix

@cshenton Thanks for your suggestions!

Yes, I do set JULIA_NUM_THREADS=20, and the program is running with the correct number of threads.
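
For anyone wanting to check the same thing, here is a minimal sketch of one way to confirm the thread count from inside Julia. It assumes Julia was started with JULIA_NUM_THREADS set in the environment. The array name `ids` and the loop length are just illustrative choices, not something from my actual program.

```julia
using Base.Threads

# Number of threads this Julia session was started with.
println("threads: ", nthreads())

# Record which thread runs each iteration, to see whether the
# loop is actually spread across threads (illustrative only).
ids = zeros(Int, 4 * nthreads())
@threads for i in eachindex(ids)
    ids[i] = threadid()
end
println("distinct threads used: ", length(unique(ids)))
```

With JULIA_NUM_THREADS=20 the first line should report 20. The number of distinct threads in the second line depends on how the scheduler distributes the iterations.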