Improve performance of matrix computation

Ah, my fault, I forgot that @parallel without a reduction is actually asynchronous, so you need @sync @parallel. With syncing, however, data transfer becomes the bottleneck and the parallel version takes even longer than the serial one.
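Here is a minimal sketch of what I mean, not your actual computation: a hypothetical column-wise fill of a SharedArray. In Julia ≥ 0.7 the @parallel macro was renamed @distributed and lives in the Distributed stdlib, which is what I use below; `sin(i) * cos(j)` is just a placeholder for the real per-element work.

```julia
using Distributed
addprocs(4)                        # assumption: 4 worker processes
@everywhere using SharedArrays

n = 2_000
A = SharedArray{Float64}(n, n)     # shared memory, visible to all workers

# Without @sync the loop returns immediately; @sync waits for every worker,
# which is where the synchronization cost shows up.
@sync @distributed for j in 1:n
    for i in 1:n
        A[i, j] = sin(i) * cos(j)  # placeholder for the real per-element work
    end
end
```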

One option is to try threading instead: tell Julia how many threads to use before starting it (in a terminal):

export JULIA_NUM_THREADS=4

and change @parallel to Threads.@threads and the SharedArray to a normal Array, as in the sketch below.
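Here is the same hypothetical fill with threads; since all threads live in one process and share memory, a plain Array works and nothing needs to be copied between workers:

```julia
n = 2_000
A = Array{Float64}(undef, n, n)

Threads.@threads for j in 1:n      # columns are split across the threads
    for i in 1:n
        A[i, j] = sin(i) * cos(j)  # placeholder for the real per-element work
    end
end
```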

Overall, though, I’d bet on a correct memory layout (Julia arrays are column-major, so iterate down columns in the inner loop) and on broadcasting, which fuses elementwise operations and often lets them be vectorized automatically. My latest example f3 seems to be broken right now (sorry for that, I’ll try to fix it later today), but you should get the general idea.
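In the meantime, here is a small sketch of that point (not f3 itself, and the kernel is just a stand-in): the explicit loop keeps the row index innermost so memory is traversed in column order, and the broadcast expresses the same computation as one fused pass.

```julia
n = 2_000
x = rand(n)
y = rand(n)

# Cache-friendly loops: j (columns) outermost, i (rows) innermost,
# because Julia stores arrays column-major.
A = Array{Float64}(undef, n, n)
for j in 1:n, i in 1:n
    A[i, j] = x[i] * y[j]          # hypothetical stand-in for the real kernel
end

# The same result as a single fused broadcast: one pass over A, no temporaries.
B = x .* y'
@assert A ≈ B
```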