Poor performance multiplying many (large) matrices multithreaded

Yes, BLAS’s threads don’t play nicely with Julia’s threads (yet). Large matrix multiplication is already multithreaded inside BLAS, but there’s a heuristic that turns that off for sufficiently small matrices. I don’t recall the exact cutoffs, but that may explain the differences you’re seeing. If you’re going to rely on Julia’s threading at a higher level instead, just call BLAS.set_num_threads(1) (BLAS lives in the LinearAlgebra standard library) so each individual multiplication runs single-threaded.
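For illustration, here is a minimal sketch of that setup. The matrix size, the number of matrices, and the helper name multiply_all! are made up for the example and are not from the original question. It assumes Julia was started with more than one thread (for example with julia -t 4); otherwise the loop simply runs serially.

```julia
using LinearAlgebra  # provides BLAS.set_num_threads and mul!

# Keep every BLAS call single-threaded so it does not compete with Julia's threads.
BLAS.set_num_threads(1)

# Hypothetical helper: compute Cs[i] = As[i] * Bs[i] for all i,
# spreading the independent products across Julia threads.
function multiply_all!(Cs, As, Bs)
    Threads.@threads for i in eachindex(Cs, As, Bs)
        # Each iteration writes only to its own output matrix, so there is no data race.
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end

# Illustrative sizes only; tune them to your actual workload.
n, nmats = 200, 16
As = [rand(n, n) for _ in 1:nmats]
Bs = [rand(n, n) for _ in 1:nmats]
Cs = [Matrix{Float64}(undef, n, n) for _ in 1:nmats]

multiply_all!(Cs, As, Bs)
```

Whether this beats letting BLAS thread each product on its own depends on the matrix sizes and the machine, so it is worth timing both variants on your actual problem.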
