Not directly related to the original question on MT scaling, but regarding the specific example (multiple tridiagonal systems of the same size, nc=3200, to be solved): you can get a faster solution with the Thomas algorithm (without pre-factorization) applied in a SIMD manner on blocks of systems (you have to adapt the data layout). You can see these slides and this paper. Note that this part, at least, can easily be ported to GPU.
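To make the layout point concrete, here is a minimal sketch of a batched Thomas solve. It assumes the system index is the first (contiguous) array dimension, so the inner loop runs over independent systems and can be vectorized. The function name `thomas_batched!`, the argument names, and the scratch array `cp` are hypothetical and chosen for illustration; this is not taken from the slides or the paper.

```julia
# Batched Thomas algorithm (sketch): solves nsys independent tridiagonal
# systems of size n in one pass.
# Assumed layout: all arrays are nsys x n, system index first (contiguous).
#   a = sub-diagonal   (a[s, 1] is never read)
#   b = main diagonal
#   c = super-diagonal (c[s, n] is read but does not affect the result)
#   d = right-hand side
#   cp = scratch array for the modified super-diagonal
# No pivoting is done, so the systems are assumed to be well conditioned
# for plain elimination (e.g. diagonally dominant).
function thomas_batched!(x, a, b, c, d, cp)
    nsys, n = size(d)

    # First row of the forward sweep, for all systems at once.
    @inbounds @simd for s in 1:nsys
        cp[s, 1] = c[s, 1] / b[s, 1]
        x[s, 1] = d[s, 1] / b[s, 1]
    end

    # Forward elimination: sequential in i, independent (SIMD) in s.
    @inbounds for i in 2:n
        @simd for s in 1:nsys
            denom = b[s, i] - a[s, i] * cp[s, i-1]
            cp[s, i] = c[s, i] / denom
            x[s, i] = (d[s, i] - a[s, i] * x[s, i-1]) / denom
        end
    end

    # Back substitution: again sequential in i, independent in s.
    @inbounds for i in n-1:-1:1
        @simd for s in 1:nsys
            x[s, i] -= cp[s, i] * x[s, i+1]
        end
    end
    return x
end
```

A block of, say, 8 systems of size 3200 would then be stored as `8 x 3200` arrays and solved with `thomas_batched!(x, a, b, c, d, cp)`, where `x` and `cp` are preallocated with `similar(d)`. Whether this beats a pre-factorized solve depends on the block size and hardware, so it is worth benchmarking.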
This seems like a highly relevant discussion: Overhead of `Threads.@threads` - #29 by Elrod