How can I control threads at a lower level inside a multithreaded function?

Not an answer to your question, but there’s some discussion of the (lack of) speedup here: Again on reaching optimal parallel scaling - #5 by carstenbauer