Also, do you know how many physical cores your CPU has? If it has 24 threads, it likely has only 12 physical cores, in which case many workloads, including BLAS and LAPACK, will see sharply diminishing returns beyond 1200% CPU usage (that is, 12 fully busy threads).
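If you're not sure, here is a rough way to check. This is only a sketch, and it assumes a Linux machine whose `/proc/cpuinfo` reports `physical id` and `core id` fields (typical on x86, not guaranteed elsewhere). The helper name `count_physical_cores` is made up for this example. A result of 0 means the fields were not found, not that you have no cores.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text.

    Returns 0 if the text does not contain those fields.
    """
    cores = set()
    physical_id = core_id = None
    # The extra "" makes sure the last block is flushed even without a
    # trailing blank line.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends the block describing one logical CPU.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


if __name__ == "__main__":
    logical = os.cpu_count()  # hardware threads visible to the OS
    try:
        with open("/proc/cpuinfo") as f:
            physical = count_physical_cores(f.read())
    except OSError:
        physical = 0  # not Linux, or the file is unreadable
    print(f"logical CPUs: {logical}, physical cores: {physical or 'unknown'}")
```

If the two numbers differ by a factor of two, you have two hardware threads per core.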
Each physical core has its own set of execution units.
Keeping these execution units busy requires very well optimized code. Normally, a few of them will be sitting idle. A second thread on the same physical core can share these execution units to try to get closer to 100% utilization.
But many BLAS/LAPACK routines are so well optimized that a single thread can often come very close to using a core to its fullest. The cache contention from extra threads often hurts the performance of these routines, so some of them perform best with only a single thread per physical core.
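One way to try this out is to cap the BLAS thread count at the number of physical cores and compare timings. The sketch below assumes your BLAS backend honors one of the usual environment variables (`OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, or `OMP_NUM_THREADS`). Which one actually applies depends on the library you are linked against, and these variables are generally read when the library loads, so they need to be set before that happens. The helper `blas_thread_env` is just a name I made up, and 12 is the core count from the example above.

```python
import os

# Environment variables commonly read by BLAS/OpenMP runtimes at load time.
# Which one matters depends on the backend in use (an assumption here).
THREAD_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS")


def blas_thread_env(physical_cores):
    """Return env settings that request one BLAS thread per physical core."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")
    return {name: str(physical_cores) for name in THREAD_VARS}


if __name__ == "__main__":
    # 12 physical cores, as in the 24-thread example; adjust for your CPU.
    os.environ.update(blas_thread_env(12))
    # Load your numerical library only after this point so it sees the limit.
    for name in THREAD_VARS:
        print(name, "=", os.environ[name])
```

Run your benchmark once with 12 and once with 24 to see which is faster on your machine. The result can differ from routine to routine.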