BLAS/LAPACK are much faster with AVX-512, as is native code if you’re willing to vectorize all the bottlenecks. Besides double-width vectors, AVX-512 also offers twice as many vector registers (32 instead of 16, reducing register pressure) and efficient masking, which can make vectorizing with the likes of SIMD.jl easier.
Although masked instructions are about as efficient as their unmasked counterparts, unfortunately no compiler and very few libraries take advantage of them. Some of mine do, which is why PaddedMatrices.jl – which uses masking to handle unpadded matrices – was about 3x or more faster than Eigen for most small, statically sized (unpadded) matrices.
Last I tested, BLAS/LAPACK only benefit from AVX-512 if you’re using MKL, not OpenBLAS.
Unfortunately, the cheapest AVX-512 CPU I see from a quick search is a pre-owned 6-core 7800X for $300 on eBay. That’s 50% more than the Ryzen 3600, and the Ryzen has higher clock speeds and less than half the TDP.
For the CPU, unless you’re super excited about vectorization, the new Ryzens look like much better deals.
Old Ryzens did have half-rate 256-bit FMA throughput, which is bad for numerics and BLAS/LAPACK in particular. The 3600 & Co. are full-rate.
I compared my 9940X’s GeekBench results against a prototype of the upcoming 16-core Ryzen 3950X that recently made the news as “record setting”.
While my CPU came out behind in the multithreaded score (unless I overclocked), it did much better on the single-threaded SGEMM and SFFT subtests: 200.3 and 18.3 GFLOPS vs 98.8 and 13.5 GFLOPS.
So in the particular tasks I spend most of my time on, it does perform better.
Then again, the 3950X will debut for not much over half the cost of the 9940X, and at higher clock speeds than the GeekBenched part…