BLAS/LAPACK are much faster with avx512, as is native Julia code if you’re willing to vectorize all the bottlenecks. Besides double-width vectors, avx512 also offers twice the registers (reducing register pressure) and efficient masking, which can make vectorizing with the likes of SIMD easier.
Although masked instructions are about as efficient as their unmasked counterparts, unfortunately no compiler and very few libraries take advantage of them. Some of mine do, which is why PaddedMatrices.jl – which uses masking for unpadded matrices – was about 3x or more faster than Eigen for most small statically sized (unpadded) matrices.
Last I tested, BLAS/LAPACK only benefit from avx512 if you’re using MKL, and not if you’re using OpenBLAS.
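For anyone who wants to check this on their own machine, here is a minimal sketch of a GEMM throughput measurement using only the `LinearAlgebra` standard library. The helper name `gemm_gflops` and its defaults are made up for illustration. It times whichever BLAS your Julia build is linked against, so comparing MKL with OpenBLAS means running it once under each.

```julia
using LinearAlgebra

# Hypothetical helper (not from any package): estimate matrix-multiply
# throughput in GFLOPS for n×n matrices with element type T.
function gemm_gflops(n::Integer; T = Float32, trials = 3)
    A = rand(T, n, n)
    B = rand(T, n, n)
    C = similar(A)
    mul!(C, A, B)                # warm-up run so compilation is not timed
    best = Inf
    for _ in 1:trials
        t = @elapsed mul!(C, A, B)
        best = min(best, t)      # keep the fastest of the timed runs
    end
    return 2 * n^3 / best / 1e9  # a dense n×n×n product costs about 2n^3 flops
end

BLAS.set_num_threads(1)          # restrict BLAS to one thread for a single-core number
println("SGEMM: ", round(gemm_gflops(1000), digits = 1), " GFLOPS")
```

Pass `T = Float64` for the DGEMM equivalent. Numbers from a quick loop like this are noisy, so treat them as a rough comparison rather than a benchmark result.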
Unfortunately, the cheapest avx512 cpu I see from a quick search is a pre-owned 6-core 7800X for $300 on ebay. That’s 50% more than the Ryzen 3600. The Ryzen has higher clock speeds, and less than half the TDP.
For the CPU, unless you’re super excited about vectorization, the new Ryzens look like much better deals.
Old Ryzens did have half-rate 256-bit fma throughput, which is bad for numerics and BLAS/LAPACK in particular. The 3600 & Co are full-rate.
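The effect of half-rate versus full-rate fma on theoretical peak throughput can be worked out with simple arithmetic. The sketch below is only an illustration: the function name, the 4.0 GHz clock, and the assumption of two fma units per core are placeholders chosen for the example, not measured or vendor-quoted figures.

```julia
# Hypothetical helper: theoretical per-core peak in GFLOPS.
# Assumptions (illustrative only): `fma_units` fma pipes per core, each
# issuing one vector fma per cycle at `rate` (0.5 models half-rate).
function peak_gflops(; clock_ghz, vector_bits, element_bits = 64,
                     fma_units = 2, rate = 1.0)
    lanes = vector_bits ÷ element_bits          # elements per vector
    return clock_ghz * lanes * 2 * fma_units * rate   # one fma = 2 flops
end

# Single precision at an assumed 4.0 GHz:
println(peak_gflops(clock_ghz = 4.0, vector_bits = 256, element_bits = 32, rate = 0.5)) # 64.0
println(peak_gflops(clock_ghz = 4.0, vector_bits = 256, element_bits = 32))             # 128.0
println(peak_gflops(clock_ghz = 4.0, vector_bits = 512, element_bits = 32))             # 256.0
```

Under these assumptions each doubling (half-rate to full-rate 256-bit, then 256-bit to 512-bit) doubles the ceiling. Real SGEMM results land below the ceiling, and clocks usually drop under heavy vector load.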
EDIT: My 9940X GeekBench vs a prototype of the upcoming 16-core Ryzen 3950X that made the news recently as “record setting”.
While my CPU came out behind in the multithread score (unless I overclocked), the single-threaded SGEMM and SFFT performed much better, at 200.3 and 18.3 GFLOPS vs 98.8 and 13.5 GFLOPS.
So in the particular tasks I spend most of my time on, it does perform better.
Then again, the 3950X will debut for not much over half the cost of the 9940X, and at higher clock speeds than the GeekBenched part…
I know @Tamas_Papp is not in the UK. If anyone in the UK is looking for a custom built workstation I would recommend a company I used to work for. I admit though that they build gaming PCs with all the nice cases and lights, and VR rigs. Message me offline for a contact.
Thanks for the advice — unfortunately they are not (yet) available in retail in Europe, even if the CPU is nominally “released”. So I have more time to plan.
Does this have any practical consequence for the RAM I should choose? I plan to go with a G.SKILL 32GB Aegis DDR4 3000MHz CL16 KIT (F4-3000C16D-32GISB). I figured that CL15 or faster RAM clock would not make a huge difference for me, on a B450 chipset.
The X570 chipset is only important if you want lightning-fast PCIe Gen 4.0 I/O speed in the future.
The 3x M.2 is important for data-intensive work.
Why?
The big improvement in Julia 1.2 (1.3) will be real multithreading, so the thread count will be important! Later, more and more Julia packages will auto-scale to use all available threads …
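As a rough sketch of what "use all available threads" looks like with the `Threads` standard library: the function name below is hypothetical, and the chunking scheme is just one simple way to split the work. Julia needs to be started with more than one thread (for example via the `JULIA_NUM_THREADS` environment variable) for this to run in parallel.

```julia
# Hypothetical example: sum of squares, split into one chunk per thread.
function threaded_sum_of_squares(x::Vector{Float64})
    nchunks = Threads.nthreads()          # scales with however Julia was started
    partials = zeros(Float64, nchunks)    # one slot per chunk, so no data races
    chunk_len = cld(length(x), nchunks)
    Threads.@threads for c in 1:nchunks
        lo = (c - 1) * chunk_len + 1
        hi = min(c * chunk_len, length(x))
        s = 0.0
        for i in lo:hi                    # empty range if this chunk has no work
            s += x[i]^2
        end
        partials[c] = s
    end
    return sum(partials)
end

println("threads available: ", Threads.nthreads())
println(threaded_sum_of_squares(collect(1.0:100.0)))
```

With a single thread this degrades to an ordinary serial loop, so the same code works on any machine and simply speeds up as the thread count grows.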
I ended up with a ThreadRipper 2950X with a Gigabyte X399 Designare motherboard. Multithreading is pretty great (16 cores, 32 threads), but it does suffer a bit in sequential floating point, I think from missing out on MKL optimizations. Not sure how similar that is to the 3600, but I’d be happy to run any Julia benchmarks if that would be informative for you.
The Zen 2 (Ryzen 3600) AVX2 is much better: “The key highlight improvement for floating point performance is full AVX2 support. AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles. This is enhanced by giving 256-bit loads and stores, so the FMA units can be continuously fed. AMD states that due to its energy aware scheduling, there is no predefined frequency drop when using AVX2 instructions (however frequency may be reduced dependent on temperature and voltage requirements, but that’s automatic regardless of instructions used)”
They’ll be released July 7th in the US (7/7 for the 7nm parts).
The 7nm Ryzen clearly beat earlier chips, especially for numerical workloads.
I think they also look like much better choices than all non-avx512 intel parts, which is why I focused on them.
If you don’t mind installing a bunch of unregistered libraries, you could try benchmarking the vectorized pow functions here, or small matrix multiplication like in the “3x or more faster than Eigen” link.
Here is a 2950X GeekBench result that did very well. By adding .gb4 to the end of the urls, you can see some sampled clock speeds. That one ran at 4.4 GHz, while the 3950X prototype was slower at 4.29 GHz (the released version will clock higher).
While its overall scores were comparable to the 3950X, in SGEMM and SFFT it was 62.9 and 10.4 GFLOPS vs 98.8 and 13.5 GFLOPS.
(I suspect the jump is smaller than that provided by avx512, because avx512 doubles the number of registers on top of doubling their width, letting you use larger kernels and reducing the ratio of move to fma instructions.)
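To make the register argument concrete, here is a toy cost model. It is a deliberate simplification and not how any particular BLAS picks its kernel. The assumptions: a matrix-multiply microkernel keeps `mvec × ncols` accumulator vectors in registers, plus `mvec` registers for a slice of A and one register for a broadcast element of B; each inner-loop step then issues `mvec + ncols` loads and `mvec * ncols` fmas. The function name is hypothetical.

```julia
# Toy model (assumption-laden): find the kernel shape that fits in `nregs`
# vector registers and minimizes loads per fma.
function best_kernel(nregs::Integer)
    best = (ratio = Inf, mvec = 0, ncols = 0)
    for mvec in 1:nregs, ncols in 1:nregs
        needed = mvec * ncols + mvec + 1      # accumulators + A slice + B broadcast
        needed <= nregs || continue           # skip shapes that would spill
        ratio = (mvec + ncols) / (mvec * ncols)   # loads per fma
        if ratio < best.ratio
            best = (ratio = ratio, mvec = mvec, ncols = ncols)
        end
    end
    return best
end

println(best_kernel(16))   # 16 vector registers, as with AVX2
println(best_kernel(32))   # 32 vector registers, as with avx512
```

Under this model, 16 registers allow a 3×4 block with 7 loads per 12 fmas (about 0.58), while 32 registers allow a 5×5 block with 10 loads per 25 fmas (0.4). Real kernels choose shapes for other reasons too, but the direction of the effect is the same.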
For JuMP, the first step to answering that question is determining whether most of your time is spent in the problem formulation (actual JuMP/MathOptInterface/solver wrapper Julia code) or inside the particular solver you’re using. Most likely it’s the latter, in which case the answer is of course solver-dependent. But generally, if you’re solving mixed-integer programs, then most solvers can exploit multiple cores. A lot of algorithms for solving LPs and QPs, as well as gradient-based nonlinear optimization, are harder to parallelize.
I am working on a paper that applies Bayesian analysis to high-frequency foreign exchange price data. The data has a size of 100 gigabytes. I guess I need 128 GB of RAM and a good CPU to do that.
I already have the SAS code for the MCMC methodology I want to use, and it is from 2002. I kind of remember HMC is newer than that, right? I think I will just do the translation from SAS to Julia. If I need more advanced stuff, I will take a look at HMC.
At home, I’ve been using an AMD Ryzen 3900X with 32GB of RAM. It has been working fantastically for Julia parallel processing. I’m very happy with it (was on sale last year for ~$400). It is about twice as fast as the Intel i5 8th Gen laptop I was issued by my work. Both the 3900X and the 5900X score well on PassMark’s price to value chart (PassMark CPU Value Chart - Performance / Price of available CPUs).
I think 24 threads is a sweet spot for developing parallel programs. If you are processing large problems on a daily basis, then something bigger might be required depending on the value of your time and how fast you need to turn around projects. While I would like a Threadripper 3990X, the cost is very high (~$4,000), so I would have to be getting a tremendous value from the speed up. Maybe if I was researching a cure or vaccine, then the speed would be justified.