Workstation advice (for mostly Julia use)

Tamas_Papp · June 18, 2019, 9:18am

I am about to build a new workstation. It will mainly be used for numerical tasks, which nowadays means 99% Julia for me. This system is the sweet spot for jobs that a laptop is not adequate for, yet where I value having full control over everything more than using our cluster.

Typically I write Julia code that uses 10–30% BLAS/LAPACK, rest is native. For parallel workloads (MCMC, simulation), I use 3–5 threads. Lot of branches, typical CPU, not GPU stuff.

I was thinking that I would wait for the Ryzen 5 3600 (not X, as it does not seem to be worth it) to trickle down to retail in a few weeks. But I am not sure what motherboard to use, and whether the choice of chipsets makes any difference.

Most advice I found online is about building gaming PCs, not sure how much of that carries over. Any advice would be appreciated.

louisponet · June 18, 2019, 9:31am

I think it might be worth looking at recommendations for building “creator” PCs, i.e. for YouTubers, people who stream, and so on. The rendering/exporting and streaming tasks seem to be quite CPU-demanding and relatively well optimized (not sure how far they actually use BLAS, but most are AVX workloads). I’m not sure if there are build guides around that topic.

From what I know (grain of salt), motherboards mostly matter in terms of overclocking (number of power delivery phases, VRM quality, etc.) and PCIe availability. Aside from that, I think the cheapest board that has the features you want is fine.

A note on Ryzen: I’ve heard that the previous notion of “RAM clock speeds don’t matter so much” doesn’t hold, since the speed of the Infinity Fabric between the CPU chiplets scales with the RAM speed.

longemen3000 · June 19, 2019, 10:00pm

The new generation has some sort of multiplier on the Infinity Fabric clock, but the dependence still holds.
From AnandTech:

One of the features of (infinity fabric, second generation) is that the clock has been decoupled from the main DRAM clock. In Zen and Zen+, the IF frequency was coupled to the DRAM frequency, which led to some interesting scenarios where the memory could go a lot faster but the limitations in the IF meant that they were both limited by the lock-step nature of the clock. For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half.
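
To make the ratios in the quote concrete, here is a rough arithmetic sketch. It assumes that in 1:1 mode the fabric clock equals the memory clock, and that the memory clock is half the DDR transfer rate; the function names are made up for illustration.

```julia
# DDR memory transfers twice per clock, so DDR4-3000 runs a 1500 MHz memory clock.
memclk_mhz(ddr_rate) = ddr_rate / 2

# `ratio` is the memory-clock : fabric-clock ratio (1 for 1:1, 2 for 2:1).
# Assumption: in 1:1 mode the fabric clock simply equals the memory clock.
fabric_mhz(ddr_rate; ratio = 1) = memclk_mhz(ddr_rate) / ratio

for rate in (3000, 3600)
    println("DDR4-", rate, ": 1:1 -> ", fabric_mhz(rate), " MHz, 2:1 -> ",
            fabric_mhz(rate; ratio = 2), " MHz")
end
```

Under these assumptions a DDR4-3000 kit would give a 1500 MHz fabric clock at 1:1 and 750 MHz at 2:1.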

Elrod · June 19, 2019, 10:40pm

BLAS/LAPACK are much faster with avx512, as is native code if you’re willing to vectorize all the bottlenecks. Besides double-width vectors, it also offers twice the registers (reducing register pressure) and efficient masking, which can make vectorizing with the likes of SIMD easier.
Although masked instructions are about as efficient as their unmasked counterparts, unfortunately no compiler and very few libraries take advantage of them. Some of mine do, which is why PaddedMatrices.jl – which uses masking for unpadded matrices – was about 3x or more faster than Eigen for most small statically sized (unpadded) matrices.

Last I tested, BLAS/LAPACK only benefit from avx512 if you’re using MKL, and not if you’re using OpenBLAS.
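
For anyone wanting to check this on their own machine, a minimal sketch using only the standard library might look like the following. The helper names `blas_info` and `gemm_gflops` are made up here, and the GFLOPS figure is a rough estimate, not a careful benchmark.

```julia
using LinearAlgebra

# Report which BLAS library this Julia session is linked against.
# BLAS.get_config() exists on Julia 1.7+; older versions have BLAS.vendor().
blas_info() = isdefined(BLAS, :get_config) ? string(BLAS.get_config()) : string(BLAS.vendor())

# Time an n×n Float64 matrix multiply and convert to GFLOPS
# (a dense matrix multiply performs about 2n^3 floating point operations).
function gemm_gflops(n; reps = 3)
    A, B = rand(n, n), rand(n, n)
    C = similar(A)
    mul!(C, A, B)                 # warm-up run, so compilation is not timed
    best = Inf
    for _ in 1:reps
        dt = @elapsed mul!(C, A, B)
        best = min(best, dt)      # keep the fastest of the repetitions
    end
    return 2 * n^3 / best / 1e9
end

println("BLAS: ", blas_info())
println("GEMM: ", round(gemm_gflops(1000), digits = 1), " GFLOPS")
```

Running this once with the default OpenBLAS build and once with MKL loaded would show whether the library makes a difference on a given CPU.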

Unfortunately, the cheapest avx512 cpu I see from a quick search is a pre-owned 6-core 7800X for $300 on ebay. That’s 50% more than the Ryzen 3600. The Ryzen has higher clock speeds, and less than half the TDP.

For the CPU, unless you’re super excited about vectorization, the new Ryzens look like much better deals.
Older Ryzens did have half-rate 256-bit FMA throughput, which is bad for numerics and BLAS/LAPACK in particular. The 3600 & Co. are full-rate.
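
As a back-of-the-envelope illustration of what half-rate vs full-rate means for peak throughput, here is a small sketch. The two-FMA-pipes-per-core figure is an assumption not stated in this thread, and half-rate 256-bit is modelled as two 128-bit units.

```julia
# Peak floating point operations per cycle per core:
# (FMA units) × (lanes per vector) × 2, since one FMA is a multiply plus an add.
peak_flops_per_cycle(units, vector_bits, elem_bits) = units * (vector_bits ÷ elem_bits) * 2

# Assumed configurations: two FMA pipes per core in each case.
half_rate_256 = peak_flops_per_cycle(2, 128, 64)   # 256-bit FMA at half rate
full_rate_256 = peak_flops_per_cycle(2, 256, 64)   # full-rate 256-bit FMA
avx512        = peak_flops_per_cycle(2, 512, 64)   # two 512-bit FMA units

println((half_rate_256, full_rate_256, avx512))    # Float64 flops per cycle
```

Under these assumptions each step doubles the per-cycle peak for Float64 (8, 16, 32), before clock speed differences are taken into account.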

EDIT:
My 9940X GeekBench vs a prototype of the upcoming 16-core Ryzen 3950X that made the news recently as “record setting”.
While my CPU came out behind in the multithreaded score (unless I overclocked), the single-threaded SGEMM and SFFT results were much better, at 200.3 and 18.3 GFLOPS vs 98.8 and 13.5 GFLOPS.
So in the particular tasks I spend most of my time on, it does perform better.
Then again, the 3950X will debut for not much over half the cost of the 9940X, and at higher clock speeds than the GeekBenched part…

johnh · June 20, 2019, 5:41am

I know @Tamas_Papp is not in the UK. If anyone in the UK is looking for a custom-built workstation, I would recommend a company I used to work for. I admit, though, that they build gaming PCs with all the nice cases and lights, and VR rigs. Message me offline for a contact.

Tamas_Papp · June 20, 2019, 6:05am

Thanks for the advice — unfortunately they are not (yet) available in retail in Europe, even if the CPU is nominally “released”. So I have more time to plan.

Does this have any practical consequence for the RAM I should choose? I plan to go with a G.SKILL 32GB Aegis DDR4 3000MHz CL16 KIT (F4-3000C16D-32GISB). I figured that CL15 or faster RAM clock would not make a huge difference for me, on a B450 chipset.

saschatimme · June 20, 2019, 6:47am

There is this slide from AMD regarding RAM speed:

(Picture taken from the Zen 2 deep dive at AnandTech.)

yakir12 · June 20, 2019, 8:40am

I mean for a company like AMD they sure make shitty graphs…

ImreSamu · June 20, 2019, 12:39pm

I use 3–5 threads. Lot of branches,

My ideal future proof machine will be:

12c/24t Ryzen 3900X (~550 EUR)
basic X570 motherboard with 3 M.2 PCIe Gen4 slots and 4 DIMM slots (~200 EUR)
Corsair MP600 PCIe Gen 4 M.2 SSD 1 TB to Cost € 249 “4950 MB/s and 4250MB/s in sequential reading and writing”
- https://www.guru3d.com/news-story/corsair-mp600-pcie-gen-4-m-2-ssd-1-tb-to-cost-€-249.html

The X570 chipset only matters if you want lightning-fast Gen4 I/O speed in the future.
The 3x M.2 slots matter for data-intensive work.

Why?

The big improvement in Julia 1.2 (1.3) will be real multithreading,
so the thread count will be important! Later, more and more Julia packages will auto-scale to use all available threads …
I like GNU parallel.
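
To illustrate how independent simulation runs can be spread over threads, here is a minimal sketch using only the standard library. `simulate_chain` is a toy stand-in for a real MCMC chain or simulation, and both function names are hypothetical.

```julia
using Random, Statistics

# Toy stand-in for one chain of a simulation: mean of `n` standard normal draws.
simulate_chain(rng, n) = mean(randn(rng) for _ in 1:n)

function run_chains(nchains, n; seed = 1)
    results = Vector{Float64}(undef, nchains)
    # Each chain gets its own RNG, so no random state is shared between threads.
    Threads.@threads for i in 1:nchains
        rng = MersenneTwister(seed + i)
        results[i] = simulate_chain(rng, n)
    end
    return results
end

println("running on ", Threads.nthreads(), " thread(s)")
println(run_chains(4, 100_000))
```

The number of threads is fixed when Julia starts, for example through the `JULIA_NUM_THREADS` environment variable, so extra cores only help if that is set accordingly.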

cscherrer · June 20, 2019, 1:52pm

I ended up with a Threadripper 2950X with a Gigabyte X399 Designare motherboard. Multithreading is pretty great (16 cores, 32 threads), but it does suffer a bit in sequential floating point, I think from missing out on MKL optimizations. Not sure how similar that is to the 3600, but I’d be happy to run any Julia benchmarks if that would be informative for you.

ImreSamu · June 20, 2019, 2:31pm

The AVX2 implementation in Zen 2 (Ryzen 3600) is much better:
“The key highlight improvement for floating point performance is full AVX2 support. AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles. This is enhanced by giving 256-bit loads and stores, so the FMA units can be continuously fed. AMD states that due to its energy aware scheduling, there is no predefined frequency drop when using AVX2 instructions (however frequency may be reduced dependent on temperature and voltage requirements, but that’s automatic regardless of instructions used)”

Elrod · June 20, 2019, 3:49pm

They’ll be released July 7th in the US (7/7 for the 7nm parts).

The 7nm Ryzens clearly beat earlier chips, especially for numerical workloads.
I think they also look like much better choices than all non-avx512 Intel parts, which is why I focused on them.

If you don’t mind installing a bunch of unregistered libraries, you could try benchmarking the vectorized pow functions here, or small matrix multiplication like in the “3x or more faster than Eigen” link.

Here is a 2950X GeekBench result that did very well. By adding .gb4 to the end of the URLs, you can see some sampled clock speeds. That one ran at 4.4 GHz, while the 3950X prototype was slower at 4.29 GHz (the released version will clock higher).
While its overall scores were comparable to the 3950X, in SGEMM and SFFT it was 62.9 and 10.4 GFLOPS vs 98.8 and 13.5 GFLOPS.
(I suspect the jump is smaller than that provided by avx512, because avx512 doubles the number of registers on top of doubling their width, letting you use larger kernels and reducing the ratio of move to FMA instructions.)

cscherrer · June 20, 2019, 5:09pm

julia> @benchmark directexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.050 μs (0.00% GC)
  median time:      17.120 μs (0.00% GC)
  mean time:        17.461 μs (0.00% GC)
  maximum time:     49.566 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark cobexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.517 μs (0.00% GC)
  median time:      25.084 μs (0.00% GC)
  mean time:        27.077 μs (0.00% GC)
  maximum time:     114.213 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark syspow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.135 μs (0.00% GC)
  median time:      15.856 μs (0.00% GC)
  mean time:        16.032 μs (0.00% GC)
  maximum time:     56.076 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark jsleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.094 μs (0.00% GC)
  median time:      16.188 μs (0.00% GC)
  mean time:        16.904 μs (0.00% GC)
  maximum time:     76.149 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark csleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.834 μs (0.00% GC)
  median time:      12.207 μs (0.00% GC)
  mean time:        13.216 μs (0.00% GC)
  maximum time:     63.291 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark xsimdpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.904 μs (0.00% GC)
  median time:      6.982 μs (0.00% GC)
  mean time:        7.218 μs (0.00% GC)
  maximum time:     26.691 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark jsleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.863 μs (0.00% GC)
  median time:      8.010 μs (0.00% GC)
  mean time:        8.409 μs (0.00% GC)
  maximum time:     28.134 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark csleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.140 μs (0.00% GC)
  median time:      5.274 μs (0.00% GC)
  mean time:        5.662 μs (0.00% GC)
  maximum time:     25.680 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark xsimdpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.515 μs (0.00% GC)
  median time:      5.578 μs (0.00% GC)
  mean time:        5.756 μs (0.00% GC)
  maximum time:     17.309 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

Yifan_Liu · August 25, 2019, 9:46pm

I am considering buying AMD’s $749 Ryzen 9 3950X next month. Will the 16 cores help the performance of packages like JuMP and DifferentialEquations?

cscherrer · August 25, 2019, 10:34pm

I don’t know about JuMP, but the DiffEq FAQ has some information about this that might be helpful

tkoolen · August 25, 2019, 10:43pm

For JuMP, the first step to answering that question is determining whether most of your time is spent in the problem formulation (actual JuMP/MathOptInterface/solver wrapper Julia code) or inside the particular solver you’re using. Most likely it’s the latter, in which case the answer is of course solver-dependent. But generally, if you’re solving mixed-integer programs, then most solvers can exploit multiple cores. A lot of algorithms for solving LPs and QPs, as well as gradient-based nonlinear optimization are harder to parallelize.

Yifan_Liu · August 25, 2019, 10:57pm

I am working on a paper that applies Bayesian analysis to high-frequency foreign exchange price data. The data set is about 100 gigabytes. I guess I need 128 GB of RAM and a good CPU to do that.

cscherrer · August 26, 2019, 12:45am

You could maybe stream it so it’s never all in memory at once. But more memory is always helpful.
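
As a rough sketch of what streaming could look like, the following computes a running mean and variance one value at a time. The one-price-per-line layout is a hypothetical file format, and the function names are made up for illustration.

```julia
# Running mean and variance over a stream of values (Welford's algorithm),
# so only one value needs to be held in memory at a time.
function streaming_mean_var(values)
    n, μ, m2 = 0, 0.0, 0.0
    for x in values
        n += 1
        δ = x - μ
        μ += δ / n
        m2 += δ * (x - μ)
    end
    return (n = n, mean = μ, var = n > 1 ? m2 / (n - 1) : NaN)
end

# Hypothetical layout: one price per line. The generator parses lines lazily.
prices_from(io) = (parse(Float64, line) for line in eachline(io))

# Small in-memory stand-in for a large file on disk.
io = IOBuffer("1.10\n1.12\n1.11\n1.13\n")
println(streaming_mean_var(prices_from(io)))
```

For a real file, the `IOBuffer` would be replaced by a handle from `open`. Whether a given Bayesian method can be written in this one-pass form is a separate question.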

What Bayesian method are you thinking?

Yifan_Liu · August 26, 2019, 1:37am

I am thinking about using an MCMC algorithm.

cscherrer · August 26, 2019, 1:39am

Hamiltonian Monte Carlo, or something else? There are a few HMC options in Julia.
