Workstation advice (for mostly Julia use)

BLAS/LAPACK are much faster with AVX-512, as is native Julia code if you’re willing to vectorize all the bottlenecks. Besides double-width vectors, AVX-512 also offers twice as many registers (reducing register pressure) and efficient masking, which can make vectorizing with the likes of SIMD.jl easier.
Although masked instructions are about as efficient as their unmasked counterparts, unfortunately no compiler and very few libraries take advantage of them. Some of mine do, which is why PaddedMatrices.jl – which uses masking for unpadded matrices – was 3x or more faster than Eigen for most small, statically sized (unpadded) matrices.
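For a concrete picture, here is a minimal sketch of manual vectorization with SIMD.jl; the function name and chunk width are my own, and on AVX-512 a masked load/store could replace the scalar tail:

using SIMD

# Vectorized c .= a .+ b in Vec{8,Float64} chunks, with a scalar tail loop.
function vadd!(c::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    N = 8
    n = length(a)
    i = 1
    @inbounds while i + N - 1 <= n
        va = vload(Vec{N,Float64}, a, i)
        vb = vload(Vec{N,Float64}, b, i)
        vstore(va + vb, c, i)
        i += N
    end
    @inbounds while i <= n   # remainder loop; masking would make this unnecessary
        c[i] = a[i] + b[i]
        i += 1
    end
    return c
end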

Last I tested, BLAS/LAPACK only benefit from AVX-512 if you’re using MKL, not if you’re using OpenBLAS.
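You can check which BLAS your Julia is using from the REPL (stock builds return :openblas; :mkl requires building Julia against MKL):

julia> using LinearAlgebra

julia> BLAS.vendor()
:openblas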

Unfortunately, the cheapest AVX-512 CPU I see from a quick search is a pre-owned 6-core i7-7800X for $300 on eBay. That’s 50% more than the Ryzen 3600, which has higher clock speeds and less than half the TDP.

For the CPU, unless you’re super excited about vectorization, the new Ryzens look like much better deals.
Old Ryzens did have half-rate 256-bit FMA throughput, which is bad for numerics and BLAS/LAPACK in particular. The 3600 & co. are full-rate.
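A rough way to see the difference in practice: dgemm is dominated by 256-bit FMAs, so single-threaded peakflops should score roughly twice as high on full-rate parts (a sketch; the matrix size is arbitrary):

using LinearAlgebra
BLAS.set_num_threads(1)              # isolate per-core FMA throughput
LinearAlgebra.peakflops(4096) / 1e9  # GFLOPS from a dgemm benchmark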

EDIT:
My 9940X GeekBench result vs a prototype of the upcoming 16-core Ryzen 3950X that recently made the news as “record setting”.
While my CPU came out behind in the multithreaded score (unless I overclocked), its single-threaded SGEMM and SFFT performed much better, at 200.3 and 18.3 GFLOPS vs 98.8 and 13.5 GFLOPS.
So in the particular tasks I spend most of my time on, it does perform better.
Then again, the 3950X will debut for not much over half the cost of the 9940X, and at higher clock speeds than the GeekBenched part…
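If you want a comparable single-threaded SGEMM number for your own machine, something like this works (a sketch; GeekBench’s exact matrix sizes are not reproduced here):

using LinearAlgebra, BenchmarkTools
BLAS.set_num_threads(1)
n = 1000
A, B, C = rand(Float32, n, n), rand(Float32, n, n), zeros(Float32, n, n)
t = @belapsed mul!($C, $A, $B)
2n^3 / t / 1e9      # single-threaded SGEMM GFLOPS (2n^3 flops per multiply)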

4 Likes

I know @Tamas_Papp is not in the UK, but if anyone in the UK is looking for a custom-built workstation, I would recommend a company I used to work for. I admit, though, that they build gaming PCs with all the nice cases and lights, and VR rigs. Message me offline for a contact.

Thanks for the advice — unfortunately they are not (yet) available in retail in Europe, even if the CPU is nominally “released”. So I have more time to plan.

Does this have any practical consequence for the RAM I should choose? I plan to go with a G.SKILL 32GB Aegis DDR4 3000MHz CL16 kit (F4-3000C16D-32GISB). I figured that CL15 or a faster RAM clock would not make a huge difference for me on a B450 chipset.
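If you want to gauge how much RAM speed matters for your workloads, a crude STREAM-style copy test is easy (a sketch; the arrays are sized to exceed the caches):

using BenchmarkTools
a = rand(Float64, 10^8); b = similar(a)  # ~800 MB each
t = @belapsed copyto!($b, $a)
2 * sizeof(a) / t / 1e9                  # GB/s, one read + one write stream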

There is this slide from AMD regarding RAM speed:

(Picture taken from the Zen 2 deep dive at AnandTech)

2 Likes

I mean for a company like AMD they sure make shitty graphs…

2 Likes

I use 3–5 threads. Lots of branches.

My ideal future-proof machine would be:

The X570 chipset is only important if you want lightning-fast PCIe Gen 4.0 I/O speed in the future.
The 3× M.2 slots are important for data-intensive work.

Why?

  • Julia 1.2 (1.3): the big improvement will be real multithreading,
    so the thread count will be important! Later, more and more Julia packages will auto-scale to use all available threads… (see the sketch after this list)
  • I like GNU parallel :slight_smile:
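As a minimal sketch of what that threading looks like (the function name is my own; start Julia with e.g. JULIA_NUM_THREADS=12):

using Base.Threads

# Threaded sum of f over xs, accumulating one partial result per thread.
function tsum(f, xs)
    partials = zeros(nthreads())
    @threads for i in eachindex(xs)
        partials[threadid()] += f(xs[i])
    end
    return sum(partials)
end

tsum(sin, rand(10^7))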
5 Likes

I ended up with a Threadripper 2950X with a Gigabyte X399 Designare motherboard. Multithreading is pretty great (16 cores, 32 threads), but it does suffer a bit in sequential floating point, I think from missing out on MKL optimizations. Not sure how similar that is to the 3600, but I’d be happy to run any Julia benchmarks if that would be informative for you.

1 Like

The Zen 2 (Ryzen 3600) AVX2 is much better:
“The key highlight improvement for floating point performance is full AVX2 support. AMD has increased the execution unit width from 128-bit to 256-bit, allowing for single-cycle AVX2 calculations, rather than cracking the calculation into two instructions and two cycles. This is enhanced by giving 256-bit loads and stores, so the FMA units can be continuously fed. AMD states that due to its energy aware scheduling, there is no predefined frequency drop when using AVX2 instructions (however frequency may be reduced dependent on temperature and voltage requirements, but that’s automatic regardless of instructions used)”
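One hedged way to confirm that the compiler actually emits 256-bit FMAs on such a chip is to inspect the native code for a simple muladd loop and look for vfmadd instructions on ymm registers:

using InteractiveUtils

# A muladd-heavy loop that the compiler should vectorize with @simd.
function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(y)
        y[i] = muladd(a, x[i], y[i])
    end
end

@code_native axpy!(rand(100), 2.0, rand(100))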

3 Likes

They’ll be released July 7th in the US (7/7 for the 7nm parts).

The 7nm Ryzens clearly beat earlier chips, especially for numerical workloads.
I think they also look like much better choices than all non-AVX-512 Intel parts, which is why I focused on them.

If you don’t mind installing a bunch of unregistered libraries, you could try benchmarking the vectorized pow functions here, or small matrix multiplication like in the “3x or more faster than Eigen” link.

Here is a 2950X GeekBench result that did very well. By adding .gb4 to the end of the URLs, you can see some sampled clock speeds. That one ran at 4.4 GHz, while the 3950X prototype was slower at 4.29 GHz (the released version will clock higher).
While its overall scores were comparable to the 3950X’s, in SGEMM and SFFT it scored 62.9 and 10.4 GFLOPS vs 98.8 and 13.5 GFLOPS.
(I suspect the jump is smaller than the one AVX-512 provides because AVX-512 doubles the number of registers on top of doubling their width, letting you use larger kernels and reducing the ratio of move to FMA instructions.)

1 Like
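For context, the benchmarks below were presumably set up something like this (the array length is an assumption, not stated in the post; each f!(c, a, b) fills c with an elementwise power a .^ b using a different vectorized pow implementation from the unregistered libraries mentioned above):

julia> using BenchmarkTools

julia> n = 1024; a, b = rand(n), rand(n); c = similar(a);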
julia> @benchmark directexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     17.050 μs (0.00% GC)
  median time:      17.120 μs (0.00% GC)
  mean time:        17.461 μs (0.00% GC)
  maximum time:     49.566 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark cobexp!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.517 μs (0.00% GC)
  median time:      25.084 μs (0.00% GC)
  mean time:        27.077 μs (0.00% GC)
  maximum time:     114.213 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark syspow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     15.135 μs (0.00% GC)
  median time:      15.856 μs (0.00% GC)
  mean time:        16.032 μs (0.00% GC)
  maximum time:     56.076 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark jsleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.094 μs (0.00% GC)
  median time:      16.188 μs (0.00% GC)
  mean time:        16.904 μs (0.00% GC)
  maximum time:     76.149 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark csleefpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.834 μs (0.00% GC)
  median time:      12.207 μs (0.00% GC)
  mean time:        13.216 μs (0.00% GC)
  maximum time:     63.291 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark xsimdpow!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.904 μs (0.00% GC)
  median time:      6.982 μs (0.00% GC)
  mean time:        7.218 μs (0.00% GC)
  maximum time:     26.691 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark jsleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.863 μs (0.00% GC)
  median time:      8.010 μs (0.00% GC)
  mean time:        8.409 μs (0.00% GC)
  maximum time:     28.134 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark csleefpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.140 μs (0.00% GC)
  median time:      5.274 μs (0.00% GC)
  mean time:        5.662 μs (0.00% GC)
  maximum time:     25.680 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark xsimdpowcob!($c, $a, $b)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.515 μs (0.00% GC)
  median time:      5.578 μs (0.00% GC)
  mean time:        5.756 μs (0.00% GC)
  maximum time:     17.309 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6
2 Likes

I am considering buying AMD’s $749 Ryzen 9 3950X next month. Will the 16 cores help the performance of packages like JuMP and DifferentialEquations?

I don’t know about JuMP, but the DiffEq FAQ has some information about this that might be helpful.

2 Likes

For JuMP, the first step to answering that question is determining whether most of your time is spent in the problem formulation (actual JuMP/MathOptInterface/solver-wrapper Julia code) or inside the particular solver you’re using. Most likely it’s the latter, in which case the answer is of course solver-dependent. But generally, if you’re solving mixed-integer programs, then most solvers can exploit multiple cores. A lot of algorithms for solving LPs and QPs, as well as for gradient-based nonlinear optimization, are harder to parallelize.
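For example, with a MIP solver that supports multithreading, you can typically set a thread-count option through JuMP (a sketch using the open-source Cbc solver; the parameter name varies by solver):

using JuMP, Cbc

model = Model(Cbc.Optimizer)
set_optimizer_attribute(model, "threads", 16)  # solver-specific option name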

9 Likes

I am working on a paper that applies Bayesian analysis to high-frequency foreign exchange price data. The data is about 100 gigabytes, so I guess I need 128 GB of RAM and a good CPU.

1 Like

You could maybe stream it so it’s never all in memory at once. But more memory is always helpful.
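For instance, if the prices sit in a flat binary file, you could memory-map it and process it in chunks (a sketch; the filename and the per-chunk update are placeholders):

using Mmap

# Stream a flat file of raw Float64 prices without loading it all at once.
function process_file(path)
    io = open(path)
    xs = Mmap.mmap(io, Vector{Float64}, filesize(io) ÷ 8)
    acc = 0.0
    for chunk in Iterators.partition(xs, 10^7)  # ~80 MB at a time
        acc += sum(chunk)                       # replace with the real update
    end
    close(io)
    return acc
end

process_file("prices.bin")                      # hypothetical file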

What Bayesian method are you thinking?

1 Like

I am thinking about using an MCMC algorithm.

Hamiltonian Monte Carlo, or something else? There are a few HMC options in Julia.
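For example, Turing.jl exposes NUTS, the adaptive HMC variant. A minimal sketch with a placeholder model and fake data:

using Turing

@model function demo(x)
    μ ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5), 0, Inf)
    for i in eachindex(x)
        x[i] ~ Normal(μ, σ)
    end
end

chain = sample(demo(randn(100)), NUTS(), 1000)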

1 Like

I already have the SAS code for the MCMC methodology I want to use, and it is from 2002. I seem to remember HMC is newer than that, right? I think I will just do the translation from SAS to Julia. If I need more advanced stuff, I will take a look at HMC.

MCMC is of course embarrassingly parallelisable, but usually one does not run more than 5 chains.
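For what it’s worth, with Turing that handful of chains can at least run in parallel across threads, e.g. reusing the demo model sketched above (requires starting Julia with multiple threads, e.g. JULIA_NUM_THREADS=4):

chains = sample(demo(randn(100)), NUTS(), MCMCThreads(), 1000, 4)  # 4 chains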

At home, I’ve been using an AMD Ryzen 3900X with 32GB of RAM. It has been working fantastically for Julia parallel processing. I’m very happy with it (was on sale last year for ~$400). It is about twice as fast as the Intel i5 8th Gen laptop I was issued by my work. Both the 3900X and the 5900X score well on PassMark’s price to value chart (PassMark CPU Value Chart - Performance / Price of available CPUs).

I think 24 threads is a sweet spot for developing parallel programs. If you are processing large problems on a daily basis, then something bigger might be required, depending on the value of your time and how fast you need to turn around projects. While I would like a Threadripper 3990X, the cost is very high (~$4,000), so I would have to be getting tremendous value from the speed-up. Maybe if I were researching a cure or vaccine, the speed would be justified.

4 Likes