OpenBLAS is faster than Intel MKL on AMD Hardware (Ryzen)

Oh, I didn’t know that. Is there an easy way to check?

There’s a -p option when you start Julia, is that what you mean? I didn’t think I needed this, since I set the environment variable. Am I misunderstanding this?

The docs say,

If the underlying BLAS is using multiple threads, higher flop rates are realized. The number of BLAS threads can be set with BLAS.set_num_threads(n).

If the keyword argument parallel is set to true, peakflops is run in parallel on all the worker processors. The flop rate of the entire parallel computer is returned. When running in parallel, only 1 BLAS thread is used. The argument n still refers to the size of the problem that is solved on each processor.

It doesn’t make sense to me why parallel=true would force single-threaded BLAS, but ok :slight_smile:

That was without:

julia> BLAS.set_num_threads(16)

julia> LinearAlgebra.peakflops(16000)
3.5702446000519916e11

julia> BLAS.set_num_threads(32)

julia> LinearAlgebra.peakflops(16000)
3.293593157654745e11

Me too!

Correct on the first point. On the second, is that a general rule? I didn’t realize that. I’ve heard of people getting the best results with n-1 threads, so one core is left free to keep the system responsive (mouse movement, etc.). But I forget whether n was the physical or logical core count in that case.
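If you want to experiment with that rule, here’s a minimal sketch (halving Sys.CPU_THREADS to approximate the physical-core count is my assumption; it only holds on SMT-enabled CPUs, and an exact count would need something like Hwloc.jl):

using LinearAlgebra

Sys.CPU_THREADS                             # logical threads, e.g. 32 on a 16-core SMT CPU
BLAS.set_num_threads(Sys.CPU_THREADS ÷ 2)   # one BLAS thread per physical core
LinearAlgebra.peakflops(16000)              # compare against the logical-thread count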

Just saw the additional details you both gave; that makes it much clearer.

Good point, I haven’t even looked into overclocking

Oh, I didn’t know that. Is there an easy way to check?

You can check the build:
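Something like this should do the check (a minimal sketch; on newer Julia versions BLAS.vendor() is deprecated in favor of BLAS.get_config()):

using LinearAlgebra
LinearAlgebra.BLAS.vendor()   # e.g. :openblas64 for the default build, :mkl for an MKL build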


If you don’t mind compiling Julia from source, you could also change the line for your OS.

There’s a -p option when you start Julia, is that what you mean? I didn’t think I needed this, since I set the environment variable. Am I misunderstanding this?

Ah, yes, that is what I meant. Which environment variable?
Also, docs on Julia master:

help?> LinearAlgebra.peakflops
  LinearAlgebra.peakflops(n::Integer=2000; parallel::Bool=false)

  peakflops computes the peak flop rate of the computer by using double precision gemm!. By default, if no arguments are specified, it multiplies a matrix of size n x n, where n = 2000. If the underlying BLAS is using multiple threads, higher flop rates are realized. The number of BLAS threads can be set with BLAS.set_num_threads(n).

  If the keyword argument parallel is set to true, peakflops is run in parallel on all the worker processors. The flop rate of the entire parallel computer is returned. When running in parallel, only 1 BLAS thread is used. The argument n still refers to the size of the problem that is solved on each processor.

"When running in parallel, only 1 BLAS thread is used. " I imagine the worker processes are the ones you get from -p or addprocs.

Good point, I haven’t even looked into overclocking

I just moderately increase the clock speed, boot, launch Julia and try

X = rand(10^4,10^4);
@time inv(X);
@time inv(X);
@time inv(X);
@time foreach(inv, (X for _ in 1:100));

If the overclock is bad, the computer will normally crash the instant you hit “enter” on the first @time inv(X). If it doesn’t crash, watch the temperatures (I use watch -n0.5 sensors) and the clock speed (watch -n1 "cat /proc/cpuinfo | grep MHz"): see how high the temperatures get, and make sure it’s actually running at the specified speed. Sometimes it’ll heat up and then crash or throttle, so make sure it stays at a temperature you’re comfortable with.

If things crash, you can try increasing the voltage; you’d have to look up what is safe.
Also, (obviously) better safe than sorry. It’s not worth risking the computer crashing while you’re actually using it.

I also set XMP settings for memory, so that the RAM actually runs at its advertised speeds. I haven’t bothered to overclock beyond that yet, but it’d be a similar process. Maybe throw in some large dot products / vector sums that are definitely memory-bottlenecked.
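Something like this is what I’d have in mind for a memory-bound test (the sizes are arbitrary; shrink them if RAM is tight):

using LinearAlgebra

a = rand(10^8); b = rand(10^8);   # ~800 MB each
@time dot(a, b)                   # large dot product: memory-bandwidth bound
@time a .+= b                     # big vector sum: also bandwidth bound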

Also, my single-threaded benchmarks are actually worse now. At stock settings, the boost speed changes based on the number of cores working. I was lazy and set the same speed for all working cores (inactive cores still run much slower to save power), so I don’t think a single core boosts quite as high.
But when I’m actually doing real work, all the cores are busy, so it’s not something I worry too much about. It’d just be cool to see great times when running single-threaded @benchmarks. :wink:


Found a nice trick to maximize MKL performance on AMD CPUs, based on Agner Fog’s findings: Anaconda on Windows with MKL Support.

Pay attention to the compilation trick: https://github.com/fo40225/Anaconda-Windows-AMD/blob/master/site.cfg. It just incorporates Agner Fog’s CPU dispatch.

It should be interesting to see a comparison with this trick applied.
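For anyone who wants to try without recompiling: a widely reported runtime variant of this workaround (distinct from the compile-time dispatch patch linked above, and removed in later MKL releases) is the MKL_DEBUG_CPU_TYPE environment variable. A sketch, assuming a Julia build linked against MKL:

# Set in the shell before starting Julia, since MKL reads it at load time:
#   export MKL_DEBUG_CPU_TYPE=5   # forces MKL's AVX2 code path on non-Intel CPUs
using LinearAlgebra
BLAS.vendor()                     # confirm it reports :mkl
LinearAlgebra.peakflops(16000)    # compare with and without the variable set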

I’ve got to watch this space over the next year. I’ll want to put together a machine that is super fast at running Julia, that will also serve as a high-powered gaming machine for my kids (it has to be able to run the upcoming Microsoft Flight Simulator 2020 at high frame rates, at at least 4K), as well as let me play with the GPUs from Julia.

How high is your budget? Does it include the HEDT lineups?

IIRC, you’ve done a lot of low-level / explicit SIMD for text processing?
Sounds like you’d know how much, if at all, you’d benefit from AVX-512.
Thanks to strong competition from AMD, the new Cascade Lake-X CPUs (with AVX-512) cost half as much as the previous generation. E.g., $1000 for the 18-core part, which, if overclocked, should be able to hit around 2 teraflops with LinearAlgebra.peakflops(16000), like the much older 7980XE. The 10-core part is $600.

If you won’t benefit from AVX-512, the AMD parts will probably be better. The Ryzen 3900X is 12 cores for $500, and the 3950X will be 16 for $750. I have no idea about the new Threadrippers. Relative to clock speed, these chips do better than Cascade Lake on single-core performance in most benchmarks.

Julia is good at generating AVX-512 code, but lots of software doesn’t use it, even software that could theoretically make good use of it, like OpenBLAS. Based on reviews, I don’t think games benefit. You’d know about your string libraries.
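If you want to check whether Julia actually emits AVX-512 for your own code, one way is to look for zmm (512-bit) registers in the native code; a minimal sketch:

using InteractiveUtils

function mysum(x)
    s = zero(eltype(x))
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

@code_native mysum(rand(1024))   # the hot loop uses zmm registers on AVX-512 CPUs, ymm on AVX2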

I could run benchmark scripts if it doesn’t take much work to set up.

@Royi, that would be pretty cool if someone can get that to work, especially with the new 7nm parts.


Do you mean the patch for MKL?
Well, one needs to build it from the lib files with that. It works, as you can see in the repository.