Show off Julia performance on your PC!

Here’s mine:

julia> @time peakflops(16_000)
 27.067492 seconds (16 allocations: 3.815 GiB, 1.28% gc time)
3.140591528004417e11

I also checked out AcuteBenchmark, which is awesome. I ran the first example and got this: [benchmark plot]

I’ll second what simeonschaub says. My motherboard isn’t on that list either.
The Ethernet port on the motherboard doesn’t work, so I have my Ethernet cable plugged into a dongle that goes into a USB port.

I’d try installing Linux first; I’d be a little surprised if you ran into problems.

This blog post discusses how they optimized Python libraries like scikit-learn using profile-guided optimization, and also by compiling multiple versions for different architectures.

The Julia system image already does function multi-versioning, as do the most important/sensitive libraries like BLAS or FFTW.
Maybe some of the other libraries that Julia depends on, like GMP and MPFR, would benefit from supporting more recent instruction sets.
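
Aside: if you build Julia from source, the sysimage’s multi-versioning targets are controlled by JULIA_CPU_TARGET; a minimal sketch in Make.user (the target list here is just an illustrative example, check Make.inc for what the official binaries actually use):

# Make.user: build the sysimage for several x86-64 microarchitectures.
# `clone_all` clones every sysimage function for that target; the best
# match for the host CPU is selected when Julia starts.
JULIA_CPU_TARGET = generic;haswell,clone_all;skylake-avx512,clone_all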

For the most part, it probably doesn’t make that big a difference because the libraries where the authors put a lot of effort into making sure the code takes advantage of SIMD (like OpenBLAS/MKL and FFTW) will already use multi-versioning by default.

But in some cases, like Clear Linux’s python, it does make a big difference.

It’s an AMD GPU (Vega 64). You can make out the “R” and “EON” parts of “RADEON”.

I’d heard Clear Linux recently made it easier to install proprietary software like Chrome, but I haven’t looked into that yet. They have documentation describing how to install NVIDIA’s drivers.

Ha, thanks! Moving it is impractical.

I’m a little surprised. This is with OpenBLAS 0.3.9, which seems to finally have good AVX-512 support.
It actually beat MKL in my single-threaded benchmarks (above roughly 200x200), and looked like it was getting close in multi-threaded by 10,000 x 10,000.

You could try something like this in both MATLAB and Julia

M = K = N = 16_000
A = rand(M, K); B = rand(K, N); # perhaps precompile with smaller sizes, like `M = K = N = 500`, first.
t = @elapsed A * B;
flops = 2M * K * N / t # a dense matmul costs 2*M*K*N floating-point operations

I’ll leave the MATLAB version to you.
You could file an issue with MKL.jl.

That’s probably what you can expect.

I confess most of what it’s done is just run lots of benchmarks in parallel (I’ve set single and all-core clock speeds to be equal / disabled all boosting, so speed on a given core should be mostly independent of what any of the other cores are doing [barring things like contention over L3 cache]).


Is it necessary to build from source or is it possible to simply install the generic Linux binary? This is uncharted territory for a long-time Windows user :grimacing:

For what it’s worth, here are my numbers for a 3950X slightly overclocked to 4.2 GHz all-core, running with 3600 MHz CL14 memory.

using LinearAlgebra #openblas
BLAS.set_num_threads(1)
peakflops(16_000)

4.7525320641043976e10

BLAS.set_num_threads(32)
peakflops(16_000)

7.247416073580671e11


Official Linux binaries are fine.
Are you familiar with the command line?
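
If so, it’s just downloading and unpacking a tarball; roughly like this (version and URL from memory — grab the current link from the julialang.org downloads page):

wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4.1-linux-x86_64.tar.gz
tar -xzf julia-1.4.1-linux-x86_64.tar.gz # unpacks into ./julia-1.4.1
./julia-1.4.1/bin/julia                  # run it directly, or add the bin/ directory to your PATH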

@daniel
The 3950X can’t achieve 369 GFLOPS single-threaded.
Can you double-check that number / that it really corresponds to a single thread?
If the number were correct, that’d imply something like Strassen’s algorithm.

I’m not sure why, but on my computer

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

The default is to start with 8 threads. Maybe your first run was with 8 threads?
Your second run was probably with 16, since OpenBLAS builds with Julia allow up to 16 threads by default.

I edited my blas.mk when building Julia to increase the limit to 18 (the number of physical cores on this machine):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(18)

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
18

julia> BLAS.set_num_threads(36)

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
18

julia> Sys.CPU_THREADS
36

julia> BLAS.set_num_threads(1)

julia> BLAS.vendor()
:openblas64

julia> @time peakflops(16_000)
 68.089648 seconds (12 allocations: 3.815 GiB, 0.09% gc time)
1.208110293882327e11
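
For reference, the knob I changed is OpenBLAS’s NUM_THREADS build option; from memory, the edit in deps/blas.mk looks roughly like this (variable names may differ between Julia versions, so treat this as a sketch):

# deps/blas.mk: raise OpenBLAS's compiled-in thread cap from 16
OPENBLAS_BUILD_OPTS += NUM_THREADS=18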

Are you familiar with the command line?

I have very limited experience with Linux commands (limited to fooling around with a Raspberry Pi and then Ubuntu years ago on an old laptop).

I’m pretty sure I have an unused external hard drive lying around somewhere, so this weekend I think I’ll try to get a dual-boot setup going and see if I can get everything set up/installed the way I like it.

Oops, you’re totally right. I’d already wondered about the poor scaling. I’ve edited the post above.
The second run does seem to have been with 32 threads; at least

ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())

shows 32, even with the precompiled Julia binary.

You’d better set the number of BLAS threads to the number of physical cores, not the number of hardware threads with SMT enabled.
That’s what we saw in the past (though most of that information came from Intel CPUs, so it might be different here).
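
A quick way to do that from Julia (a sketch that assumes 2-way SMT, which is what these chips have):

using LinearAlgebra
physical_cores = Sys.CPU_THREADS ÷ 2 # Sys.CPU_THREADS counts hardware threads, so halve it under 2-way SMT
BLAS.set_num_threads(physical_cores)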

blas.mk sets 16 as the upper limit, as does OpenBLASBuilder. I don’t know how the Julia binaries get built, or where the build script for their linked OpenBLAS lives, but it seems to be neither of those.

I also thought the same as Royi, which is why I set the maximum possible number of threads to 18 instead of 36.

Yeah, I was surprised by this as well. There seems to be a slight speedup going from 16 to 32 threads, which is consistent between runs, at least when calling peakflops.

julia> BLAS.set_num_threads(16)

julia> peakflops(16_000)
7.12436321023842e11

julia> BLAS.set_num_threads(32)

julia> peakflops(16_000)
7.285140776841587e11

and htop shows 32 running threads.


This is interesting.
It is known that AMD’s SMT implementation is more efficient than Intel’s.
So your results could be an example of that.

Another option might be that the OpenBLAS kernel doesn’t squeeze all the resources available out of a physical core with a single thread, hence you get some extra performance from SMT. That might suggest OpenBLAS could have more optimized kernels for Ryzen (based on the Zen 2 core). Just a guess…


I don’t want to hijack this thread, but does one of you know of a guide to squeezing all the performance out of Julia during install/compilation?
For example, do you compile your own OpenBLAS, or do you let Julia handle this?

OpenBLAS dynamically picks specialized kernels for the runtime architecture, so there isn’t an advantage to compiling it locally.
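
You can check this from the REPL; with the OpenBLAS bundled with Julia (as of 1.4 or so), something like:

using LinearAlgebra
# DYNAMIC_ARCH in the returned config string means kernels are selected at
# runtime, and the last word names the kernel family detected for your CPU.
BLAS.openblas_get_config()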


This thread got a bit off-topic, but I figured I would try the tests in this comment. I have an MSI laptop that is a few years old now:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)    
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\peter\AppData\Local\Programs\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS = 6
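
For anyone following along, sequential_add! and parallel_add! are the standard example from the Julia manual’s multithreading section; roughly (my transcription):

using BenchmarkTools # provides the @btime macro used below

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

x = rand(Float32, N); y = rand(Float32, N) # with N set as below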

with N = 10^8

julia>  @btime sequential_add!($y, $x)
  13.872 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)   
  4.312 s (33 allocations: 4.98 KiB)

with N = 2^27

julia>  @btime sequential_add!($y, $x)
  21.121 s (0 allocations: 0 bytes)   

julia> @btime parallel_add!($y, $x)   
  6.310 s (34 allocations: 5.00 KiB)
