Show off Julia performance on your PC!

Here’s mine:

julia> @time peakflops(16_000)
 27.067492 seconds (16 allocations: 3.815 GiB, 1.28% gc time)
3.140591528004417e11

I also checked out AcuteBenchmark, which is awesome. I ran the first example and got this: [benchmark plot]

I’ll second what simeonschaub says. My motherboard isn’t on that list either.
The Ethernet port on the motherboard doesn’t work, so I have my Ethernet cable plugged into a dongle that goes into a USB port.

I’d try installing Linux first; I’d be a little surprised if you ran into problems.

This blog post discusses how they optimized Python libraries like scikit-learn using profile-guided optimization, and also by compiling multiple versions for different architectures.

The Julia system image already does function multi-versioning, as do the most important/sensitive libraries like BLAS or FFTW.
Maybe some of the other libraries that Julia depends on, like GMP and MPFR, would benefit from supporting more recent instruction sets.
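
Aside: if you build Julia from source, the sysimage’s multi-versioning targets are controlled by JULIA_CPU_TARGET; a minimal sketch in Make.user (the target list here is just an illustrative example, check Make.inc for what the official binaries actually use):

# Make.user: build the sysimage for several x86-64 microarchitectures.
# `clone_all` clones every sysimage function for that target; the best
# match for the host CPU is selected when Julia starts.
JULIA_CPU_TARGET = generic;haswell,clone_all;skylake-avx512,clone_all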

For the most part, it probably doesn’t make that big a difference because the libraries where the authors put a lot of effort into making sure the code takes advantage of SIMD (like OpenBLAS/MKL and FFTW) will already use multi-versioning by default.

But in some cases, like Clear Linux’s python, it does make a big difference.

It’s an AMD GPU (Vega 64). You can make out the “R” and “EON” parts of “RADEON”.

I’d heard Clear Linux recently made it easier to install proprietary software like Chrome, but I haven’t looked into that yet. They have documentation describing how to install NVIDIA’s drivers.

Ha, thanks! Moving it is impractical.

I’m a little surprised. This is with OpenBLAS 0.3.9, which seems to finally have good AVX-512 support.
It actually beat MKL in my single-threaded benchmarks (above roughly 200x200), and looked like it was getting close in multi-threaded by 10,000 x 10,000.

You could try something like this in both MATLAB and Julia

M = K = N = 16_000
A = rand(M, K); B = rand(K, N); # perhaps precompile with smaller sizes, like `M = K = N = 500`, first.
t = @elapsed A * B;
flops = 2M * K * N / t # a dense matmul costs 2*M*K*N floating-point operations

I’ll leave the MATLAB version to you.
You could file an issue with MKL.jl.

That’s probably what you can expect.

I confess most of what it’s done is just run lots of benchmarks in parallel (I’ve set single and all-core clock speeds to be equal / disabled all boosting, so speed on a given core should be mostly independent of what any of the other cores are doing [barring things like contention over L3 cache]).


Is it necessary to build from source or is it possible to simply install the generic Linux binary? This is uncharted territory for a long-time Windows user :grimacing:

For what it’s worth, here are my numbers for a 3950X slightly overclocked to 4.2 GHz all-core, running with 3600 MHz CL14 memory.

using LinearAlgebra #openblas
BLAS.set_num_threads(1)
peakflops(16_000)

4.7525320641043976e10

BLAS.set_num_threads(32)
peakflops(16_000)

7.247416073580671e11


Official Linux binaries are fine.
Are you familiar with the command line?
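
If so, it’s just downloading and unpacking a tarball; roughly like this (version and URL from memory — grab the current link from the julialang.org downloads page):

wget https://julialang-s3.julialang.org/bin/linux/x64/1.4/julia-1.4.1-linux-x86_64.tar.gz
tar -xzf julia-1.4.1-linux-x86_64.tar.gz # unpacks into ./julia-1.4.1
./julia-1.4.1/bin/julia                  # run it directly, or add the bin/ directory to your PATH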

@daniel
The 3950X can’t achieve 369 GFLOPS single-threaded.
Can you double-check that number / that it really corresponds to a single thread?
If the number were correct, that’d imply something like Strassen’s algorithm.

I’m not sure why, but on my computer

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
8

The default is to start with 8 threads. Maybe your first run was with 8 threads?
Your second run was probably with 16, since OpenBLAS builds with Julia allow up to 16 threads by default.

I edited my blas.mk when building Julia to increase the limit to 18 (the number of physical cores on this machine):

julia> using LinearAlgebra

julia> BLAS.set_num_threads(18)

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
18

julia> BLAS.set_num_threads(36)

julia> ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
18

julia> Sys.CPU_THREADS
36

julia> BLAS.set_num_threads(1)

julia> BLAS.vendor()
:openblas64

julia> @time peakflops(16_000)
 68.089648 seconds (12 allocations: 3.815 GiB, 0.09% gc time)
1.208110293882327e11
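
For reference, the knob I changed is OpenBLAS’s NUM_THREADS build option; from memory, the edit in deps/blas.mk looks roughly like this (variable names may differ between Julia versions, so treat this as a sketch):

# deps/blas.mk: raise OpenBLAS's compiled-in thread cap from 16
OPENBLAS_BUILD_OPTS += NUM_THREADS=18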

Are you familiar with the command line?

I have very limited experience with Linux commands (limited to fooling around with a Raspberry Pi and then Ubuntu years ago on an old laptop).

I’m pretty sure I have an unused external hard drive lying around somewhere, so this weekend I think I’ll try to get a dual-boot setup going and see if I can get everything set up/installed the way I like it.

Oops, you’re totally right. I’d already wondered about the poor scaling. I’ve edited the post above.
The second run does seem to have been with 32 threads; at least

ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())

shows 32, even with the precompiled Julia binary.

You’d better set the number of BLAS threads to the number of physical cores, not the number of hardware threads with SMT enabled.
That’s what we saw in the past (though most of that information came from Intel CPUs, so it might be different here).
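
A quick way to do that from Julia (a sketch that assumes 2-way SMT, which is what these chips have):

using LinearAlgebra
physical_cores = Sys.CPU_THREADS ÷ 2 # Sys.CPU_THREADS counts hardware threads, so halve it under 2-way SMT
BLAS.set_num_threads(physical_cores)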

blas.mk sets 16 as the upper limit, as does OpenBLASBuilder. I don’t know how the Julia binaries get built, or where the build script for their linked OpenBLAS lives, but it seems to be neither of those.

I also thought the same as Royi, which is why I set the maximum possible number of threads to 18 instead of 36.

Yeah, I was surprised by this as well. There seems to be a slight speedup going from 16 to 32 threads, which is consistent between runs, at least when calling peakflops.

julia> BLAS.set_num_threads(16)

julia> peakflops(16_000)
7.12436321023842e11

julia> BLAS.set_num_threads(32)

julia> peakflops(16_000)
7.285140776841587e11

and htop shows 32 running threads.


This is interesting.
It is known that AMD’s SMT implementation is more efficient than Intel’s.
So your results could be an example of that.

Another option might be that the OpenBLAS kernel doesn’t squeeze all the resources available out of a physical core with a single thread, hence you get some extra performance from SMT. That might suggest OpenBLAS could have more optimized kernels for Ryzen (based on the Zen 2 core). Just a guess…


I don’t want to hijack this thread, but does one of you know of a guide to squeezing all the performance out of Julia during install/compilation?
For example, do you compile your own OpenBLAS, or do you let Julia handle this?

OpenBLAS dynamically picks specialized kernels for the runtime architecture, so there isn’t an advantage to compiling it locally.
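
You can check this from the REPL; with the OpenBLAS bundled with Julia (as of 1.4 or so), something like:

using LinearAlgebra
# DYNAMIC_ARCH in the returned config string means kernels are selected at
# runtime, and the last word names the kernel family detected for your CPU.
BLAS.openblas_get_config()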


This thread got a bit off-topic, but I figured I would try the tests in this comment. I have an MSI laptop that is a few years old now:

julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)    
  CPU: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = "C:\Users\peter\AppData\Local\Programs\Microsoft VS Code\Code.exe"
  JULIA_NUM_THREADS = 6
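
For anyone following along, sequential_add! and parallel_add! are the standard example from the Julia manual’s multithreading section; roughly (my transcription):

using BenchmarkTools # provides the @btime macro used below

function sequential_add!(y, x)
    for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

function parallel_add!(y, x)
    Threads.@threads for i in eachindex(y, x)
        @inbounds y[i] += x[i]
    end
    return nothing
end

x = rand(Float32, N); y = rand(Float32, N) # with N set as below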

with N = 10^8

julia>  @btime sequential_add!($y, $x)
  13.872 s (0 allocations: 0 bytes)

julia> @btime parallel_add!($y, $x)   
  4.312 s (33 allocations: 4.98 KiB)

with N = 2^27

julia>  @btime sequential_add!($y, $x)
  21.121 s (0 allocations: 0 bytes)   

julia> @btime parallel_add!($y, $x)   
  6.310 s (34 allocations: 5.00 KiB)
